Commit 7f8612d

Authored by drappier-charles, cploujoux, and devin-ai-integration[bot]
Add keepAlive flag to prevent scale-to-zero during process execution (#176)
* Add lifecycle management for sandbox processes

  Introduces new lifecycle features to the sandbox API, including the ability to force stop processes and retrieve the current sandbox status. The `/stop` endpoint allows for immediate or scheduled removal of the keepAlive flag from running processes, enabling auto-hibernation. The `/status` endpoint provides information on the current state of the sandbox and active keepAlive processes. Additionally, integrates the lifecycle management with the MCP tools, enhancing the overall process control and monitoring capabilities. The scale-to-zero functionality is also improved with crash recovery mechanisms.

  Updates include:
  - New `LifecycleHandler` for managing lifecycle operations.
  - API documentation updates for new endpoints.
  - Integration of keepAlive functionality in process management.
  - Comprehensive tests for lifecycle features and MCP integration.

* Enhance scale-to-zero functionality with improved error handling and logging

* Remove lifecycle management endpoints and related functionality

  This commit removes the `/stop` and `/status` endpoints from the sandbox API, along with the associated `LifecycleHandler` and related data structures. The lifecycle management features, including the ability to force stop processes and retrieve the current sandbox status, have been deprecated.

  Updates include:
  - Deletion of lifecycle-related API routes and handlers.
  - Removal of lifecycle management types and structures from the codebase.
  - Adjustments to documentation to reflect the removal of these features.

  This change simplifies the API and focuses on core process management functionalities.

* Fix PR

* Fix bug introduced with AI check

* Implement keepAlive timeout handling for restarted processes

  This commit introduces functionality to manage timeouts for processes with the keepAlive flag enabled. If a process is restarted and keepAlive is active with a specified timeout, a goroutine is initiated to monitor the timeout and kill the process if it exceeds the limit. For processes with an infinite timeout, the goroutine simply waits for the process to complete. This enhancement improves process management and ensures better resource handling.

* Fix some recommendations from AI

* Enhance logging in ProcessManager to include process name in scale-to-zero warnings and timeout messages. Clear KeepAlive state before killing processes to prevent double ScaleEnable calls. Refactor related log messages for consistency.

* Refactor logging in ProcessManager to use structured log entries for KeepAlive events. Enhance clarity by including process details in log messages for scale-to-zero operations and timeout handling. This improves consistency and debuggability of process management logs.

* Sanitize user-provided values in log entries to prevent log injection (CWE-117)

  Add sanitizeLogValue() helper that escapes newlines and control characters. Replace structured logging (logrus.WithFields) with simple logrus.Infof/Warnf calls that use sanitized values - clearer and more readable. Addresses all CodeQL 'Log entries created from user input' warnings.

  Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Improve sanitizeLogValue to strip all control characters (CWE-117)

  Replace strings.NewReplacer with byte-level filtering that strips all control characters (< 0x20) including newlines. This provides more thorough sanitization against log injection.

  Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Use strings.NewReplacer as CodeQL-recognized sanitizer for log injection (CWE-117)

  Replace custom byte-level loop with package-level strings.NewReplacer variable (logSanitizer). CodeQL explicitly recognizes strings.Replacer.Replace as a sanitizer for go/log-injection since github/codeql#11910. Call logSanitizer.Replace() directly at each log site.

  Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Use %q format directive for user-provided values in log entries (CWE-117)

  CodeQL's SafeFormatArgumentSanitizer explicitly recognizes %q as safe because it escapes newline characters. This is the simplest and most idiomatic fix - no helper functions or variables needed. Removes the logSanitizer variable entirely. User-provided values (name, command) are now logged with %q which produces Go-syntax quoted strings, making any control characters visible in the output.

  Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Revert sanitize/log changes - restore original structured logging

  Reverts commits 6902a62, 16773fb, 4eeb6f7, 3a5b776 which added log sanitization for CWE-117. Keeps all other agent recommendations (structured logging with logrus.WithFields, race condition fix, etc).

  Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Fix StopProcess keepAlive leak and KeepAlive data race

  - StopProcess (SIGTERM path) now clears KeepAlive and calls ScaleEnable, preventing the scale-to-zero counter from leaking when a keepAlive process is gracefully stopped.
  - All reads/writes of process.KeepAlive are now synchronized via pm.mu: KillProcess and StopProcess write under Lock(), completion goroutines read under RLock(). This eliminates the data race between the kill/stop path and the completion goroutine.

  Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Fix data race on oldProcess.KeepAlive read in restartProcess

  Protect the read of oldProcess.KeepAlive and oldProcess.Timeout in restartProcess with pm.mu.RLock(), matching the synchronization used by KillProcess/StopProcess which write under pm.mu.Lock().

  Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

---------

Co-authored-by: cploujoux <ch.ploujoux@gmail.com>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
1 parent 0ac40b9 commit 7f8612d

10 files changed: +538 -30 lines changed

sandbox-api/docs/LIFECYCLE.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# Process KeepAlive
+
+This document describes the process `keepAlive` feature that controls the sandbox's auto-hibernation behavior.
+
+## Overview
+
+When launching a process with `keepAlive: true`, the sandbox will stay awake (auto-hibernation disabled) until the process completes or times out.
+
+## How It Works
+
+The sandbox uses a **counter-based scale-to-zero** system:
+
+| Counter | Effect |
+|---------|--------|
+| > 0 | Auto-hibernation disabled (sandbox stays awake) |
+| = 0 | Auto-hibernation enabled (sandbox can sleep) |
+
+When a `keepAlive` process starts, the counter is incremented. When it ends, the counter is decremented.
+
+## API Usage
+
+**POST /process**
+
+```json
+{
+  "command": "npm run dev",
+  "workingDir": "/app",
+  "keepAlive": true,
+  "timeout": 600
+}
+```
+
+## Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `keepAlive` | boolean | `false` | When true, disables auto-hibernation while the process runs |
+| `timeout` | integer | `600` | Timeout in seconds. Set to `0` for infinite (no timeout) |
+
+## Timeout Behavior
+
+- **timeout > 0**: Process will be automatically killed after the specified seconds
+- **timeout = 0**: Process runs indefinitely (infinite timeout)
+- **timeout not specified with keepAlive=true**: Defaults to 600 seconds (10 minutes)
+
+## Logging
+
+All keepAlive events are logged:
+
+```
+[KeepAlive] Started process 12345 (name: my-app, command: npm run dev) with timeout 600s
+[KeepAlive] Stopped process 12345 (name: my-app, status: completed, exit_code: 0)
+```
+
+Scale operations:
+
+```
+[Scale] Disabled scale-to-zero (wrote '+', counter now: 1) - sandbox staying AWAKE
+[Scale] Enabled scale-to-zero (wrote '-', counter now: 0) - sandbox can AUTO-HIBERNATE
+```
+
+## Crash Recovery
+
+On startup, the sandbox-api resets the scale-to-zero counter to 0, ensuring the sandbox doesn't get stuck awake if the API crashed while keepAlive processes were running.
+
+## Examples
+
+### Run a dev server that keeps sandbox awake
+
+```bash
+curl -X POST http://localhost:8080/process \
+  -H "Content-Type: application/json" \
+  -d '{"command": "npm run dev", "workingDir": "/app", "keepAlive": true}'
+```
+
+### Run a process with infinite timeout
+
+```bash
+curl -X POST http://localhost:8080/process \
+  -H "Content-Type: application/json" \
+  -d '{"command": "python server.py", "keepAlive": true, "timeout": 0}'
+```
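
The counter-based scheme that LIFECYCLE.md describes (increment on keepAlive start, decrement on end, reset on startup) can be modelled in a few lines. This is a sketch of the arithmetic only, under assumptions: the `scaleCounter` type and its methods are hypothetical, the counter lives in memory here, and clamping at zero is a choice made for the sketch. The log samples in the document suggest the real implementation writes `+` and `-` markers somewhere, which this sketch does not attempt to reproduce.

```go
package main

import (
	"fmt"
	"sync"
)

// scaleCounter is a hypothetical in-memory model of the scale-to-zero counter.
type scaleCounter struct {
	mu sync.Mutex
	n  int
}

// Disable records one more running keepAlive process and returns the new count.
// While the count is above zero, auto-hibernation is considered disabled.
func (c *scaleCounter) Disable() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
	return c.n
}

// Enable records that a keepAlive process ended and returns the new count.
// The count is clamped at zero (an assumption of this sketch).
func (c *scaleCounter) Enable() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.n > 0 {
		c.n--
	}
	return c.n
}

// Reset forces the count back to zero, as done on startup for crash recovery.
func (c *scaleCounter) Reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n = 0
}

// CanHibernate reports whether auto-hibernation is currently allowed.
func (c *scaleCounter) CanHibernate() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n == 0
}

func main() {
	c := &scaleCounter{}
	c.Disable() // first keepAlive process starts
	c.Disable() // second keepAlive process starts
	c.Enable()  // first process ends
	fmt.Println("can hibernate with one process left:", c.CanHibernate())
	c.Enable() // second process ends
	fmt.Println("can hibernate with none left:", c.CanHibernate())

	c.Disable() // a process starts, then the API "crashes"
	c.Reset()   // the startup reset clears the stale count
	fmt.Println("can hibernate after reset:", c.CanHibernate())
}
```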

sandbox-api/docs/docs.go

Lines changed: 11 additions & 0 deletions
@@ -2185,6 +2185,11 @@ const docTemplate = `{
                         "{\"PORT\"": " \"3000\"}"
                     }
                 },
+                "keepAlive": {
+                    "description": "Disable scale-to-zero while process runs. Default timeout is 600s (10 minutes). Set timeout to 0 for infinite.",
+                    "type": "boolean",
+                    "example": false
+                },
                 "maxRestarts": {
                     "type": "integer",
                     "example": 3
@@ -2198,6 +2203,7 @@ const docTemplate = `{
                     "example": true
                 },
                 "timeout": {
+                    "description": "Timeout in seconds. When keepAlive is true, defaults to 600s (10 minutes). Set to 0 for infinite (no auto-kill).",
                     "type": "integer",
                     "example": 30
                 },
@@ -2249,6 +2255,11 @@ const docTemplate = `{
                     "type": "integer",
                     "example": 0
                 },
+                "keepAlive": {
+                    "description": "Whether scale-to-zero is disabled for this process",
+                    "type": "boolean",
+                    "example": false
+                },
                 "logs": {
                     "type": "string",
                     "example": "logs output"

sandbox-api/docs/openapi.yml

Lines changed: 9 additions & 0 deletions
@@ -1772,6 +1772,10 @@ components:
           example:
             "{\"PORT\"": ' "3000"}'
           type: object
+        keepAlive:
+          description: Disable scale-to-zero while process runs. Default timeout is 600s (10 minutes). Set timeout to 0 for infinite.
+          example: false
+          type: boolean
         maxRestarts:
           example: 3
           type: integer
@@ -1782,6 +1786,7 @@ components:
           example: true
           type: boolean
         timeout:
+          description: Timeout in seconds. When keepAlive is true, defaults to 600s (10 minutes). Set to 0 for infinite (no auto-kill).
           example: 30
           type: integer
         waitForCompletion:
@@ -1811,6 +1816,10 @@ components:
         exitCode:
           example: 0
           type: integer
+        keepAlive:
+          description: Whether scale-to-zero is disabled for this process
+          example: false
+          type: boolean
         logs:
           example: logs output
           type: string
type: string

sandbox-api/main.go

Lines changed: 8 additions & 0 deletions
@@ -15,6 +15,7 @@ import (
 	"github.com/blaxel-ai/sandbox-api/docs" // swagger generated docs
 	"github.com/blaxel-ai/sandbox-api/src/api"
 	"github.com/blaxel-ai/sandbox-api/src/handler/process"
+	"github.com/blaxel-ai/sandbox-api/src/lib/blaxel"
 	"github.com/blaxel-ai/sandbox-api/src/lib/networking"
 	"github.com/blaxel-ai/sandbox-api/src/mcp"
 	"github.com/gin-gonic/gin"
@@ -45,6 +46,13 @@ func main() {
 		logrus.WithError(err).Warn("WireGuard initialization failed - the sandbox will NOT have outbound internet connectivity (no egress). Inbound connections to the sandbox will still work. You can check the tunnel status via the /network/tunnel endpoints.")
 	}
 
+	// Reset scale-to-zero counter on startup (crash recovery)
+	// If sandbox-api crashed while keepAlive processes were running,
+	// the counter would be left in a bad state - this resets it to 0
+	if err := blaxel.ScaleReset(); err != nil {
+		logrus.Warnf("Failed to reset scale-to-zero counter on startup: %v", err)
+	}
+
 	ctx, cancel := context.WithCancel(context.Background())
 	defer cancel()
 

sandbox-api/src/handler/process.go

Lines changed: 30 additions & 5 deletions
@@ -52,10 +52,11 @@ type ProcessRequest struct {
 	WorkingDir        string            `json:"workingDir" example:"/home/user"`
 	Env               map[string]string `json:"env" example:"{\"PORT\": \"3000\"}"`
 	WaitForCompletion bool              `json:"waitForCompletion" example:"false"`
-	Timeout           int               `json:"timeout" example:"30"`
+	Timeout           *int              `json:"timeout,omitempty" example:"30"`
 	WaitForPorts      []int             `json:"waitForPorts" example:"3000,8080"`
 	RestartOnFailure  bool              `json:"restartOnFailure" example:"true"`
 	MaxRestarts       int               `json:"maxRestarts" example:"3"`
+	KeepAlive         bool              `json:"keepAlive" example:"false"`
 } // @name ProcessRequest
 
 // ProcessResponse is the response body for a process
@@ -74,6 +75,7 @@ type ProcessResponse struct {
 	RestartOnFailure bool `json:"restartOnFailure" example:"true"`
 	MaxRestarts      int  `json:"maxRestarts" example:"3"`
 	RestartCount     int  `json:"restartCount" example:"2"`
+	KeepAlive        bool `json:"keepAlive" example:"false"`
 } // @name ProcessResponse
 
 type ProcessResponseWithLogs struct {
@@ -87,8 +89,8 @@ type ProcessKillRequest struct {
 } // @name ProcessKillRequest
 
 // ExecuteProcess executes a process
-func (h *ProcessHandler) ExecuteProcess(command string, workingDir string, name string, env map[string]string, waitForCompletion bool, timeout int, waitForPorts []int, restartOnFailure bool, maxRestarts int) (ProcessResponse, error) {
-	processInfo, err := h.processManager.ExecuteProcess(command, workingDir, name, env, waitForCompletion, timeout, waitForPorts, restartOnFailure, maxRestarts)
+func (h *ProcessHandler) ExecuteProcess(command string, workingDir string, name string, env map[string]string, waitForCompletion bool, timeout int, waitForPorts []int, restartOnFailure bool, maxRestarts int, keepAlive bool) (ProcessResponse, error) {
+	processInfo, err := h.processManager.ExecuteProcess(command, workingDir, name, env, waitForCompletion, timeout, waitForPorts, restartOnFailure, maxRestarts, keepAlive)
 
 	// If processInfo is nil (process failed to start), return empty response with error
 	if processInfo == nil {
@@ -117,6 +119,7 @@ func (h *ProcessHandler) ExecuteProcess(command string, workingDir string, name
 		RestartOnFailure: processInfo.RestartOnFailure,
 		MaxRestarts:      processInfo.MaxRestarts,
 		RestartCount:     processInfo.RestartCount,
+		KeepAlive:        processInfo.KeepAlive,
 	}, err
 }
 
@@ -158,6 +161,7 @@ func (h *ProcessHandler) ListProcesses() []ProcessResponse {
 			RestartOnFailure: p.RestartOnFailure,
 			MaxRestarts:      p.MaxRestarts,
 			RestartCount:     p.RestartCount,
+			KeepAlive:        p.KeepAlive,
 		})
 	}
 	return result
@@ -202,6 +206,7 @@ func (h *ProcessHandler) GetProcess(identifier string) (ProcessResponse, error)
 		RestartOnFailure: processInfo.RestartOnFailure,
 		MaxRestarts:      processInfo.MaxRestarts,
 		RestartCount:     processInfo.RestartCount,
+		KeepAlive:        processInfo.KeepAlive,
 	}, nil
 }
 
@@ -293,8 +298,18 @@ func (h *ProcessHandler) HandleExecuteCommand(c *gin.Context) {
 		"working-dir": req.WorkingDir,
 	})
 
+	// Timeout of 0 means infinite (no auto-kill)
+	// When keepAlive is true and timeout is not specified (nil) or negative, default to 600s
+	timeout := 0
+	if req.Timeout != nil {
+		timeout = *req.Timeout
+	}
+	if req.KeepAlive && (req.Timeout == nil || timeout < 0) {
+		timeout = 600 // Default 10 minutes
+	}
+
 	// Execute the process
-	processInfo, err := h.ExecuteProcess(req.Command, req.WorkingDir, req.Name, req.Env, req.WaitForCompletion, req.Timeout, req.WaitForPorts, req.RestartOnFailure, req.MaxRestarts)
+	processInfo, err := h.ExecuteProcess(req.Command, req.WorkingDir, req.Name, req.Env, req.WaitForCompletion, timeout, req.WaitForPorts, req.RestartOnFailure, req.MaxRestarts, req.KeepAlive)
 	if err != nil {
 		h.SendError(c, http.StatusUnprocessableEntity, err)
 		return
@@ -335,6 +350,16 @@ func (h *ProcessHandler) handleExecuteCommandStream(c *gin.Context) {
 		"working-dir": req.WorkingDir,
 	})
 
+	// Timeout of 0 means infinite (no auto-kill)
+	// When keepAlive is true and timeout is not specified (nil) or negative, default to 600s
+	timeout := 0
+	if req.Timeout != nil {
+		timeout = *req.Timeout
+	}
+	if req.KeepAlive && (req.Timeout == nil || timeout < 0) {
+		timeout = 600 // Default 10 minutes
+	}
+
 	// Set headers for streaming JSON events
 	c.Writer.Header().Set("Content-Type", "application/x-ndjson")
 	c.Writer.Header().Set("Cache-Control", "no-cache")
@@ -346,7 +371,7 @@ func (h *ProcessHandler) handleExecuteCommandStream(c *gin.Context) {
 	jw := &JSONStreamWriter{gin: c}
 
 	// Execute the process without waiting for completion (we'll handle waiting ourselves)
-	processInfo, err := h.ExecuteProcess(req.Command, req.WorkingDir, req.Name, req.Env, false, req.Timeout, req.WaitForPorts, req.RestartOnFailure, req.MaxRestarts)
+	processInfo, err := h.ExecuteProcess(req.Command, req.WorkingDir, req.Name, req.Env, false, timeout, req.WaitForPorts, req.RestartOnFailure, req.MaxRestarts, req.KeepAlive)
 	if err != nil {
 		jw.WriteEvent("error", err.Error())
 		return
