Place inner container cgroups under the pod's cgroup tree #169
jamestoyer wants to merge 2 commits into coder:main from …
Conversation
Hey there, I've tagged another coder engineer to do the approval, but I did read through and had an agent help review.

Overview: This PR wraps … The approach is sound — it mirrors moby's canonical …

CI Status: …

Issues Found:
1. CRITICAL: …
I believe this issue with …
```go
// wrapDockerdCmd wraps the dockerd invocation with `unshare --cgroup` +
// cgroup2 remount + cgroupv2 delegation setup. This creates a new cgroup
// namespace for dockerd and moves existing processes into a sibling `/init`
// cgroup so the root cgroup can enable subtree_control. Without this, the
// cgroupv2 "no internal processes" rule prevents Docker from creating child
// cgroups with processes under a cgroup that already has processes.
//
// Note: the `umount /sys/fs/cgroup && mount -t cgroup2 ...` leaks back into
// envbox's mount namespace because we don't use `--mount`. Using `--mount`
// would break sysbox-fs propagation (its per-container mounts under
// /var/lib/sysboxfs/ stop being visible to sysbox-runc in the dockerd NS).
// The side effect of the leak is a non-fatal "Unable to read CPU quota"
// warning when envbox later tries to read /sys/fs/cgroup/<host-path>/cpu.max
// — that path no longer exists under the remounted view. The inner
// container's CPU limits still flow through the parent pod's cgroup hierarchy,
// but cgroup-aware tools inside the workspace may see the host CPU count.
//
// The cgroup delegation logic mirrors moby's `hack/dind` wrapper:
// https://github.com/moby/moby/blob/master/hack/dind
//
// After this setup, inner container cgroups are placed under the envbox
// container's cgroup on the host. Tetragon's cgtracker can then associate
// these child cgroups with the parent pod for attribution.
//
// See also: https://github.com/moby/moby/issues/45378#issuecomment-2886261231
```
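The shape of the wrapped invocation that this comment describes can be sketched as a small Go helper. This is a minimal, hypothetical sketch — the function name, the preamble script, and the argv layout are illustrative only; the real helper is `wrapDockerdCmd` in `cli/docker.go`:

```go
package main

import "fmt"

// wrapDockerdArgv prefixes the dockerd argv with `unshare --cgroup` and a
// shell preamble, then relies on the preamble exec'ing into "$0" "$@" so
// dockerd replaces the shell. Hypothetical sketch, not the production code.
func wrapDockerdArgv(setupScript string, dockerdArgs []string) (string, []string) {
	// Final command line: unshare --cgroup sh -c '<setup>' dockerd <args...>
	// With sh -c, "dockerd" becomes $0 and the remaining args become $@.
	args := []string{"--cgroup", "sh", "-c", setupScript, "dockerd"}
	args = append(args, dockerdArgs...)
	return "unshare", args
}

func main() {
	cmd, args := wrapDockerdArgv(`exec "$0" "$@"`, []string{"--debug"})
	fmt.Println(cmd, args)
}
```

Because the preamble ends in `exec "$0" "$@"`, dockerd keeps the PID of the wrapper shell, which matters for process supervision.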
- Can we permalink directly to the relevant portion? https://github.com/moby/moby/blob/ee3e21b70457f90d537640613d59340db1f0178c/hack/dind#L61-L79
- I'd prefer if we kept the verbatim comments from upstream, if possible.
```shell
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup
```
I'd prefer if we didn't do this on a cgroupv1 system. Would it make more sense to do it after the check on L916?
Ok, I think I've addressed the PR feedback (well, I say me, I mean Claude). I have tested this out on our version of Coder and the changes are still producing the expected outcomes, which is good. @f0ssel It doesn't look like …
You are right, I was thinking this was coder/coder. I think we need to patch main here like we did there to fix this issue. CC @jdomeracki-coder
johnstcn left a comment
Nice refactor! Some review comments below.
Also noting that the code comments are overly verbose and could be trimmed down a good bit. Claude is generally pretty good at this when you tell it to.
```go
func (d *Process) SetBinName(name string) {
	d.mu.Lock()
	defer d.mu.Unlock()

	d.binName = name
}
```
After every restart, callers need to remember to set this otherwise it gets lost.
Would it make sense to instead make this a required parameter?
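The required-parameter approach the reviewer suggests could look like the following minimal sketch. All names here (`Process`, `NewProcess`, `Restart`) are hypothetical stand-ins, not the actual envbox types:

```go
package main

import (
	"fmt"
	"sync"
)

// Process is a trimmed stand-in for the type under review. In this sketch
// the binary name is supplied once at construction time, so a restart can
// never lose it and callers can't forget to set it.
type Process struct {
	mu      sync.Mutex
	binName string
}

// NewProcess makes the binary name a required, validated argument.
func NewProcess(binName string) (*Process, error) {
	if binName == "" {
		return nil, fmt.Errorf("binName is required")
	}
	return &Process{binName: binName}, nil
}

// Restart reuses the stored name; no re-setting needed after restarts.
func (p *Process) Restart() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.binName
}

func main() {
	p, err := NewProcess("dockerd")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Restart())
}
```

The design trade-off: a constructor parameter turns a "remember to call the setter" runtime invariant into a compile-time one.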
```go
// dockerd is launched via an unshare wrapper that exec's into
// dockerd with these args; assert against the full wrapped argv.
wrapCmd, wrapArgs := cli.WrapDockerdCmd(dockerdArgs)
```
This makes the test tautological. You should instead explicitly specify the expected argv.
```go
shellCmd := fmt.Sprintf(`set -e
# cgroup v2: enable nesting
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
# Remount /sys/fs/cgroup so the new cgroup namespace's view becomes the
# fs root. Inner container cgroups will then be created under the envbox
# container's cgroup on the host.
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup

mkdir -p /sys/fs/cgroup/init
# this happens in a loop because things like "docker exec" on our dind
# container will create new processes, which creates a race between our
# moving everything to "init" and enabling subtree_control
envbox_attempts=0
while ! {
# move the processes from the root group to the /init group,
# otherwise writing subtree_control fails with EBUSY.
# An error during moving non-existent process (i.e., "cat") is ignored.
xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
# enable controllers
sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers \
> /sys/fs/cgroup/cgroup.subtree_control
}; do
envbox_attempts=$((envbox_attempts + 1))
if [ "$envbox_attempts" -ge %d ]; then
echo "envbox: failed to enable cgroup.subtree_control after $envbox_attempts attempts" >&2
exit 1
fi
done
fi
exec "$0" "$@"
```
suggestion: use //go:embed to extract this to a standalone file. You then get syntax highlighting and the ability to test it independently.
```go
// dockerdSubtreeControlMaxAttempts caps the cgroup-subtree-control retry loop
// so a stuck race can't block envbox startup indefinitely.
const dockerdSubtreeControlMaxAttempts = 100
```
Worth mentioning this is a divergence from the upstream implementation. (I think we should keep it though)
```shell
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup
```
suggestion: `set -e` will cause this to abort with no context-specific error message, so consider something like

```shell
umount /sys/fs/cgroup || { echo "envbox: failed to umount cgroup" >&2; exit 1; }
```
This has been merged! You should be good to rebase or merge main now.
Closes #168
Summary
Wrap the dockerd invocation with `unshare --cgroup`, a fresh `cgroup2` remount, and the cgroup-delegation logic from moby's `hack/dind` script. After the wrapper runs, dockerd's view of `/sys/fs/cgroup` is rooted at the envbox container's own cgroup, so all inner container cgroups are created as descendants of that scope on the host. Host-level cgroup-aware observability tools (Tetragon, Falco, custom eBPF agents) can then attribute processes inside inner containers to the parent pod.

This is the approach Isovalent (the Cilium/Tetragon vendor) recommended after testing it themselves — see moby/moby#45378 (comment).
Changes
- `cli/docker.go`: new `wrapDockerdCmd(dargs)` helper that returns the `unshare`-prefixed command + args. The shell snippet does:
  - `umount /sys/fs/cgroup` + `mount -t cgroup2 cgroup /sys/fs/cgroup` — a fresh cgroup2 mount rooted at the new cgroup namespace's root (the envbox container's cgroup on the host)
  - `mkdir /sys/fs/cgroup/init` + a retry loop that moves all processes via `xargs -rn1 < cgroup.procs > init/cgroup.procs` and enables every available controller in `cgroup.subtree_control` — needed because cgroupv2 forbids a cgroup from having both processes and `subtree_control` set
  - `exec` into dockerd with the original args
- All launches of `dockerd` (the initial start and the two `IsNoSpaceErr` recovery paths) go through the wrapper.
- `xunix/sys.go`, `readCPUQuotaCGroupV2`: if `/sys/fs/cgroup/<self>/cpu.max` doesn't exist (because the cgroup remount above has reparented the mount root), read `/sys/fs/cgroup/cpu.max` directly. The mount root IS the current cgroup in that case, so the values are equivalent.

Why no `--mount` on unshare?

`unshare --cgroup` alone leaves the mount namespace shared with the parent envbox process, so the `umount` + `mount -t cgroup2` operations leak back to envbox. We tried isolating with `--mount`:
- `--propagation private` → breaks sysbox-fs: its per-container mounts under `/var/lib/sysboxfs/<id>/` stop being visible to `sysbox-runc` in the dockerd namespace and inner-container creation fails.
- `--propagation slave` → same failure: the parent mount points are `private` by default, so nothing propagates to the slave child.

Making `/var/lib/sysboxfs/` (or `/`) `rshared` in envbox's mount namespace before the unshare would work but is more invasive. The pragmatic compromise: accept the mount-namespace leak and handle the one observable side effect (the CPU quota read) via the `xunix/sys.go` fallback. CPU enforcement is unaffected — the inner workspace container's cgroup is now a descendant of the pod's cgroup tree, so the pod's resource limits apply transitively.
Verification

Tested against:
- Tetragon with `--enable-cgtrackerid=true`

End-to-end check: after a workspace start, running `whoami` from a Coder terminal (inside the inner workspace container) now produces a Tetragon `process_exec` event with a full `pod` field — namespace, name, UID, container ID, image, security context, and pod labels. Before this change the same event had no `pod` field at all.

Workspace startup is otherwise unchanged. Sysbox-fs continues to provide its `/sys` and `/proc` virtualization correctly. The "no internal processes" cgroupv2 rule that previously blocked nested container creation under a privileged pod's `.scope` cgroup is now satisfied because all envbox/sysbox processes are first migrated into the `/init` sibling cgroup.

Backwards compatibility
The wrapper is a no-op for cgroup configurations the delegation block doesn't apply to:
- cgroup v1: the `if [ -f /sys/fs/cgroup/cgroup.controllers ]` guard skips the delegation block; only the umount/mount happens.
- kernels without `unshare --cgroup` (< 4.6, very rare today): would surface as a startup error.

For cgroup v2 hosts the only behavioural change is the cgroup placement of inner containers — which is what we want.
References
- moby's `hack/dind` script (cgroup delegation logic borrowed from there)
- Tetragon's `cgtracker` (consumes the new hierarchy on the agent side)