feat(profile): safer torch.profiler defaults + per-grad-step capture#1879
Open
leofan-lab wants to merge 2 commits into
Conversation
- Default to rank-0 only (previously all ranks). Each rank allocates
~60 GB of profiler buffer on 26B MoEs; all-rank capture has caused
host-OOM. Opt in via SLIME_PROFILE_ALL_RANKS=1.
- Default record_shapes / with_flops / with_stack / profile_memory
to off. Opt in via SLIME_PROFILE_{RECORD_SHAPES,WITH_FLOPS,
WITH_STACK,MEMORY}=1.
- Add train_actor and train_log_probs profile targets scoped to a
single step (~15 MB trace, openable in Perfetto). Previously only
train_overall existed, which produced multi-GB traces spanning a
full rollout. Wired via a step_callback parameter on train() and
forward_only(); callback is None when profiling is off, keeping
the hot path unchanged.
zhuzilin
approved these changes
May 11, 2026
Contributor
Author
The merge seems to cause an indentation error and thus a test failure; would you mind fixing it?
Summary
Three related fixes to make slime's `torch.profiler` wiring usable on large MoE models (26B+), where the default settings OOM the host before any trace can flush.

Motivation
The existing profiling setup has a few gotchas that make it hard to use on real workloads:

- The `torch.profiler` buffer reaches ~60 GB during the active window. Across 16 ranks, that's ~1 TB of host RAM pressure, and we've seen host-OOM kills before any trace flushes to disk.
- The only `--profile-target` was `train_overall`, which spans an entire rollout (~45 min of kernel events on a 26B MoE). The resulting trace files are 10+ GB gzipped; they take hours to serialize, and Chrome/Perfetto can't load them.
- `record_shapes` / `with_flops` / `with_stack` / `profile_memory` all default to on, which further amplifies memory and trace size.

Changes
1. Rank-0 only by default

`_should_profile_this_rank()` defaults to rank 0 only. Set `SLIME_PROFILE_ALL_RANKS=1` to opt into per-rank traces when diagnosing cross-rank sync or PP-stage imbalance.
2. Conservative metadata defaults
All four memory-amplifier flags default to off. Opt in via env vars:

- `SLIME_PROFILE_RECORD_SHAPES=1` → `record_shapes=True`
- `SLIME_PROFILE_WITH_FLOPS=1` → `with_flops=True`
- `SLIME_PROFILE_WITH_STACK=1` → `with_stack=True`
- `SLIME_PROFILE_MEMORY=1` → `profile_memory=True`

Env vars accept `1`/`true`/`yes` (case-insensitive) as truthy.

3. Add `train_actor` / `train_log_probs` profile targets (per-step capture)

Two new targets scope capture to a single step instead of a full rollout:
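The env-var mapping above can be sketched with a small helper; the `_env_flag` name and the `profiler_kwargs` dict are illustrative, not the PR's actual code:

```python
import os


def _env_flag(name: str) -> bool:
    """True iff env var `name` is set to 1/true/yes (case-insensitive)."""
    return os.environ.get(name, "").strip().lower() in ("1", "true", "yes")


def profiler_kwargs() -> dict:
    # All four memory-amplifier flags default to off; each is an explicit opt-in.
    return dict(
        record_shapes=_env_flag("SLIME_PROFILE_RECORD_SHAPES"),
        with_flops=_env_flag("SLIME_PROFILE_WITH_FLOPS"),
        with_stack=_env_flag("SLIME_PROFILE_WITH_STACK"),
        profile_memory=_env_flag("SLIME_PROFILE_MEMORY"),
    )
```

The resulting dict can be splatted into the `torch.profiler.profile(...)` constructor.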
- `train_actor`: one active window covers a single grad-accum step inside `actor_train`. Trace is ~500× smaller (~15 MB) and actually openable in Perfetto/Chrome.
- `train_log_probs`: one active window per log-probs or values forward pass.

Both are wired via a `step_callback: Callable[[], None] | None = None` parameter on `train()` and `forward_only()`. The actor passes `self.prof.step_train_actor` / `self.prof.step_train_log_probs` through. When profiling is disabled the callback is `None`, so the hot path is unchanged.

Existing `train_overall` target continues to work.

Testing
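A minimal sketch of the `step_callback` wiring, under the assumption (stated in the description) that the callback fires once per grad-accum step and is `None` when profiling is off; the loop body is a placeholder, not slime's actual training code:

```python
from typing import Callable, Optional

StepCallback = Optional[Callable[[], None]]


def train(num_steps: int, step_callback: StepCallback = None) -> None:
    """Run `num_steps` grad-accum steps, ticking the profiler after each."""
    for _ in range(num_steps):
        # ... forward / backward / optimizer step (elided) ...
        if step_callback is not None:
            # With profiling on, this would be e.g. self.prof.step_train_actor,
            # advancing the profiler schedule so one active window = one step.
            step_callback()
```

With `step_callback=None` (profiling disabled), the loop does no extra work, which is how the hot path stays unchanged.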
Tested on:

- `--profile-target train_actor`: produces a single ~50 MB trace at rank 0.
- `--profile-target train_log_probs`: verified wiring.

Compatibility

- No behavior change unless you opt into `--profile-target train_actor`/`train_log_probs` or the new env vars.
- All-rank capture now requires `SLIME_PROFILE_ALL_RANKS=1`.
- `record_shapes` etc. must opt in via env vars.