
feat(profile): safer torch.profiler defaults + per-grad-step capture #1879

Open

leofan-lab wants to merge 2 commits into THUDM:main from leofan-lab:feat/profile-safety-improvements

Conversation

@leofan-lab
Contributor

Summary

Three related fixes to make slime's torch.profiler wiring usable on large MoE models (26B+), where the default settings OOM the host before any trace can flush.

Motivation

The existing profiling setup has a few gotchas that make it hard to use on real workloads:

  • Every rank allocates a full profiler buffer. On a 26B MoE, each rank's torch.profiler buffer reaches ~60 GB during the active window. Across 16 ranks, that's ~1 TB of host RAM pressure, and we've seen host-OOM kills before any trace flushes to disk.
  • The only viable --profile-target was train_overall, which spans an entire rollout (~45 min of kernel events on a 26B MoE). The resulting trace files are 10+ GB gzipped — they take hours to serialize, and Chrome/Perfetto can't load them.
  • record_shapes / with_flops / with_stack / profile_memory all default to on, which further amplifies memory and trace size.

Changes

1. Rank-0 only by default

_should_profile_this_rank() defaults to rank 0 only. Set SLIME_PROFILE_ALL_RANKS=1 to opt into per-rank traces when diagnosing cross-rank sync or PP-stage imbalance.
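
A minimal sketch of that gating, assuming a torch.distributed rank check; `_should_profile_this_rank` and `SLIME_PROFILE_ALL_RANKS` come from this PR, while the `_env_truthy` helper and module layout are illustrative only:

```python
import os

import torch.distributed as dist


def _env_truthy(name: str) -> bool:
    # Hypothetical helper: "1" / "true" / "yes" are accepted case-insensitively.
    return os.environ.get(name, "").strip().lower() in ("1", "true", "yes")


def _should_profile_this_rank() -> bool:
    # Profile every rank only when explicitly requested; otherwise rank 0 only,
    # so a single ~60 GB profiler buffer is allocated instead of one per rank.
    if _env_truthy("SLIME_PROFILE_ALL_RANKS"):
        return True
    rank = dist.get_rank() if dist.is_initialized() else 0
    return rank == 0
```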

2. Conservative metadata defaults

All four memory-amplifier flags default to off. Opt in via env vars:

Env var                          Enables
SLIME_PROFILE_RECORD_SHAPES=1    record_shapes=True
SLIME_PROFILE_WITH_FLOPS=1       with_flops=True
SLIME_PROFILE_WITH_STACK=1       with_stack=True
SLIME_PROFILE_MEMORY=1           profile_memory=True

Env vars accept 1 / true / yes (case-insensitive) as truthy.
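
A sketch of how these flags could feed torch.profiler; the SLIME_PROFILE_* names come from this PR and the keyword arguments are real torch.profiler options, while `make_profiler` and the schedule / on_trace_ready plumbing are assumptions for illustration:

```python
import os

from torch.profiler import ProfilerActivity, profile


def _env_truthy(name: str) -> bool:
    # Same hypothetical truthy parsing as above: "1" / "true" / "yes", case-insensitive.
    return os.environ.get(name, "").strip().lower() in ("1", "true", "yes")


def make_profiler(schedule, on_trace_ready):
    # All four memory-amplifying options default to off and are opt-in via env vars.
    return profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule,
        on_trace_ready=on_trace_ready,
        record_shapes=_env_truthy("SLIME_PROFILE_RECORD_SHAPES"),
        with_flops=_env_truthy("SLIME_PROFILE_WITH_FLOPS"),
        with_stack=_env_truthy("SLIME_PROFILE_WITH_STACK"),
        profile_memory=_env_truthy("SLIME_PROFILE_MEMORY"),
    )
```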

3. Add train_actor / train_log_probs profile targets (per-step capture)

Two new targets scope capture to a single step instead of a full rollout:

  • train_actor — one active window covers a single grad-accum step inside actor_train. Trace is ~500× smaller (~15 MB) and actually openable in Perfetto/Chrome.
  • train_log_probs — one active window per log-probs or values forward pass.

Both are wired via a step_callback: Callable[[], None] | None = None parameter on train() and forward_only(). The actor passes self.prof.step_train_actor / self.prof.step_train_log_probs through. When profiling is disabled the callback is None, so the hot path is unchanged.

Existing train_overall target continues to work.
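
A minimal sketch of the step_callback wiring, assuming a generic grad-accumulation loop; the train() signature and the step_train_actor / step_train_log_probs callbacks come from this PR, while the loop body and helper names are illustrative only:

```python
from typing import Callable


def train(
    data_iterator,
    step_callback: Callable[[], None] | None = None,
):
    for micro_batches in data_iterator:
        # Placeholder for the existing forward/backward/optimizer work of one
        # grad-accumulation step; unchanged by this PR.
        _run_grad_accum_step(micro_batches)
        if step_callback is not None:
            # With --profile-target train_actor the actor passes
            # self.prof.step_train_actor here, so the profiler schedule advances
            # once per grad-accum step and a single active window covers exactly
            # one step. When profiling is disabled this stays None.
            step_callback()


def _run_grad_accum_step(micro_batches):
    # Stand-in for the real per-step training logic.
    pass
```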

Testing

Tested on:

  • Qwen3.5-122B-A10B SFT (TP=2, PP=2, EP=16, 4×p5en) with --profile-target train_actor — produces a single ~50 MB trace at rank 0.
  • Qwen3.5-35B-A3B retool async RL (single node) with --profile-target train_log_probs — verified the wiring.

Compatibility

  • No behavior change for existing users unless they set --profile-target train_actor / train_log_probs or the new env vars.
  • Default profile output is now rank-0 only (was all ranks). Users relying on per-rank traces must set SLIME_PROFILE_ALL_RANKS=1.
  • Default metadata is now more conservative. Users relying on record_shapes etc. must opt in via env vars.

leofan-lab and others added 2 commits April 29, 2026 16:29
- Default to rank-0 only (previously all ranks). Each rank allocates
  ~60 GB of profiler buffer on 26B MoEs; all-rank capture has caused
  host-OOM. Opt in via SLIME_PROFILE_ALL_RANKS=1.

- Default record_shapes / with_flops / with_stack / profile_memory
  to off. Opt in via SLIME_PROFILE_{RECORD_SHAPES,WITH_FLOPS,
  WITH_STACK,MEMORY}=1.

- Add train_actor and train_log_probs profile targets scoped to a
  single step (~15 MB trace, openable in Perfetto). Previously only
  train_overall existed, which produced multi-GB traces spanning a
  full rollout. Wired via a step_callback parameter on train() and
  forward_only(); callback is None when profiling is off, keeping
  the hot path unchanged.
@leofan-lab
Contributor Author

leofan-lab commented May 11, 2026

The merge seems to have caused an indentation error and thus a test failure; would you mind fixing it?
