
feat(profile): safer torch.profiler defaults + per-grad-step capture #1879

Open

leofan-lab wants to merge 2 commits into THUDM:main from leofan-lab:feat/profile-safety-improvements

Conversation

@leofan-lab
Contributor

Summary

Three related fixes to make slime's torch.profiler wiring usable on large MoE models (26B+), where the default settings OOM the host before any trace can flush.

Motivation

The existing profiling setup has a few gotchas that make it hard to use on real workloads:

  • Every rank allocates a full profiler buffer. On a 26B MoE, each rank's torch.profiler buffer reaches ~60 GB during the active window. Across 16 ranks, that's ~1 TB of host RAM pressure, and we've seen host-OOM kills before any trace flushes to disk.
  • The only viable --profile-target was train_overall, which spans an entire rollout (~45 min of kernel events on a 26B MoE). The resulting trace files are 10+ GB gzipped — they take hours to serialize, and Chrome/Perfetto can't load them.
  • record_shapes / with_flops / with_stack / profile_memory all default to on, which further amplifies memory and trace size.

Changes

1. Rank-0 only by default

_should_profile_this_rank() defaults to rank 0 only. Set SLIME_PROFILE_ALL_RANKS=1 to opt into per-rank traces when diagnosing cross-rank sync or PP-stage imbalance.
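
A minimal sketch of that gating, assuming a torch.distributed rank check; `_should_profile_this_rank` and `SLIME_PROFILE_ALL_RANKS` come from this PR, while the `_env_truthy` helper and module layout are illustrative only:

```python
import os

import torch.distributed as dist


def _env_truthy(name: str) -> bool:
    # Hypothetical helper: "1" / "true" / "yes" are accepted case-insensitively.
    return os.environ.get(name, "").strip().lower() in ("1", "true", "yes")


def _should_profile_this_rank() -> bool:
    # Profile every rank only when explicitly requested; otherwise rank 0 only,
    # so a single ~60 GB profiler buffer is allocated instead of one per rank.
    if _env_truthy("SLIME_PROFILE_ALL_RANKS"):
        return True
    rank = dist.get_rank() if dist.is_initialized() else 0
    return rank == 0
```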

2. Conservative metadata defaults

All four memory-amplifier flags default to off. Opt in via env vars:

Env var                          Enables
SLIME_PROFILE_RECORD_SHAPES=1    record_shapes=True
SLIME_PROFILE_WITH_FLOPS=1       with_flops=True
SLIME_PROFILE_WITH_STACK=1       with_stack=True
SLIME_PROFILE_MEMORY=1           profile_memory=True

Env vars accept 1 / true / yes (case-insensitive) as truthy.
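
A sketch of how these flags could feed torch.profiler; the SLIME_PROFILE_* names come from this PR and the keyword arguments are real torch.profiler options, while `make_profiler` and the schedule / on_trace_ready plumbing are assumptions for illustration:

```python
import os

from torch.profiler import ProfilerActivity, profile


def _env_truthy(name: str) -> bool:
    # Same hypothetical truthy parsing as above: "1" / "true" / "yes", case-insensitive.
    return os.environ.get(name, "").strip().lower() in ("1", "true", "yes")


def make_profiler(schedule, on_trace_ready):
    # All four memory-amplifying options default to off and are opt-in via env vars.
    return profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule,
        on_trace_ready=on_trace_ready,
        record_shapes=_env_truthy("SLIME_PROFILE_RECORD_SHAPES"),
        with_flops=_env_truthy("SLIME_PROFILE_WITH_FLOPS"),
        with_stack=_env_truthy("SLIME_PROFILE_WITH_STACK"),
        profile_memory=_env_truthy("SLIME_PROFILE_MEMORY"),
    )
```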

3. Add train_actor / train_log_probs profile targets (per-step capture)

Two new targets scope capture to a single step instead of a full rollout:

  • train_actor — one active window covers a single grad-accum step inside actor_train. Trace is ~500× smaller (~15 MB) and actually openable in Perfetto/Chrome.
  • train_log_probs — one active window per log-probs or values forward pass.

Both are wired via a step_callback: Callable[[], None] | None = None parameter on train() and forward_only(). The actor passes self.prof.step_train_actor / self.prof.step_train_log_probs through. When profiling is disabled the callback is None, so the hot path is unchanged.

Existing train_overall target continues to work.
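
A minimal sketch of the step_callback wiring, assuming a generic grad-accumulation loop; the train() signature and the step_train_actor / step_train_log_probs callbacks come from this PR, while the loop body and helper names are illustrative only:

```python
from typing import Callable


def train(
    data_iterator,
    step_callback: Callable[[], None] | None = None,
):
    for micro_batches in data_iterator:
        # Placeholder for the existing forward/backward/optimizer work of one
        # grad-accumulation step; unchanged by this PR.
        _run_grad_accum_step(micro_batches)
        if step_callback is not None:
            # With --profile-target train_actor the actor passes
            # self.prof.step_train_actor here, so the profiler schedule advances
            # once per grad-accum step and a single active window covers exactly
            # one step. When profiling is disabled this stays None.
            step_callback()


def _run_grad_accum_step(micro_batches):
    # Stand-in for the real per-step training logic.
    pass
```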

Testing

Tested on:

  • Qwen3.5-122B-A10B SFT (TP=2, PP=2, EP=16, 4×p5en) with --profile-target train_actor — produces a single ~50 MB trace at rank 0.
  • Qwen3.5-35B-A3B retool async RL (single node) with --profile-target train_log_probs — verified the wiring.

Compatibility

  • No behavior change for existing users unless they set --profile-target train_actor / train_log_probs or the new env vars.
  • Default profile output is now rank-0 only (was all ranks). Users relying on per-rank traces must set SLIME_PROFILE_ALL_RANKS=1.
  • Default metadata is now more conservative. Users relying on record_shapes etc. must opt in via env vars.

leofan-lab and others added 2 commits April 29, 2026 16:29
- Default to rank-0 only (previously all ranks). Each rank allocates
  ~60 GB of profiler buffer on 26B MoEs; all-rank capture has caused
  host-OOM. Opt in via SLIME_PROFILE_ALL_RANKS=1.

- Default record_shapes / with_flops / with_stack / profile_memory
  to off. Opt in via SLIME_PROFILE_{RECORD_SHAPES,WITH_FLOPS,
  WITH_STACK,MEMORY}=1.

- Add train_actor and train_log_probs profile targets scoped to a
  single step (~15 MB trace, openable in Perfetto). Previously only
  train_overall existed, which produced multi-GB traces spanning a
  full rollout. Wired via a step_callback parameter on train() and
  forward_only(); callback is None when profiling is off, keeping
  the hot path unchanged.
@leofan-lab
Contributor Author

leofan-lab commented May 11, 2026

The merge seems to have caused an indentation error and thus a test failure; would you mind fixing it?
