
Neutralize zero-advantage samples to skip wasted forward compute #1901

Open · nanjiangwill wants to merge 8 commits into main from filter-zero-reward

Conversation

@nanjiangwill (Collaborator) commented May 11, 2026

Summary

In _convert_samples_to_train_data, for samples with post-processed reward == 0, replace tokens with a 2-token pad sequence (pad_token_id = 0 matching slime/backends/megatron_utils/data.py), set response_length = 1, zero the loss mask, and mark remove_sample. Forward pass becomes trivially cheap and gradient is zero. Sample count is unchanged. Logs rollout/neutralized_ratio per rollout.
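A minimal sketch of that neutralization step, assuming a Sample object with `tokens`, `response_length`, `loss_mask`, and `remove_sample` fields (the real logic lives in `_convert_samples_to_train_data` and may differ):

```python
# Hypothetical sketch; field names beyond those quoted in the PR description
# are assumptions about slime's Sample structure.
PAD_TOKEN_ID = 0  # matches the pad id used in slime/backends/megatron_utils/data.py

def neutralize(sample) -> None:
    """Shrink a zero-reward sample so its forward pass is trivially cheap."""
    sample.tokens = [PAD_TOKEN_ID, PAD_TOKEN_ID]  # 2-token pad sequence
    sample.response_length = 1
    sample.loss_mask = [0]        # no token contributes to the loss
    sample.remove_sample = True   # convert loop re-zeros the mask defensively

def neutralize_zero_reward(samples, rewards):
    neutralized = 0
    for sample, reward in zip(samples, rewards):
        if reward == 0:
            neutralize(sample)
            neutralized += 1
    # Sample count is unchanged, so downstream dp partitioning is untouched.
    return {"rollout/neutralized_ratio": neutralized / max(len(samples), 1)}
```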

🤖 Generated with Claude Code

Adds --filter-zero-advantage-samples to drop samples whose post-processed
reward is 0 before training (they contribute zero gradient). When fewer
than dp_size non-zero samples survive, pad to dp_size with dropped
zero-advantage samples and zero their loss masks via remove_sample.
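A rough sketch of that drop-and-pad behavior (later superseded by in-place neutralization, per the commits below); the list-based shapes and `dp_size` handling are assumptions, and the real `_filter_zero_advantage_samples` in slime/ray/rollout.py may differ:

```python
def filter_zero_advantage_samples(samples, raw_rewards, rewards, dp_size):
    keep = [i for i, r in enumerate(rewards) if r != 0]
    dropped = [i for i, r in enumerate(rewards) if r == 0]
    # Pad back to dp_size with dropped samples whose loss masks are zeroed,
    # so every dp rank still receives at least one sample.
    while len(keep) < dp_size and dropped:
        i = dropped.pop()
        samples[i].loss_mask = [0] * len(samples[i].loss_mask)
        samples[i].remove_sample = True  # placeholder only, zero gradient
        keep.append(i)
    keep.sort()
    return (
        [samples[i] for i in keep],
        [raw_rewards[i] for i in keep],
        [rewards[i] for i in keep],
    )
```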

Restructures rollout logging so each wandb key has a single writer (see the sketch after this list):
- rollout-source aggregates (raw_reward, rewards, response_lengths,
  total_lengths) are emitted from RolloutManager._log_rollout_data on
  the pre-filter batch.
- train-side log_rollout_data skip-lists those keys; it now only emits
  train artifacts (log_probs, advantages, kl, returns, values, etc.).
- filter metrics live under rollout/filter/*.
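
A minimal sketch of the single-writer split: the skip-listed key names come from the list above, while the metrics-dict shape and function names are assumptions about slime's logging code:

```python
ROLLOUT_SOURCE_KEYS = {"raw_reward", "rewards", "response_lengths", "total_lengths"}

def log_rollout_data_train_side(metrics: dict, log_fn) -> None:
    # The train side skip-lists rollout-source aggregates so that
    # RolloutManager._log_rollout_data remains their only writer.
    log_fn({k: v for k, v in metrics.items() if k not in ROLLOUT_SOURCE_KEYS})
```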

Refactors _convert_samples_to_train_data into a pure converter: callers
must pass precomputed (raw_rewards, rewards). The custom convert hook
signature changes to match.
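
As a signature sketch (argument order per the review summary below; types are assumptions, not verbatim from the diff):

```python
def custom_convert_samples_to_train_data(args, samples, raw_rewards, rewards):
    """Pure converter: builds train data from samples plus the precomputed
    (raw_rewards, rewards) lists; it must not recompute rewards itself."""
    ...
```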

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nanjiangwill changed the title "Filter zero-advantage samples; split rollout/train logging boundary" → "filter zero-advantage samples; split rollout/train logging boundary" May 11, 2026
@zhuzilin requested a review from Copilot May 11, 2026 08:35
Copilot AI (Contributor) left a comment


Pull request overview

This PR optimizes rollout→train data flow by (1) adding an option to drop zero-advantage samples (with padding back to dp_size when needed) and (2) moving rollout-derived aggregate metrics (raw_reward, rewards, response_lengths, total_lengths) to be logged on the rollout side so each W&B key has a single writer. It also updates the plugin hook contract so custom convert_samples_to_train_data implementations receive (samples, raw_rewards, rewards).

Changes:

  • Add --filter-zero-advantage-samples (requires --use-dynamic-global-batch-size) and apply filtering/padding before conversion to train data.
  • Split rollout vs train-side logging responsibility by logging reward/length aggregates in RolloutManager._log_rollout_data and skipping them in Megatron-side rollout logging.
  • Update custom convert hook signature and its contract test to accept (args, samples, raw_rewards, rewards).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Reviewed files:

  • tests/plugin_contracts/test_plugin_runtime_hook_contracts.py — Updates the plugin contract test for the breaking hook signature change (the convert hook now receives rewards inputs).
  • slime/utils/arguments.py — Adds the CLI flag plus validation for zero-advantage filtering; updates help text for the convert hook signature.
  • slime/ray/rollout.py — Computes rewards earlier, adds zero-advantage filtering/padding, refactors the conversion signature, and moves rollout aggregates into rollout-side logging.
  • slime/backends/megatron_utils/data.py — Prevents duplicate W&B writers by skipping rollout-source aggregate keys that are now logged in slime/ray/rollout.py.


Comment thread on slime/ray/rollout.py (outdated), lines +1242 to +1243:

```python
sample.metadata["raw_reward"] if sample.metadata and "raw_reward" in sample.metadata else sample.reward
for sample in samples
```
Comment thread on slime/ray/rollout.py (outdated), lines +493 to +500:

```python
if self.args.filter_zero_advantage_samples:
    data, raw_rewards, rewards = self._filter_zero_advantage_samples(data, raw_rewards, rewards)
    self._dynamic_global_batch_size = self._compute_dynamic_global_batch_size(len(data))

if self.custom_convert_samples_to_train_data_func is not None:
    data = self.custom_convert_samples_to_train_data_func(self.args, data, raw_rewards, rewards)
else:
    data = self._convert_samples_to_train_data(data, raw_rewards, rewards)
```
zhuzilin and others added 2 commits May 11, 2026 15:46
Instead of dropping zero-advantage samples and dealing with dp_size
partition complexity, mutate them in place: replace tokens with a
2-token pad sequence, set response_length=1, zero the loss mask. Forward
pass becomes trivially cheap and gradient is zero, while sample count
stays constant so all downstream partitioning works unchanged.

Drops the --filter-zero-advantage-samples flag (now default) and its
validator. Adds rollout/zero_advantage_ratio to track the underlying
policy state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nanjiangwill changed the title "filter zero-advantage samples; split rollout/train logging boundary" → "Neutralize zero-advantage samples; split rollout/train logging boundary" May 11, 2026
For samples whose post-processed reward is 0, replace tokens with a
2-token pad sequence and zero the loss mask. Forward pass is cheap and
gradient is zero. Sample count is unchanged so downstream partitioning
is untouched. Logs rollout/neutralized_ratio per rollout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nanjiangwill changed the title "Neutralize zero-advantage samples; split rollout/train logging boundary" → "Neutralize zero-advantage samples to skip wasted forward compute" May 11, 2026
nanjiangwill and others added 4 commits May 11, 2026 10:15
…to main

Earlier restore happened against a stale fetch, inadvertently reverting
the upstream rollout_data_postprocess CI fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
loss_mask=[0] already gives the same effect during training; remove_sample
would just retrigger the same zeroing inside _convert_samples_to_train_data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing convert loop re-zeros loss_mask when remove_sample is set.
Keeping both protects the invariant against future refactors or plugin
paths that might reset loss_mask between this loop and the convert loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
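
A sketch of that belt-and-braces re-zeroing (field names assumed; the real convert loop may differ):

```python
for sample in data:
    if getattr(sample, "remove_sample", False):
        # Re-zero even if a plugin path reset loss_mask between the
        # neutralization loop and this convert loop.
        sample.loss_mask = [0] * len(sample.loss_mask)
```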
Only these estimators compute per-token advantage as a scalar broadcast
of rewards (advantage = rewards * ones). For them, r==0 implies the
sample contributes zero gradient and is safe to neutralize. ppo (GAE)
and reinforce_plus_plus (token-level kl rewards) can have non-zero
advantage even when r==0, so they are excluded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
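
A toy check of the zero-gradient claim, assuming a policy-gradient loss of the form sum(advantage * logprob); slime's real estimator code is not shown here:

```python
import torch

# advantage = rewards * ones: the scalar reward is broadcast over the response.
def broadcast_advantage(reward: float, response_length: int) -> torch.Tensor:
    return reward * torch.ones(response_length)

logprobs = torch.randn(8, requires_grad=True)   # stand-in for token log-probs
adv = broadcast_advantage(0.0, 8)               # r == 0 -> all-zero advantages
loss = (adv * logprobs).sum()
loss.backward()
assert torch.all(logprobs.grad == 0)            # the sample adds no gradient
# GAE (ppo) and reinforce_plus_plus mix in values / token-level KL rewards,
# so r == 0 does not imply zero advantage there; hence their exclusion.
```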