Neutralize zero-advantage samples to skip wasted forward compute #1901
Open
nanjiangwill wants to merge 8 commits into
Conversation
Adds `--filter-zero-advantage-samples` to drop samples whose post-processed reward is 0 before training (they contribute zero gradient). When fewer than `dp_size` non-zero samples survive, the batch is padded back to `dp_size` with dropped zero-advantage samples whose loss masks are zeroed via `remove_sample`.

Restructures rollout logging so each wandb key has a single writer:
- rollout-source aggregates (`raw_reward`, `rewards`, `response_lengths`, `total_lengths`) are emitted from `RolloutManager._log_rollout_data` on the pre-filter batch.
- the train-side `log_rollout_data` skip-lists those keys; it now only emits train artifacts (`log_probs`, `advantages`, `kl`, `returns`, `values`, etc.).
- filter metrics live under `rollout/filter/*`.

Refactors `_convert_samples_to_train_data` into a pure converter: callers must pass precomputed `(raw_rewards, rewards)`. The custom convert hook signature changes to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
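A minimal sketch of the filter-and-pad step described above, under a simplified `Sample` shape; the helper name and fields here are illustrative, not the PR's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    tokens: list
    reward: float            # post-processed reward
    loss_mask: list = field(default_factory=list)
    remove_sample: bool = False

def filter_zero_advantage_samples(samples, dp_size):
    """Drop zero-reward samples; if fewer than dp_size survive, pad back
    to dp_size with dropped samples whose loss masks are zeroed."""
    kept = [s for s in samples if s.reward != 0]
    dropped = [s for s in samples if s.reward == 0]
    while len(kept) < dp_size and dropped:
        pad = dropped.pop()
        pad.loss_mask = [0] * len(pad.loss_mask)  # contributes no gradient
        pad.remove_sample = True
        kept.append(pad)
    return kept

batch = [
    Sample(tokens=[1, 2], reward=1.0, loss_mask=[1, 1]),
    Sample(tokens=[3, 4], reward=0.0, loss_mask=[1, 1]),
    Sample(tokens=[5, 6], reward=0.0, loss_mask=[1, 1]),
]
kept = filter_zero_advantage_samples(batch, dp_size=2)
print(len(kept))          # 2: one real sample plus one zeroed pad sample
print(kept[1].loss_mask)  # [0, 0]
```

The padding keeps every data-parallel rank fed while guaranteeing the pad samples add nothing to the loss.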
Contributor
Pull request overview
This PR optimizes rollout→train data flow by (1) adding an option to drop zero-advantage samples (with padding back to dp_size when needed) and (2) moving rollout-derived aggregate metrics (raw_reward, rewards, response_lengths, total_lengths) to be logged on the rollout side so each W&B key has a single writer. It also updates the plugin hook contract so custom convert_samples_to_train_data implementations receive (samples, raw_rewards, rewards).
Changes:
- Add `--filter-zero-advantage-samples` (requires `--use-dynamic-global-batch-size`) and apply filtering/padding before conversion to train data.
- Split rollout vs. train-side logging responsibility by logging reward/length aggregates in `RolloutManager._log_rollout_data` and skipping them in Megatron-side rollout logging.
- Update the custom convert hook signature and its contract test to accept `(args, samples, raw_rewards, rewards)`.
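The updated hook contract can be sketched as follows. Only the `(args, samples, raw_rewards, rewards)` signature comes from the PR; the `Sample` shape and the returned dict are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    tokens: list

def custom_convert_samples_to_train_data(args, samples, raw_rewards, rewards):
    # Pure converter: rewards are precomputed by the caller, so the hook
    # no longer derives them from the samples itself.
    return {
        "tokens": [s.tokens for s in samples],
        "raw_rewards": list(raw_rewards),
        "rewards": list(rewards),
    }

train_data = custom_convert_samples_to_train_data(
    args=None,
    samples=[Sample([1, 2]), Sample([3, 4])],
    raw_rewards=[0.5, 0.0],
    rewards=[1.0, 0.0],
)
print(train_data["rewards"])  # [1.0, 0.0]
```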
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/plugin_contracts/test_plugin_runtime_hook_contracts.py | Updates the plugin contract test for the breaking hook signature change (convert hook now receives rewards inputs). |
| slime/utils/arguments.py | Adds the CLI flag + validation for zero-advantage filtering; updates help text for convert hook signature. |
| slime/ray/rollout.py | Computes rewards earlier, adds zero-advantage filtering/padding, refactors conversion signature, and moves rollout aggregates into rollout-side logging. |
| slime/backends/megatron_utils/data.py | Prevents duplicate W&B writers by skipping rollout-source aggregate keys that are now logged in slime/ray/rollout.py. |
Comment on lines +1242 to +1243:

    sample.metadata["raw_reward"] if sample.metadata and "raw_reward" in sample.metadata else sample.reward
    for sample in samples
Comment on lines +493 to +500:

    if self.args.filter_zero_advantage_samples:
        data, raw_rewards, rewards = self._filter_zero_advantage_samples(data, raw_rewards, rewards)
        self._dynamic_global_batch_size = self._compute_dynamic_global_batch_size(len(data))

    if self.custom_convert_samples_to_train_data_func is not None:
        data = self.custom_convert_samples_to_train_data_func(self.args, data, raw_rewards, rewards)
    else:
        data = self._convert_samples_to_train_data(data, raw_rewards, rewards)
Instead of dropping zero-advantage samples and dealing with `dp_size` partition complexity, mutate them in place: replace tokens with a 2-token pad sequence, set `response_length=1`, and zero the loss mask. The forward pass becomes trivially cheap and the gradient is zero, while the sample count stays constant so all downstream partitioning works unchanged. Drops the `--filter-zero-advantage-samples` flag (the behavior is now the default) and its validator. Adds `rollout/zero_advantage_ratio` to track the underlying policy state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For samples whose post-processed reward is 0, replace tokens with a 2-token pad sequence and zero the loss mask. The forward pass is cheap and the gradient is zero. The sample count is unchanged, so downstream partitioning is untouched. Logs `rollout/neutralized_ratio` per rollout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
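The in-place neutralization described in these commits can be sketched as below, assuming a pad id of 0 and a simplified sample shape (both illustrative):

```python
from dataclasses import dataclass

PAD_TOKEN_ID = 0  # assumed to match the backend's pad id

@dataclass
class Sample:
    tokens: list
    response_length: int
    loss_mask: list

def neutralize(sample):
    """Replace a zero-advantage sample with a trivial 2-token pad sequence
    whose single response token is masked out: the forward pass is cheap,
    the gradient is exactly zero, and the sample count is unchanged."""
    sample.tokens = [PAD_TOKEN_ID, PAD_TOKEN_ID]
    sample.response_length = 1
    sample.loss_mask = [0]  # one response token, contributing no loss
    return sample

s = neutralize(Sample(tokens=list(range(500)), response_length=480,
                      loss_mask=[1] * 480))
print(s.tokens, s.response_length, s.loss_mask)  # [0, 0] 1 [0]
```

Because only the sample's contents change, batch size and every downstream partitioning invariant are preserved without a `dp_size` padding path.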
…to main

An earlier restore happened against a stale fetch, inadvertently reverting the upstream `rollout_data_postprocess` CI fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`loss_mask=[0]` already gives the same effect during training; `remove_sample` would just retrigger the same zeroing inside `_convert_samples_to_train_data`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing convert loop re-zeros `loss_mask` when `remove_sample` is set. Keeping both protects the invariant against future refactors or plugin paths that might reset `loss_mask` between this loop and the convert loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Only these estimators compute the per-token advantage as a scalar broadcast of rewards (`advantage = rewards * ones`). For them, `r == 0` implies the sample contributes zero gradient and is safe to neutralize. `ppo` (GAE) and `reinforce_plus_plus` (token-level KL rewards) can have non-zero advantage even when `r == 0`, so they are excluded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
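The distinction can be illustrated with a toy comparison; these are simplified textbook formulas, not slime's actual estimator code:

```python
def broadcast_advantage(reward, response_length):
    # Scalar-broadcast estimators: advantage = rewards * ones, so a zero
    # reward zeroes every token's advantage.
    return [reward] * response_length

def gae_advantage(rewards, values, gamma=1.0, lam=1.0):
    # GAE mixes per-token value estimates into the advantage, so advantages
    # need not vanish even when every reward is zero.
    adv, last = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

print(broadcast_advantage(0.0, 3))                       # [0.0, 0.0, 0.0]
print(gae_advantage([0.0, 0.0, 0.0], [0.5, -0.2, 0.1]))  # non-zero entries
```

This is why neutralization is gated on the estimator: for GAE-style estimators a zero terminal reward gives no guarantee of a zero gradient.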
Summary
In `_convert_samples_to_train_data`, for samples with post-processed reward == 0, replace tokens with a 2-token pad sequence (`pad_token_id = 0`, matching `slime/backends/megatron_utils/data.py`), set `response_length = 1`, zero the loss mask, and mark `remove_sample`. The forward pass becomes trivially cheap and the gradient is zero. The sample count is unchanged. Logs `rollout/neutralized_ratio` per rollout.

🤖 Generated with Claude Code