
Neutralize zero-advantage samples to skip wasted forward compute #1901

Open · nanjiangwill wants to merge 8 commits into main from filter-zero-reward

Conversation

@nanjiangwill (Collaborator) commented May 11, 2026

Summary

In _convert_samples_to_train_data, for samples with post-processed reward == 0, replace tokens with a 2-token pad sequence (pad_token_id = 0 matching slime/backends/megatron_utils/data.py), set response_length = 1, zero the loss mask, and mark remove_sample. Forward pass becomes trivially cheap and gradient is zero. Sample count is unchanged. Logs rollout/neutralized_ratio per rollout.
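A minimal sketch of that neutralization step, assuming a Sample object with `tokens`, `response_length`, `loss_mask`, and `remove_sample` fields (the real logic lives in `_convert_samples_to_train_data` and may differ):

```python
# Hypothetical sketch; field names beyond those quoted in the PR description
# are assumptions about slime's Sample structure.
PAD_TOKEN_ID = 0  # matches the pad id used in slime/backends/megatron_utils/data.py

def neutralize(sample) -> None:
    """Shrink a zero-reward sample so its forward pass is trivially cheap."""
    sample.tokens = [PAD_TOKEN_ID, PAD_TOKEN_ID]  # 2-token pad sequence
    sample.response_length = 1
    sample.loss_mask = [0]        # no token contributes to the loss
    sample.remove_sample = True   # convert loop re-zeros the mask defensively

def neutralize_zero_reward(samples, rewards):
    neutralized = 0
    for sample, reward in zip(samples, rewards):
        if reward == 0:
            neutralize(sample)
            neutralized += 1
    # Sample count is unchanged, so downstream dp partitioning is untouched.
    return {"rollout/neutralized_ratio": neutralized / max(len(samples), 1)}
```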

🤖 Generated with Claude Code

Adds --filter-zero-advantage-samples to drop samples whose post-processed
reward is 0 before training (they contribute zero gradient). When fewer
than dp_size non-zero samples survive, pad to dp_size with dropped
zero-advantage samples and zero their loss masks via remove_sample.
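A rough sketch of that drop-and-pad behavior (later superseded by in-place neutralization, per the commits below); the list-based shapes and `dp_size` handling are assumptions, and the real `_filter_zero_advantage_samples` in slime/ray/rollout.py may differ:

```python
def filter_zero_advantage_samples(samples, raw_rewards, rewards, dp_size):
    keep = [i for i, r in enumerate(rewards) if r != 0]
    dropped = [i for i, r in enumerate(rewards) if r == 0]
    # Pad back to dp_size with dropped samples whose loss masks are zeroed,
    # so every dp rank still receives at least one sample.
    while len(keep) < dp_size and dropped:
        i = dropped.pop()
        samples[i].loss_mask = [0] * len(samples[i].loss_mask)
        samples[i].remove_sample = True  # placeholder only, zero gradient
        keep.append(i)
    keep.sort()
    return (
        [samples[i] for i in keep],
        [raw_rewards[i] for i in keep],
        [rewards[i] for i in keep],
    )
```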

Restructures rollout logging so each wandb key has a single writer (see the sketch after this list):
- rollout-source aggregates (raw_reward, rewards, response_lengths,
  total_lengths) are emitted from RolloutManager._log_rollout_data on
  the pre-filter batch.
- train-side log_rollout_data skip-lists those keys; it now only emits
  train artifacts (log_probs, advantages, kl, returns, values, etc.).
- filter metrics live under rollout/filter/*.
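
A minimal sketch of the single-writer split: the skip-listed key names come from the list above, while the metrics-dict shape and function names are assumptions about slime's logging code:

```python
ROLLOUT_SOURCE_KEYS = {"raw_reward", "rewards", "response_lengths", "total_lengths"}

def log_rollout_data_train_side(metrics: dict, log_fn) -> None:
    # The train side skip-lists rollout-source aggregates so that
    # RolloutManager._log_rollout_data remains their only writer.
    log_fn({k: v for k, v in metrics.items() if k not in ROLLOUT_SOURCE_KEYS})
```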

Refactors _convert_samples_to_train_data into a pure converter: callers
must pass precomputed (raw_rewards, rewards). The custom convert hook
signature changes to match.
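
As a signature sketch (argument order per the review summary below; types are assumptions, not verbatim from the diff):

```python
def custom_convert_samples_to_train_data(args, samples, raw_rewards, rewards):
    """Pure converter: builds train data from samples plus the precomputed
    (raw_rewards, rewards) lists; it must not recompute rewards itself."""
    ...
```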

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nanjiangwill changed the title "Filter zero-advantage samples; split rollout/train logging boundary" → "filter zero-advantage samples; split rollout/train logging boundary" May 11, 2026
@zhuzilin requested a review from Copilot May 11, 2026 08:35
Copilot AI (Contributor) left a comment


Pull request overview

This PR optimizes rollout→train data flow by (1) adding an option to drop zero-advantage samples (with padding back to dp_size when needed) and (2) moving rollout-derived aggregate metrics (raw_reward, rewards, response_lengths, total_lengths) to be logged on the rollout side so each W&B key has a single writer. It also updates the plugin hook contract so custom convert_samples_to_train_data implementations receive (samples, raw_rewards, rewards).

Changes:

  • Add --filter-zero-advantage-samples (requires --use-dynamic-global-batch-size) and apply filtering/padding before conversion to train data.
  • Split rollout vs train-side logging responsibility by logging reward/length aggregates in RolloutManager._log_rollout_data and skipping them in Megatron-side rollout logging.
  • Update custom convert hook signature and its contract test to accept (args, samples, raw_rewards, rewards).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Reviewed files:

  • tests/plugin_contracts/test_plugin_runtime_hook_contracts.py — Updates the plugin contract test for the breaking hook signature change (the convert hook now receives rewards inputs).
  • slime/utils/arguments.py — Adds the CLI flag plus validation for zero-advantage filtering; updates help text for the convert hook signature.
  • slime/ray/rollout.py — Computes rewards earlier, adds zero-advantage filtering/padding, refactors the conversion signature, and moves rollout aggregates into rollout-side logging.
  • slime/backends/megatron_utils/data.py — Prevents duplicate W&B writers by skipping rollout-source aggregate keys that are now logged in slime/ray/rollout.py.


Comment thread on slime/ray/rollout.py (outdated), lines +1242 to +1243:

```python
sample.metadata["raw_reward"] if sample.metadata and "raw_reward" in sample.metadata else sample.reward
for sample in samples
```
Comment thread on slime/ray/rollout.py (outdated), lines +493 to +500:

```python
if self.args.filter_zero_advantage_samples:
    data, raw_rewards, rewards = self._filter_zero_advantage_samples(data, raw_rewards, rewards)
    self._dynamic_global_batch_size = self._compute_dynamic_global_batch_size(len(data))

if self.custom_convert_samples_to_train_data_func is not None:
    data = self.custom_convert_samples_to_train_data_func(self.args, data, raw_rewards, rewards)
else:
    data = self._convert_samples_to_train_data(data, raw_rewards, rewards)
```
zhuzilin and others added 2 commits May 11, 2026 15:46
Instead of dropping zero-advantage samples and dealing with dp_size
partition complexity, mutate them in place: replace tokens with a
2-token pad sequence, set response_length=1, zero the loss mask. Forward
pass becomes trivially cheap and gradient is zero, while sample count
stays constant so all downstream partitioning works unchanged.

Drops the --filter-zero-advantage-samples flag (now default) and its
validator. Adds rollout/zero_advantage_ratio to track the underlying
policy state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nanjiangwill changed the title "filter zero-advantage samples; split rollout/train logging boundary" → "Neutralize zero-advantage samples; split rollout/train logging boundary" May 11, 2026
For samples whose post-processed reward is 0, replace tokens with a
2-token pad sequence and zero the loss mask. Forward pass is cheap and
gradient is zero. Sample count is unchanged so downstream partitioning
is untouched. Logs rollout/neutralized_ratio per rollout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nanjiangwill changed the title "Neutralize zero-advantage samples; split rollout/train logging boundary" → "Neutralize zero-advantage samples to skip wasted forward compute" May 11, 2026
nanjiangwill and others added 4 commits May 11, 2026 10:15
…to main

Earlier restore happened against a stale fetch, inadvertently reverting
the upstream rollout_data_postprocess CI fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
loss_mask=[0] already gives the same effect during training; remove_sample
would just retrigger the same zeroing inside _convert_samples_to_train_data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing convert loop re-zeros loss_mask when remove_sample is set.
Keeping both protects the invariant against future refactors or plugin
paths that might reset loss_mask between this loop and the convert loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
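
A sketch of that belt-and-braces re-zeroing (field names assumed; the real convert loop may differ):

```python
for sample in data:
    if getattr(sample, "remove_sample", False):
        # Re-zero even if a plugin path reset loss_mask between the
        # neutralization loop and this convert loop.
        sample.loss_mask = [0] * len(sample.loss_mask)
```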
Only these estimators compute per-token advantage as a scalar broadcast
of rewards (advantage = rewards * ones). For them, r==0 implies the
sample contributes zero gradient and is safe to neutralize. ppo (GAE)
and reinforce_plus_plus (token-level kl rewards) can have non-zero
advantage even when r==0, so they are excluded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
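
A toy check of the zero-gradient claim, assuming a policy-gradient loss of the form sum(advantage * logprob); slime's real estimator code is not shown here:

```python
import torch

# advantage = rewards * ones: the scalar reward is broadcast over the response.
def broadcast_advantage(reward: float, response_length: int) -> torch.Tensor:
    return reward * torch.ones(response_length)

logprobs = torch.randn(8, requires_grad=True)   # stand-in for token log-probs
adv = broadcast_advantage(0.0, 8)               # r == 0 -> all-zero advantages
loss = (adv * logprobs).sum()
loss.backward()
assert torch.all(logprobs.grad == 0)            # the sample adds no gradient
# GAE (ppo) and reinforce_plus_plus mix in values / token-level KL rewards,
# so r == 0 does not imply zero advantage there; hence their exclusion.
```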