Dkorzekwa/any model other models #1007
danielkorzekwa wants to merge 73 commits into feature/puzzletron from
Conversation
- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
…s merged) Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py (1)
195-208: ⚠️ Potential issue | 🔴 Critical
Don't bypass the descriptor's trust_remote_code policy.
Lines 207-208 derive the policy from the descriptor, but Line 316 ignores it and hardcodes the AutoModel fallback to the permissive path. That reintroduces the same security gap the config-loading path already avoids.
🔒 Suggested fix
@@
 runtime = SimpleNamespace(
     device=torch.device(dist.local_rank()),
     dtype=torch.bfloat16,
     global_rank=dist.rank(),
     world_size=dist.size(),
     is_main_process=dist.is_master(),
     is_last_process=dist.is_last_process(),
     use_autocast=True,  # Default: use autocast; descriptor can override
+    trust_remote_code=descriptor.requires_trust_remote_code(),
 )
@@
 with runtime.device:
     if model_config is None:
-        trust_remote_code = descriptor.requires_trust_remote_code()
-        model_config = load_model_config(checkpoint_path, trust_remote_code=trust_remote_code)
+        model_config = load_model_config(
+            checkpoint_path, trust_remote_code=runtime.trust_remote_code
+        )
@@
     if model_class is AutoModelForCausalLM:
-        model = model_class.from_config(model_config, trust_remote_code=True)
+        model = model_class.from_config(
+            model_config, trust_remote_code=runtime.trust_remote_code
+        )
As per coding guidelines, trust_remote_code=True must be caller-configurable and default to False, not hardcoded.
Also applies to: 311-317
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py` around lines 195 - 208, The code currently ignores the descriptor's trust_remote_code policy when falling back to AutoModel, reintroducing a security risk; change the AutoModel.from_config (and any other Auto* fallback) calls to accept and propagate a trust_remote_code flag derived from descriptor.requires_trust_remote_code() (the same value used when calling load_model_config), ensuring the default remains False and caller-configurable; locate uses around the model_config loading path (where trust_remote_code is computed) and the AutoModel.from_config fallback block and pass that trust_remote_code variable through instead of hardcoding True.
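The pattern the review asks for can be sketched in isolation. The names below (`DescriptorStub`, `load_model`, `fake_from_config`) are hypothetical stand-ins; in the real code the callable would be `AutoModelForCausalLM.from_config`, and the descriptor is the project's own model descriptor. The point is that the flag is derived once from the descriptor, defaults to False, and is propagated instead of being hardcoded to True:

```python
class DescriptorStub:
    """Hypothetical stand-in for the model descriptor."""

    def __init__(self, trust_remote_code: bool = False):
        self._trust = trust_remote_code

    def requires_trust_remote_code(self) -> bool:
        return self._trust


def load_model(descriptor, from_config):
    # Derive the policy once from the descriptor and propagate it,
    # rather than hardcoding trust_remote_code=True in a fallback path.
    trust_remote_code = descriptor.requires_trust_remote_code()
    return from_config(trust_remote_code=trust_remote_code)


calls = []

def fake_from_config(trust_remote_code):
    # Records what flag the loader actually passed through.
    calls.append(trust_remote_code)
    return "model"


load_model(DescriptorStub(), fake_from_config)      # default stays False
load_model(DescriptorStub(True), fake_from_config)  # explicit opt-in only
print(calls)  # [False, True]
```

The caller controls the permissive path explicitly; nothing downstream can widen it on its own.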
🧹 Nitpick comments (2)
modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py (1)
46-46: Avoid mutating caller-provided kwargs in place.
Line 46 mutates activation_hooks_kwargs directly. If the same dict is reused by the caller, state leaks across calls.
Suggested fix
- activation_hooks_kwargs["model"] = model
+ base_activation_hooks_kwargs = {**activation_hooks_kwargs, "model": model}
@@
- curr_activation_hooks_kwargs = {
-     **activation_hooks_kwargs,
+ curr_activation_hooks_kwargs = {
+     **base_activation_hooks_kwargs,
      "block_config": block_config,
  }
🤖 Prompt for AI Agents
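A minimal illustration of why the copy matters. The function names here are hypothetical; they only contrast the two behaviors the nitpick describes:

```python
def register_mutating(activation_hooks_kwargs, model):
    # The pattern being flagged: the caller's dict is modified in place.
    activation_hooks_kwargs["model"] = model
    return activation_hooks_kwargs


def register_copying(activation_hooks_kwargs, model):
    # The suggested pattern: build a shallow copy, leave the caller's dict alone.
    return {**activation_hooks_kwargs, "model": model}


caller_kwargs = {"pattern": "*"}

register_copying(caller_kwargs, model="m1")
print("model" in caller_kwargs)  # False: original dict unchanged

register_mutating(caller_kwargs, model="m2")
print("model" in caller_kwargs)  # True: state leaked into the caller's dict
```

If the caller reuses the same kwargs dict for several models, only the copying variant keeps each call independent.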
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py` at line 46, Don't mutate the caller-provided activation_hooks_kwargs in place; instead create a shallow copy of activation_hooks_kwargs, set the "model" key on that copy, and use the copy for downstream use so the original dict passed by the caller is not modified. Locate the code that sets activation_hooks_kwargs["model"] = model and replace it with creating a new dict from activation_hooks_kwargs (e.g., via copy or dict()) then assign new_kwargs["model"] = model and use new_kwargs wherever activation_hooks_kwargs would have been used. Ensure variable names like activation_hooks_kwargs and model are used to find the exact spot.
modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py (1)
286-293: Also assert that buffers are off meta.
init_rotary_embedding() is responsible for recreating non-checkpointed buffers, but this postcondition only inspects parameters. A missed buffer initialization will pass here and fail later during forward.
🛡️ Suggested hardening
- params_on_meta_device = [
-     param_name
-     for param_name, param in model_shard.named_parameters()
-     if param.device == torch.device("meta")
- ]
- assert len(params_on_meta_device) == 0, (
-     f"[global_rank={runtime.global_rank}] Couldn't load params {params_on_meta_device}"
- )
+ tensors_on_meta_device = [
+     *[
+         f"param:{name}"
+         for name, param in model_shard.named_parameters()
+         if param.device == torch.device("meta")
+     ],
+     *[
+         f"buffer:{name}"
+         for name, buffer in model_shard.named_buffers()
+         if buffer.device == torch.device("meta")
+     ],
+ ]
+ assert not tensors_on_meta_device, (
+     f"[global_rank={runtime.global_rank}] Couldn't materialize tensors {tensors_on_meta_device}"
+ )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py` around lines 286 - 293, The postcondition currently only checks that parameters returned by model_shard.named_parameters() are not on the meta device (params_on_meta_device), but buffers can also remain on meta and later break forward; update the check to also iterate model_shard.named_buffers() (e.g., buffers_on_meta_device) and assert none are on torch.device("meta") after init_rotary_embedding() has run, including both parameter and buffer names in the error message (use runtime.global_rank for context) so any missed buffer initialization is detected early.
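The hardened postcondition can be shown without torch. `FakeShard` and `FakeTensor` below are hypothetical stand-ins for an nn.Module and its tensors (named_parameters/named_buffers mirror the torch API shape only); the logic is the same two-sided sweep the suggested fix performs:

```python
class FakeTensor:
    """Stand-in for a torch tensor; only carries a device tag."""

    def __init__(self, device):
        self.device = device


class FakeShard:
    """Stand-in for an nn.Module exposing named_parameters/named_buffers."""

    def __init__(self, params, buffers):
        self._params, self._buffers = params, buffers

    def named_parameters(self):
        return self._params.items()

    def named_buffers(self):
        return self._buffers.items()


def tensors_on_meta(shard):
    # Collect BOTH parameters and buffers still on the meta device,
    # labeled so the error message says which kind was missed.
    return [
        *[f"param:{n}" for n, p in shard.named_parameters() if p.device == "meta"],
        *[f"buffer:{n}" for n, b in shard.named_buffers() if b.device == "meta"],
    ]


shard = FakeShard(
    params={"layer.weight": FakeTensor("cuda:0")},
    buffers={"rotary.inv_freq": FakeTensor("meta")},  # a missed buffer init
)
print(tensors_on_meta(shard))  # ['buffer:rotary.inv_freq']
```

A parameters-only check would report this shard as clean and the failure would surface later, mid-forward, exactly as the review warns.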
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 19b771aa-f140-4183-a756-b89f4e02048b
📒 Files selected for processing (2)
modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py
modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py
block_config = None
if block_idx is not None:
    block_config = model.config.block_configs[block_idx]
Guard block_config lookup before indexing.
Line 69 assumes model.config.block_configs exists and that block_idx is in range. That can raise AttributeError/IndexError at runtime for incompatible configs.
Suggested fix
- block_config = None
- if block_idx is not None:
- block_config = model.config.block_configs[block_idx]
+ block_config = None
+ if block_idx is not None:
+ block_configs = getattr(getattr(model, "config", None), "block_configs", None)
+ if block_configs is None or not (0 <= block_idx < len(block_configs)):
+ raise ValueError(
+ f"Invalid block_idx={block_idx} for model block_configs."
+ )
+ block_config = block_configs[block_idx]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py`
around lines 67 - 69, The current block_config lookup assumes
model.config.block_configs exists and block_idx is valid; guard this by first
checking hasattr/getattr for model.config.block_configs (or that
getattr(model.config, "block_configs", None) is not None), verify block_idx is
an int and within range (0 <= block_idx < len(model.config.block_configs))
before indexing, and fall back to None if any check fails; update the
block_config assignment logic around block_config/block_idx to perform these
validations to avoid AttributeError/IndexError.
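The guarded lookup can be sketched as below. Note the two variants in this thread differ: the inline suggested fix raises ValueError on a bad index, while the agent prompt asks for a fall-back to None; this sketch follows the fall-back variant. `resolve_block_config` is a hypothetical helper name, and the model objects are stubs:

```python
from types import SimpleNamespace


def resolve_block_config(model, block_idx):
    # Guarded variant: validate that config.block_configs exists and that
    # block_idx is an in-range int before indexing; otherwise return None.
    block_config = None
    if block_idx is not None:
        block_configs = getattr(getattr(model, "config", None), "block_configs", None)
        if (
            block_configs is not None
            and isinstance(block_idx, int)
            and 0 <= block_idx < len(block_configs)
        ):
            block_config = block_configs[block_idx]
    return block_config


model = SimpleNamespace(config=SimpleNamespace(block_configs=["cfg0", "cfg1"]))
print(resolve_block_config(model, 1))     # cfg1
print(resolve_block_config(model, 5))     # None (index out of range)
print(resolve_block_config(object(), 0))  # None (no config/block_configs)
```

Either variant avoids the raw AttributeError/IndexError; which one is right depends on whether a missing block_config is a recoverable state in the hook setup.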
if len(activation_hooks) == 0:
    # In distributed mode, it's okay for a rank to have 0 hooks if it doesn't own
    # the target modules (e.g., with hybrid patterns like "*-" where different
    # ranks own different layer types). However, we still want to catch real bugs
    # where no hooks are found at all.
    is_distributed = torch.distributed.is_available() and torch.distributed.is_initialized()
    if is_distributed:
        aprint(
            "No hooks registered on this rank. This is expected if this rank "
            "doesn't own any layers matching the hook pattern (e.g., in hybrid "
            "patterns with distributed model sharding)."
        )
    else:
        raise ValueError("couldn't find any hooks")
Distributed mode can silently proceed with zero hooks globally.
On Lines 85-90, distributed runs only log when local hooks are zero. If all ranks have zero hooks, this returns successfully and downstream scoring can run without instrumentation.
Suggested fix
if len(activation_hooks) == 0:
@@
- if is_distributed:
- aprint(
- "No hooks registered on this rank. This is expected if this rank "
- "doesn't own any layers matching the hook pattern (e.g., in hybrid "
- "patterns with distributed model sharding)."
- )
+ if is_distributed:
+ local_count = torch.tensor([0], device="cuda" if torch.cuda.is_available() else "cpu")
+ global_count = local_count.clone()
+ torch.distributed.all_reduce(global_count, op=torch.distributed.ReduceOp.SUM)
+ if global_count.item() == 0:
+ raise ValueError("couldn't find any hooks on any distributed rank")
+ aprint(
+ "No hooks registered on this rank. This is expected if this rank "
+ "doesn't own any layers matching the hook pattern (e.g., in hybrid "
+ "patterns with distributed model sharding)."
+ )
else:
raise ValueError("couldn't find any hooks")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py`
around lines 79 - 93, The current check only logs when the local rank has zero
activation_hooks, which lets a distributed job proceed if every rank has zero
hooks; change the logic so that when torch.distributed.is_available() and
torch.distributed.is_initialized() you compute the global sum of
len(activation_hooks) across ranks (e.g., create a tensor from
len(activation_hooks) and use torch.distributed.all_reduce) and then raise
ValueError if the global sum is 0; keep the existing local informational aprint
when local len is 0 but ensure the global check prevents silent success when all
ranks found zero hooks.
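The decision logic of the fix can be separated from the collective itself. In the sketch below, `global_sum` stands in for the result of torch.distributed.all_reduce over the per-rank hook counts (a plain sum here, so no process group is needed); `validate_hook_counts` is a hypothetical helper name:

```python
def validate_hook_counts(local_count: int, global_sum: int) -> None:
    # local_count: hooks registered on this rank.
    # global_sum: sum of hook counts across all ranks (what all_reduce returns).
    if local_count == 0:
        if global_sum == 0:
            # Every rank found zero hooks: a real bug, fail loudly.
            raise ValueError("couldn't find any hooks on any distributed rank")
        # This rank just doesn't own any matching layers: informational only.
        print("No hooks on this rank; other ranks own the matching layers.")


per_rank_counts = [0, 3, 2]  # rank 0 has no hooks, other ranks do
validate_hook_counts(per_rank_counts[0], sum(per_rank_counts))  # logs, no error

try:
    validate_hook_counts(0, 0)  # all ranks found zero hooks
except ValueError as e:
    print(e)  # couldn't find any hooks on any distributed rank
```

The key property is that the error condition is global, not local: a zero on one rank is fine, a zero everywhere is not.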
# On the last rank with tied embeddings, we kept embed_tokens in create_local_shard_()
# just to load the weight and tie it to lm_head. Now replace it with a dummy so it
# doesn't interfere with the pipeline forward pass (only rank 0 should run embed_tokens).
if model_config.tie_word_embeddings and has_last_block and not has_first_block:
    set_submodule(
        model_shard,
        descriptor.input_embedding_name(),
        DummyWTE(model_config.hidden_size, dtype=runtime.dtype),
    )
Use the LM sub-config when recreating the dummy input embedding.
This branch bypasses the descriptor indirection you already use at Lines 119-120 and 210. On nested configs, model_config.hidden_size is not guaranteed to exist, so the last-rank tied-embedding path can fail even though the rest of the sharding code is descriptor-aware.
🧩 Suggested fix
if model_config.tie_word_embeddings and has_last_block and not has_first_block:
+ lm_config = descriptor.get_language_model_config(model_config)
set_submodule(
model_shard,
descriptor.input_embedding_name(),
- DummyWTE(model_config.hidden_size, dtype=runtime.dtype),
+ DummyWTE(lm_config.hidden_size, dtype=runtime.dtype),
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py` around lines 261
- 269, The dummy input-embedding creation uses model_config.hidden_size directly
which breaks for nested configs; change it to obtain the LM sub-config via the
same descriptor indirection used earlier (the descriptor used at lines ~119-120
and ~210) and use that sub-config's hidden_size when constructing DummyWTE in
the branch guarded by model_config.tie_word_embeddings, has_last_block and not
has_first_block; keep using descriptor.input_embedding_name(), set_submodule,
DummyWTE and runtime.dtype as before but replace model_config.hidden_size with
the LM sub-config's hidden_size (e.g., lm_cfg.hidden_size).
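The descriptor indirection can be sketched with stubs. `get_language_model_config` is the method name used in the suggested fix; `DescriptorStub` and the `text_config` attribute are illustrative assumptions about how a nested (e.g., VLM) config might carry its language-model sub-config:

```python
from types import SimpleNamespace


class DescriptorStub:
    """Hypothetical descriptor: resolves the LM sub-config of a model config."""

    def get_language_model_config(self, model_config):
        # On nested configs the LM sub-config (assumed here to live under
        # text_config) carries hidden_size; flat configs carry it directly.
        return getattr(model_config, "text_config", model_config)


flat_config = SimpleNamespace(hidden_size=4096)
nested_config = SimpleNamespace(text_config=SimpleNamespace(hidden_size=2048))

descriptor = DescriptorStub()
print(descriptor.get_language_model_config(flat_config).hidden_size)    # 4096
print(descriptor.get_language_model_config(nested_config).hidden_size)  # 2048
```

Reading model_config.hidden_size directly works only for the flat case; going through the descriptor keeps the dummy-embedding path consistent with the rest of the descriptor-aware sharding code.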
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
##           feature/puzzletron    #1007    +/-   ##
======================================================
+ Coverage               72.10%   72.12%   +0.02%
======================================================
  Files                     209      209
  Lines                   23628    23628
======================================================
+ Hits                    17036    17042        +6
+ Misses                   6592     6586        -6
☔ View full report in Codecov by Sentry.
…zzletron (nemotron-3-nano-30b-a3b-base-bf16) Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
… if now test_puzzletron.py will be repeatable. Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
why do we call this nemotron_h and not nemotron_h_v3? Do we know if this will be the same for v4 as well?
the names are changing so fast; I added a TODO to unify it.
Why is this qwen3_8b and not qwen3? All other models have a generic converter, not one specific to a particular variant.
A model descriptor can be specific, and sometimes within the same model family there can be differences across sizes, e.g., in how model weights are named or structured. This one was only tested on Qwen3 8B, hence the name for now.
same comment - why qwen3_vl_30b and not qwen3_vl?
because it was only tested on this particular model
if rank == 1:
    # Log all inputs to values_to_node
    aprint(
        f"[DEBUG PASSAGE] Rank {rank}: Before passage forward, values_to_node has {len(values_to_node)} entries"
do we still need all these debug messages?
it was for test_puzzletron debugging, I removed those debugging commits from repo
try:
    from modelopt.torch.puzzletron.tools.logger import aprint
except ImportError:
    # Fallback if logger is not available
    def aprint(msg: str | None):
        if torch.distributed.is_initialized():
            rank = torch.distributed.get_rank()
            print(f"[Rank {rank}] {msg}", flush=True)
        else:
            print(msg, flush=True)
can we instead just import it locally in the function instead of re-defining it again?
it was for debugging - removed
child_model = model_class.from_config(child_model_config, trust_remote_code=descriptor.requires_trust_remote_code())
state_dict_inf_count += inf_count
if nan_count > 0 or inf_count > 0:
    aprint(
        f"[DEBUG LOAD] Rank {rank_to_check}: State dict key '{key}' contains "
do we need these debug statements anymore?
it was a temp debug, removed
model = model_class.from_config(model_config, trust_remote_code=descriptor.requires_trust_remote_code())
# This prevents NaN values in uninitialized parameters (e.g., backbone.layers.1.mixer.gate.weight
# in nemotron-3-nano-30b-a3b-base-bf16) that can occur with from_config on RTX GPU cards (not on H100)
…reproducible on CI) Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
What does this PR do?
Merging dkorzekwa/any_model_other_models into dkorzekwa/mip_and_realize_models - this MR is only for reviewing. Ultimately dkorzekwa/any_model_other_models should be merged into feature/puzzletron once dkorzekwa/mip_and_realize_models is merged there.
Summary by CodeRabbit
New Features
Tests