Dkorzekwa/any model other models #1007

Open

danielkorzekwa wants to merge 73 commits into feature/puzzletron from dkorzekwa/any_model_other_models

Conversation


@danielkorzekwa danielkorzekwa commented Mar 9, 2026

What does this PR do?

Merging dkorzekwa/any_model_other_models into dkorzekwa/mip_and_realize_models; this PR is for review only. Ultimately, dkorzekwa/any_model_other_models should be merged into feature/puzzletron once dkorzekwa/mip_and_realize_models is merged there.

Summary by CodeRabbit

  • New Features

    • Added support for multiple model architectures: Mistral Small, Nemotron H, Nemotron H v2, Qwen2, Qwen3 8B, and Qwen3 VL 30B.
    • Introduced new pruning configurations and optimization pipelines for supported models.
    • Added comprehensive model descriptor framework enabling automated weight conversion and configuration handling.
    • Extended support for Mixture of Experts (MoE) models with expert removal pruning capabilities.
  • Tests

    • Enhanced test coverage with parametrized configurations for multiple model variants.

- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@danielkorzekwa danielkorzekwa requested a review from a team as a code owner March 9, 2026 16:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py (1)

195-208: ⚠️ Potential issue | 🔴 Critical

Don't bypass the descriptor's trust_remote_code policy.

Lines 207-208 derive the policy from the descriptor, but Line 316 ignores it and hardcodes the AutoModel fallback to the permissive path. That reintroduces the same security gap the config-loading path already avoids.

🔒 Suggested fix
@@
     runtime = SimpleNamespace(
         device=torch.device(dist.local_rank()),
         dtype=torch.bfloat16,
         global_rank=dist.rank(),
         world_size=dist.size(),
         is_main_process=dist.is_master(),
         is_last_process=dist.is_last_process(),
         use_autocast=True,  # Default: use autocast; descriptor can override
+        trust_remote_code=descriptor.requires_trust_remote_code(),
     )
@@
     with runtime.device:
         if model_config is None:
-            trust_remote_code = descriptor.requires_trust_remote_code()
-            model_config = load_model_config(checkpoint_path, trust_remote_code=trust_remote_code)
+            model_config = load_model_config(
+                checkpoint_path, trust_remote_code=runtime.trust_remote_code
+            )
@@
         if model_class is AutoModelForCausalLM:
-            model = model_class.from_config(model_config, trust_remote_code=True)
+            model = model_class.from_config(
+                model_config, trust_remote_code=runtime.trust_remote_code
+            )

As per coding guidelines, trust_remote_code=True must be caller-configurable and default to False, not hardcoded.

Also applies to: 311-317

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py` around lines 195
- 208, The code currently ignores the descriptor's trust_remote_code policy when
falling back to AutoModel, reintroducing a security risk; change the
AutoModel.from_config (and any other Auto* fallback) calls to accept and
propagate a trust_remote_code flag derived from
descriptor.requires_trust_remote_code() (the same value used when calling
load_model_config), ensuring the default remains False and caller-configurable;
locate uses around the model_config loading path (where trust_remote_code is
computed) and the AutoModel.from_config fallback block and pass that
trust_remote_code variable through instead of hardcoding True.
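The fix above boils down to one pattern: derive the trust_remote_code policy once from the descriptor and thread it through every loading path, never hardcoding True. A minimal torch-free sketch under that assumption, where `load_model_config` and `model_class` are injected stand-ins for the real Transformers calls:

```python
# Sketch of the reviewed pattern: a single caller-configurable
# trust_remote_code flag (defaulting to the descriptor's policy) is reused
# by both the config-loading path and the AutoModel-style fallback.
# `load_model_config` and `model_class` are stand-ins, not the real API.

def build_model(checkpoint_path, descriptor, load_model_config, model_class,
                model_config=None):
    # Derive the policy once, from the descriptor, and reuse it everywhere.
    trust_remote_code = descriptor.requires_trust_remote_code()
    if model_config is None:
        model_config = load_model_config(
            checkpoint_path, trust_remote_code=trust_remote_code
        )
    # The fallback path uses the same flag; never a hardcoded True.
    return model_class.from_config(model_config,
                                   trust_remote_code=trust_remote_code)
```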
🧹 Nitpick comments (2)
modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py (1)

46-46: Avoid mutating caller-provided kwargs in place.

Line 46 mutates activation_hooks_kwargs directly. If the same dict is reused by the caller, state leaks across calls.

Suggested fix
-    activation_hooks_kwargs["model"] = model
+    base_activation_hooks_kwargs = {**activation_hooks_kwargs, "model": model}
@@
-        curr_activation_hooks_kwargs = {
-            **activation_hooks_kwargs,
+        curr_activation_hooks_kwargs = {
+            **base_activation_hooks_kwargs,
             "block_config": block_config,
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py` at
line 46, Don't mutate the caller-provided activation_hooks_kwargs in place;
instead create a shallow copy of activation_hooks_kwargs, set the "model" key on
that copy, and use the copy for downstream use so the original dict passed by
the caller is not modified. Locate the code that sets
activation_hooks_kwargs["model"] = model and replace it with creating a new dict
from activation_hooks_kwargs (e.g., via copy or dict()) then assign
new_kwargs["model"] = model and use new_kwargs wherever activation_hooks_kwargs
would have been used. Ensure variable names like activation_hooks_kwargs and
model are used to find the exact spot.
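The state leak the nitpick describes is easy to reproduce without any model code. A minimal sketch, with `register_hooks_*` as illustrative stand-ins rather than the real API:

```python
# Mutating a caller's kwargs dict leaks state across calls; building a
# merged copy leaves the caller's dict untouched.

def register_hooks_mutating(model, activation_hooks_kwargs):
    activation_hooks_kwargs["model"] = model  # caller's dict is modified
    return activation_hooks_kwargs

def register_hooks_copying(model, activation_hooks_kwargs):
    # Shallow-merge into a new dict instead of mutating in place.
    return {**activation_hooks_kwargs, "model": model}

shared = {"pattern": "*"}
register_hooks_mutating("model_a", shared)
assert "model" in shared          # state leaked into the caller's dict

shared = {"pattern": "*"}
merged = register_hooks_copying("model_b", shared)
assert "model" not in shared      # original preserved
assert merged["model"] == "model_b"
```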
modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py (1)

286-293: Also assert that buffers are off meta.

init_rotary_embedding() is responsible for recreating non-checkpointed buffers, but this postcondition only inspects parameters. A missed buffer initialization will pass here and fail later during forward.

🛡️ Suggested hardening
-    params_on_meta_device = [
-        param_name
-        for param_name, param in model_shard.named_parameters()
-        if param.device == torch.device("meta")
-    ]
-    assert len(params_on_meta_device) == 0, (
-        f"[global_rank={runtime.global_rank}]  Couldn't load params {params_on_meta_device}"
-    )
+    tensors_on_meta_device = [
+        *[
+            f"param:{name}"
+            for name, param in model_shard.named_parameters()
+            if param.device == torch.device("meta")
+        ],
+        *[
+            f"buffer:{name}"
+            for name, buffer in model_shard.named_buffers()
+            if buffer.device == torch.device("meta")
+        ],
+    ]
+    assert not tensors_on_meta_device, (
+        f"[global_rank={runtime.global_rank}] Couldn't materialize tensors {tensors_on_meta_device}"
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py` around lines 286
- 293, The postcondition currently only checks that parameters returned by
model_shard.named_parameters() are not on the meta device
(params_on_meta_device), but buffers can also remain on meta and later break
forward; update the check to also iterate model_shard.named_buffers() (e.g.,
buffers_on_meta_device) and assert none are on torch.device("meta") after
init_rotary_embedding() has run, including both parameter and buffer names in
the error message (use runtime.global_rank for context) so any missed buffer
initialization is detected early.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 19b771aa-f140-4183-a756-b89f4e02048b

📥 Commits

Reviewing files that changed from the base of the PR and between be825bc and 7d7b609.

📒 Files selected for processing (2)
  • modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py
  • modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py

Comment on lines +67 to +69
block_config = None
if block_idx is not None:
    block_config = model.config.block_configs[block_idx]

⚠️ Potential issue | 🟠 Major

Guard block_config lookup before indexing.

Line 69 assumes model.config.block_configs exists and that block_idx is in range. That can raise AttributeError/IndexError at runtime for incompatible configs.

Suggested fix
-        block_config = None
-        if block_idx is not None:
-            block_config = model.config.block_configs[block_idx]
+        block_config = None
+        if block_idx is not None:
+            block_configs = getattr(getattr(model, "config", None), "block_configs", None)
+            if block_configs is None or not (0 <= block_idx < len(block_configs)):
+                raise ValueError(
+                    f"Invalid block_idx={block_idx} for model block_configs."
+                )
+            block_config = block_configs[block_idx]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py`
around lines 67 - 69, The current block_config lookup assumes
model.config.block_configs exists and block_idx is valid; guard this by first
checking hasattr/getattr for model.config.block_configs (or that
getattr(model.config, "block_configs", None) is not None), verify block_idx is
an int and within range (0 <= block_idx < len(model.config.block_configs))
before indexing, and fall back to None if any check fails; update the
block_config assignment logic around block_config/block_idx to perform these
validations to avoid AttributeError/IndexError.
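The guarded lookup can be sketched standalone. Here `SimpleNamespace` stands in for the model and its config, and `resolve_block_config` is a hypothetical helper name:

```python
# Only index block_configs when the attribute exists and block_idx is a
# valid position; otherwise raise a clear ValueError instead of an
# AttributeError/IndexError deep inside hook registration.
from types import SimpleNamespace

def resolve_block_config(model, block_idx):
    if block_idx is None:
        return None
    block_configs = getattr(getattr(model, "config", None), "block_configs", None)
    if block_configs is None or not (0 <= block_idx < len(block_configs)):
        raise ValueError(f"Invalid block_idx={block_idx} for model block_configs.")
    return block_configs[block_idx]

model = SimpleNamespace(config=SimpleNamespace(block_configs=["b0", "b1"]))
assert resolve_block_config(model, 1) == "b1"
assert resolve_block_config(model, None) is None
try:
    resolve_block_config(model, 5)
    raise AssertionError("expected ValueError")
except ValueError:
    pass  # out-of-range index is rejected with a clear message
```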

Comment on lines +79 to +93
if len(activation_hooks) == 0:
    # In distributed mode, it's okay for a rank to have 0 hooks if it doesn't own
    # the target modules (e.g., with hybrid patterns like "*-" where different
    # ranks own different layer types). However, we still want to catch real bugs
    # where no hooks are found at all.
    is_distributed = torch.distributed.is_available() and torch.distributed.is_initialized()
    if is_distributed:
        aprint(
            "No hooks registered on this rank. This is expected if this rank "
            "doesn't own any layers matching the hook pattern (e.g., in hybrid "
            "patterns with distributed model sharding)."
        )
    else:
        raise ValueError("couldn't find any hooks")


⚠️ Potential issue | 🟠 Major

Distributed mode can silently proceed with zero hooks globally.

On Lines 85-90, distributed runs only log when local hooks are zero. If all ranks have zero hooks, this returns successfully and downstream scoring can run without instrumentation.

Suggested fix
     if len(activation_hooks) == 0:
@@
-        if is_distributed:
-            aprint(
-                "No hooks registered on this rank. This is expected if this rank "
-                "doesn't own any layers matching the hook pattern (e.g., in hybrid "
-                "patterns with distributed model sharding)."
-            )
+        if is_distributed:
+            local_count = torch.tensor([0], device="cuda" if torch.cuda.is_available() else "cpu")
+            global_count = local_count.clone()
+            torch.distributed.all_reduce(global_count, op=torch.distributed.ReduceOp.SUM)
+            if global_count.item() == 0:
+                raise ValueError("couldn't find any hooks on any distributed rank")
+            aprint(
+                "No hooks registered on this rank. This is expected if this rank "
+                "doesn't own any layers matching the hook pattern (e.g., in hybrid "
+                "patterns with distributed model sharding)."
+            )
         else:
             raise ValueError("couldn't find any hooks")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/activation_scoring/activation_hooks/utils.py`
around lines 79 - 93, The current check only logs when the local rank has zero
activation_hooks, which lets a distributed job proceed if every rank has zero
hooks; change the logic so that when torch.distributed.is_available() and
torch.distributed.is_initialized() you compute the global sum of
len(activation_hooks) across ranks (e.g., create a tensor from
len(activation_hooks) and use torch.distributed.all_reduce) and then raise
ValueError if the global sum is 0; keep the existing local informational aprint
when local len is 0 but ensure the global check prevents silent success when all
ranks found zero hooks.
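The essence of the fix is that the failure condition must be global, not local. A single-process sketch of that logic, with the per-rank counts passed in explicitly to mirror what an all_reduce over `len(activation_hooks)` would compute:

```python
# Fail only when the *global* sum of hook counts is zero; a rank with zero
# local hooks merely logs, as in the suggested fix.

def validate_hook_counts(per_rank_hook_counts):
    if sum(per_rank_hook_counts) == 0:
        raise ValueError("couldn't find any hooks on any distributed rank")
    for rank, count in enumerate(per_rank_hook_counts):
        if count == 0:
            print(f"rank {rank}: no hooks registered on this rank (may be expected)")

validate_hook_counts([0, 4, 0, 2])  # fine: some ranks own hooks
try:
    validate_hook_counts([0, 0, 0, 0])
    raise AssertionError("expected ValueError")
except ValueError:
    pass  # all ranks empty is a hard failure, not a silent success
```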

Comment on lines +261 to +269
# On the last rank with tied embeddings, we kept embed_tokens in create_local_shard_()
# just to load the weight and tie it to lm_head. Now replace it with a dummy so it
# doesn't interfere with the pipeline forward pass (only rank 0 should run embed_tokens).
if model_config.tie_word_embeddings and has_last_block and not has_first_block:
    set_submodule(
        model_shard,
        descriptor.input_embedding_name(),
        DummyWTE(model_config.hidden_size, dtype=runtime.dtype),
    )

⚠️ Potential issue | 🟠 Major

Use the LM sub-config when recreating the dummy input embedding.

This branch bypasses the descriptor indirection you already use at Lines 119-120 and 210. On nested configs, model_config.hidden_size is not guaranteed to exist, so the last-rank tied-embedding path can fail even though the rest of the sharding code is descriptor-aware.

🧩 Suggested fix
             if model_config.tie_word_embeddings and has_last_block and not has_first_block:
+                lm_config = descriptor.get_language_model_config(model_config)
                 set_submodule(
                     model_shard,
                     descriptor.input_embedding_name(),
-                    DummyWTE(model_config.hidden_size, dtype=runtime.dtype),
+                    DummyWTE(lm_config.hidden_size, dtype=runtime.dtype),
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py` around lines 261
- 269, The dummy input-embedding creation uses model_config.hidden_size directly
which breaks for nested configs; change it to obtain the LM sub-config via the
same descriptor indirection used earlier (the descriptor used at lines ~119-120
and ~210) and use that sub-config's hidden_size when constructing DummyWTE in
the branch guarded by model_config.tie_word_embeddings, has_last_block and not
has_first_block; keep using descriptor.input_embedding_name(), set_submodule,
DummyWTE and runtime.dtype as before but replace model_config.hidden_size with
the LM sub-config's hidden_size (e.g., lm_cfg.hidden_size).
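The descriptor indirection can be sketched in isolation. `get_language_model_config` follows the naming in the suggested fix and is illustrative only; `SimpleNamespace` stands in for HF config objects:

```python
# Resolve the language-model sub-config first, then read hidden_size from
# it, so flat (LLM) and nested (VLM) configs take the same code path.
from types import SimpleNamespace

def lm_hidden_size(descriptor, model_config):
    lm_config = descriptor.get_language_model_config(model_config)
    return lm_config.hidden_size

# Flat config: the LM config is the top-level config itself.
flat_descriptor = SimpleNamespace(get_language_model_config=lambda cfg: cfg)
# Nested config (e.g., a VLM): the LM config lives under text_config.
nested_descriptor = SimpleNamespace(get_language_model_config=lambda cfg: cfg.text_config)

flat = SimpleNamespace(hidden_size=4096)
nested = SimpleNamespace(text_config=SimpleNamespace(hidden_size=2048))

assert lm_hidden_size(flat_descriptor, flat) == 4096
assert lm_hidden_size(nested_descriptor, nested) == 2048
```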


codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.12%. Comparing base (8fe318d) to head (631306c).
⚠️ Report is 1 commit behind head on feature/puzzletron.

Additional details and impacted files
@@                  Coverage Diff                   @@
##           feature/puzzletron    #1007      +/-   ##
======================================================
+ Coverage               72.10%   72.12%   +0.02%     
======================================================
  Files                     209      209              
  Lines                   23628    23628              
======================================================
+ Hits                    17036    17042       +6     
+ Misses                   6592     6586       -6     


Collaborator:

Why do we call this nemotron_h and not nemotron_h_v3? Do we know if this will be the same for v4 as well?

Author:

The names are changing so fast; I added a TODO to unify them.

Collaborator:

Why is this qwen3_8b and not qwen3? All the other models have a generic converter, not one tied to a specific variant.

Author:

A model descriptor can be model-specific; even within the same family, different sizes can differ in how weights are named or structured. This one was only tested on Qwen3 8B, hence the name for now.

Collaborator:

Same comment: why qwen3_vl_30b and not qwen3_vl?

Author:

Because it was tested only on this particular model.

if rank == 1:
    # Log all inputs to values_to_node
    aprint(
        f"[DEBUG PASSAGE] Rank {rank}: Before passage forward, values_to_node has {len(values_to_node)} entries"
Collaborator:

Do we still need all these debug messages?

Author:

It was for test_puzzletron debugging; I removed those debugging commits from the repo.

Comment on lines +50 to +59
try:
    from modelopt.torch.puzzletron.tools.logger import aprint
except ImportError:
    # Fallback if logger is not available
    def aprint(msg: str | None):
        if torch.distributed.is_initialized():
            rank = torch.distributed.get_rank()
            print(f"[Rank {rank}] {msg}", flush=True)
        else:
            print(msg, flush=True)
Collaborator:

Can we instead just import it locally inside the function instead of re-defining it here?

Author:

It was for debugging; removed.
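The local-import pattern the reviewer suggests can be sketched as follows. The `modelopt` import path comes from the diff above; outside that repo it will normally fail, which exercises the fallback branch:

```python
# Import the shared logger lazily inside the function, falling back to a
# plain print when the package is unavailable, instead of re-defining
# aprint at module scope.

def aprint(msg):
    try:
        from modelopt.torch.puzzletron.tools.logger import aprint as _aprint
    except ImportError:
        def _aprint(m):  # minimal fallback
            print(m, flush=True)
    _aprint(msg)

aprint("hello from the fallback path")
```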

Collaborator:

Suggested change:

child_model = model_class.from_config(child_model_config, trust_remote_code=descriptor.requires_trust_remote_code())

Author:

Done.

state_dict_inf_count += inf_count
if nan_count > 0 or inf_count > 0:
    aprint(
        f"[DEBUG LOAD] Rank {rank_to_check}: State dict key '{key}' contains "
Collaborator:

Do we need these debug statements anymore?

Author:

It was a temporary debug statement; removed.

Collaborator:

Suggested change:

model = model_class.from_config(model_config, trust_remote_code=descriptor.requires_trust_remote_code())

Author:

Done.

@danielkorzekwa danielkorzekwa force-pushed the dkorzekwa/any_model_other_models branch from 3866125 to 27866de Compare March 16, 2026 19:05
    # This prevents NaN values in uninitialized parameters (e.g., backbone.layers.1.mixer.gate.weight
    # in nemotron-3-nano-30b-a3b-base-bf16) that can occur with from_config on RTX GPU cards (not on H100)
