Conversation

@SignalRT
Collaborator

@SignalRT SignalRT commented Sep 27, 2025

Prototype implementation:

  • Minimally tested on macOS.
  • Tested with CUDA 13 without success (this appears to be an issue in llama.cpp itself).
  • Includes a unit test.
  • The test does not render images.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements a comprehensive migration from the existing LLaVA multimodal architecture to a new MTMD (multimodal) implementation. The change unifies the handling of multimodal inputs (images, audio, video) by replacing specialized LLaVA components with generic MTMD helpers that support multiple media types through a consistent tokenization and evaluation pipeline.

  • Migration from LLaVA-specific classes to generic MTMD wrapper classes
  • Introduction of new native API surface for MTMD tokenization and chunk-based evaluation
  • Updated executors to use MTMD tokenization instead of direct image embedding evaluation
  • Comprehensive test coverage for the new MTMD functionality
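To make the chunk-based pipeline concrete, here is a rough sketch of what the new workflow could look like. The class names (SafeMtmdWeights, SafeMtmdInputChunks) come from this PR's file list, but every method name and signature below is an assumption inferred from the summary, not the actual API:

```csharp
// Hypothetical sketch of the MTMD tokenize-then-evaluate pipeline.
// Class names are from this PR; method names are assumptions only.
using LLama;

var modelParams = new ModelParams("model.gguf");
using var model = LLamaWeights.LoadFromFile(modelParams);
using var context = model.CreateContext(modelParams);

// SafeMtmdWeights replaces LLavaWeights as the multimodal projector wrapper.
using var mtmd = SafeMtmdWeights.LoadFromFile("mmproj.gguf", model);

// Instead of evaluating an image embedding directly, the prompt and any
// queued media are tokenized into a chunk collection; text chunks and
// media chunks then flow through one evaluation path.
using SafeMtmdInputChunks chunks = mtmd.Tokenize("<image> Describe this.", media);
foreach (var chunk in chunks)
{
    // Each chunk carries metadata (text vs. image vs. audio) and is
    // evaluated by the executor in sequence.
}
```

The key design shift is that media is no longer a special case: the same chunk loop handles text, image, and audio, which is what lets one executor code path serve all media types.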

Reviewed Changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
SafeMtmdWeights.cs New wrapper class for MTMD multimodal weights replacing LLavaWeights
NativeApi.Mtmd.cs Native P/Invoke surface for MTMD helper functions
SafeMtmdModelHandle.cs Native handle management for MTMD models with tokenization and evaluation
SafeMtmdInputChunks.cs Managed wrapper for native chunk collections returned by tokenizer
SafeMtmdInputChunk.cs Individual chunk wrapper with metadata access and token span views
SafeMtmdEmbed.cs Media embedding wrapper supporting images, audio, and raw data buffers
LLamaInteractExecutor.cs Updated interactive executor to use MTMD tokenization workflow
LLamaInstructExecutor.cs Updated instruct executor with MTMD preprocessing logic
BatchedExecutor.cs Added MTMD batch evaluation support for batched inference
Conversation.cs Extended conversation class with multimodal prompting and media queueing


Member

@martindevans martindevans left a comment

Thanks for all the hard work putting this together! Lots of small review nitpicks, but overall this looks really solid 👍

@Webslug

Webslug commented Oct 25, 2025

Version 25.0 breaks multimodal capabilities.

Qwen2.5-VL-3B won't work at all.

How do we load the weights from other multimodal models?

System.DllNotFoundException: 'Unable to load DLL 'llava_shared' or one of its dependencies: The specified module could not be found. (0x8007007E)'

        string multiModalProj = "F:\\AI\\models\\Qwen2.5-VL-3B-Instruct-mmproj-f16.gguf";
        string modelPath = "F:\\AI\\models\\Qwen2.5-VL-3B-Instruct-q4_k_m.gguf";

        var parameters = new ModelParams(modelPath);
        NativeApi.llama_log_set((level, message) => { });
        Environment.SetEnvironmentVariable("LLAMA_LOG", "0");
        using var model = LLamaWeights.LoadFromFile(parameters);
        using var context = model.CreateContext(parameters);
        using var clipModel = LLavaWeights.LoadFromFile(multiModalProj);
        var executor = new InteractiveExecutor(context, clipModel, logger: null);

        var inferenceParams = new InferenceParams()
        {
            MaxTokens = 512,
            AntiPrompts = new List<string> { "\nUSER:" },
            SamplingPipeline = new DefaultSamplingPipeline { Temperature = 0.1f }
        };
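For reference, this PR removes the llava_shared native library, which is why the LLava load step above throws DllNotFoundException. Under the new API, the load step would presumably be replaced by its MTMD equivalent. This is a sketch only: SafeMtmdWeights is named in the PR's file list as the replacement for LLavaWeights, but its exact LoadFromFile signature is an assumption, not confirmed API:

```csharp
// Hypothetical migration of the failing snippet above. The old line:
//   using var clipModel = LLavaWeights.LoadFromFile(multiModalProj);
// would become something like the following; the exact signature of
// SafeMtmdWeights.LoadFromFile is assumed, not taken from the PR.
using var clipModel = SafeMtmdWeights.LoadFromFile(multiModalProj, model);
var executor = new InteractiveExecutor(context, clipModel, logger: null);
```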

@SignalRT
Collaborator Author

@SignalRT , we've updated the master branch with new binaries and a couple of fixes that should solve the macos/osx CI issues. If you update, the tests should be passing again!

Branch updated; I'm now reviewing the issues that break this PR.

@SignalRT
Collaborator Author

@SignalRT , we've updated the master branch with new binaries and a couple of fixes that should solve the macos/osx CI issues. If you update, the tests should be passing again!

Is it just me, or is the llama.cpp module not updated to the new binary version?

@m0nsky
Contributor

m0nsky commented Dec 26, 2025

I think it's updated? It seems to point to 86587da on master, which is what the latest binaries were built with (though we recently rebuilt them with support for macos-14 and DCMAKE_BUILD_WITH_INSTALL_RPATH).

@SignalRT
Copy link
Collaborator Author

SignalRT commented Jan 10, 2026

It now seems to be working with these llama.cpp binaries: https://github.com/ggml-org/llama.cpp/releases/tag/b7679
I have tested the solution only on macOS for now.

I tested audio and vision (image) models:

  • gemma-3-4b-it-Q4_K_M.gguf
  • Qwen2.5-Omni-3B-Q8_0.gguf
  • Qwen2.5-VL-7B-Instruct-Q8_0.gguf
  • Voxtral-Mini-3B-2507-Q4_K_M.gguf
  • Qwen3-VL-2B-Instruct-Q8_0.gguf
  • Qwen3VL-8B-Instruct-Q8_0.gguf

@SciSharp SciSharp deleted a comment from Copilot AI Jan 12, 2026
@SignalRT SignalRT requested a review from martindevans January 13, 2026 21:40
Member

@martindevans martindevans left a comment

LGTM! This is a huge amount of work, thanks very much @SignalRT ❤️

@SignalRT SignalRT merged commit 435fe73 into SciSharp:master Jan 17, 2026
7 of 8 checks passed