Mtmd implementation #1261

SignalRT · 2025-09-27T16:47:24Z

Prototype implementation:

Minimally tested on macOS.
Tested unsuccessfully with CUDA 13 (seems to be an issue in llama.cpp itself).
Unit test
The test does not render images.

Copilot

Pull Request Overview

This PR implements a comprehensive migration from the existing LLaVA multimodal architecture to a new MTMD (Multi-Modal Text+Data) implementation. The change introduces a more unified approach to handling multimodal inputs (images, audio, video) by replacing specialized LLaVA components with generic MTMD helpers that support multiple media types through a consistent tokenization and evaluation pipeline.

Migration from LLaVA-specific classes to generic MTMD wrapper classes
Introduction of new native API surface for MTMD tokenization and chunk-based evaluation
Updated executors to use MTMD tokenization instead of direct image embedding evaluation
Comprehensive test coverage for the new MTMD functionality

Reviewed Changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
SafeMtmdWeights.cs	New wrapper class for MTMD multimodal weights replacing LLavaWeights
NativeApi.Mtmd.cs	Native P/Invoke surface for MTMD helper functions
SafeMtmdModelHandle.cs	Native handle management for MTMD models with tokenization and evaluation
SafeMtmdInputChunks.cs	Managed wrapper for native chunk collections returned by tokenizer
SafeMtmdInputChunk.cs	Individual chunk wrapper with metadata access and token span views
SafeMtmdEmbed.cs	Media embedding wrapper supporting images, audio, and raw data buffers
LLamaInteractExecutor.cs	Updated interactive executor to use MTMD tokenization workflow
LLamaInstructExecutor.cs	Updated instruct executor with MTMD preprocessing logic
BatchedExecutor.cs	Added MTMD batch evaluation support for batched inference
Conversation.cs	Extended conversation class with multimodal prompting and media queueing

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

LLama/Native/NativeApi.cs

LLama/Native/MtmdContextParams.cs

LLama/MtmdWeights.cs

LLama/Native/SafeMtmdModelHandle.cs

LLama/Batched/Conversation.cs

LLama/Batched/BatchedExecutor.cs

LLama/Batched/Conversation.cs

LLama/Batched/ConversationExtensions.cs

LLama/Native/MtmdContextParams.cs

LLama/Native/NativeApi.Mtmd.cs

LLama/Native/SafeMtmdEmbed.cs

LLama/Native/SafeMtmdInputChunk.cs

LLama/Native/SafeMtmdInputChunks.cs

LLama/SafeMtmdWeights.cs

martindevans

Thanks for all the hard work putting this together! Lots of small review nitpicks, but overall this looks really solid 👍

Webslug · 2025-10-25T17:09:39Z

Version 25.0 just breaks multi modal capabilities.

Qwen2.5-VL-3B won't work at all.

How do we load the weights from other multimodal models?

System.DllNotFoundException: 'Unable to load DLL 'llava_shared' or one of its dependencies: The specified module could not be found. (0x8007007E)'

        string multiModalProj = "F:\\AI\\models\\Qwen2.5-VL-3B-Instruct-mmproj-f16.gguf";
        string modelPath = "F:\\AI\\models\\Qwen2.5-VL-3B-Instruct-q4_k_m.gguf";

        var parameters = new ModelParams(modelPath);
        NativeApi.llama_log_set((level, message) => { });
        Environment.SetEnvironmentVariable("LLAMA_LOG", "0");
        using var model = LLamaWeights.LoadFromFile(parameters);
        using var context = model.CreateContext(parameters);
        using var clipModel = LLavaWeights.LoadFromFile(multiModalProj);
        var executor = new InteractiveExecutor(context, clipModel, logger: null);

        var inferenceParams = new InferenceParams()
        {
            MaxTokens = 512,
            AntiPrompts = new List<string> { "\nUSER:" },
            SamplingPipeline = new DefaultSamplingPipeline { Temperature = 0.1f }
        };

Co-authored-by: Copilot <[email protected]>

SignalRT · 2025-12-25T18:55:34Z

@SignalRT , we've updated the master branch with new binaries and a couple of fixes that should solve the macos/osx CI issues. If you update, the tests should be passing again!

Branch updated; reviewing PR-breaking issues.

SignalRT · 2025-12-25T20:53:52Z

@SignalRT , we've updated the master branch with new binaries and a couple of fixes that should solve the macos/osx CI issues. If you update, the tests should be passing again!

Is it just me, or is the llama.cpp module not updated to the new binary version?

m0nsky · 2025-12-26T08:34:27Z

I think it's updated? It seems to point to 86587da on master, which is what the latest binaries were built with (but we recently rebuilt them with support for macos-14 and DCMAKE_BUILD_WITH_INSTALL_RPATH)

…ive example

Change simple braces to doble braces solve the problem if the prompt has json format

SignalRT · 2026-01-10T15:52:24Z

Now seems to be working with this llama.cpp binaries: https://github.com/ggml-org/llama.cpp/releases/tag/b7679
I tested the solution only in osx for now.

I tested audio and video (image) models:

gemma-3-4b-it-Q4_K_M.gguf
Qwen2.5-Omni-3B-Q8_0.gguf
Qwen2.5-VL-7B-Instruct-Q8_0.gguf
Voxtral-Mini-3B-2507-Q4_K_M.gguf
Qwen3-VL-2B-Instruct-Q8_0.gguf
Qwen3VL-8B-Instruct-Q8_0.gguf

llama.cpp b7703 solves the issue

martindevans

LGTM! This is a huge amount of work, thanks very much @SignalRT ❤️

This was referenced Sep 27, 2025

System.DllNotFoundException: 'Unable to load DLL 'llava_shared' or one of its dependencies: The specified module could not be found. (0x8007007E)' #1255

Closed

Multimodal embedding #1193

Open

SignalRT force-pushed the mtmd_implementation branch from fcce175 to 9931d0e Compare September 28, 2025 14:54

SignalRT requested a review from Copilot September 28, 2025 15:51

Copilot AI reviewed Sep 28, 2025

View reviewed changes

SignalRT added a commit to SignalRT/LLamaSharp that referenced this pull request Sep 29, 2025

Resolve comment: SciSharp#1261 (comment)

3c92b07

SignalRT mentioned this pull request Sep 30, 2025

Qwen2.5-VL gguf model output garbled code #1194

Closed

SignalRT marked this pull request as ready for review October 5, 2025 12:27

SignalRT mentioned this pull request Oct 6, 2025

[BUG]: Error in version 0.25.0 - LLama.Exceptions.RuntimeError: Failed to load the native library. #1275

Closed

SignalRT requested a review from martindevans October 19, 2025 16:05