What is multimodal embedding finetuning?

Multimodal embedding finetuning adapts a pre-trained model that processes text, images, audio, or video to perform better on a specific task, such as visual document retrieval. By training on domain-specific data, the model learns specialized patterns like layout understanding and table matching, improving retrieval accuracy.

How do you finetune a Sentence Transformers model for images?

You use the same SentenceTransformerTrainer as text-only models, but your dataset includes images alongside text. The model's processor automatically handles image preprocessing, and you can control image quality via processor_kwargs like min_pixels and max_pixels. The training pipeline remains identical otherwise.

What is NDCG improvement in visual document retrieval?

NDCG (Normalized Discounted Cumulative Gain) measures ranking quality. In the example walkthrough, finetuning Qwen3-VL-Embedding-2B for visual document retrieval improved NDCG@10 from 0.888 to 0.947, a 6.6% relative gain (5.9 points absolute) that outperforms models up to 4x larger.

Finetune Multimodal Embedding Models with Sentence Transformers: NDCG +6.6%

Sentence Transformers has been my go-to library for embedding and reranker models for a while now. Last month, Tom Aarsen introduced multimodal support (text, images, audio, video), which opened up a lot of interesting use cases. But the real power comes when you finetune these models on your own data. That’s what I want to walk through here.

I’ll use a concrete example: finetuning Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR). The task is simple in concept but hard in practice: given a text query like “What was the company’s Q3 revenue?”, find the right document page (as an image with charts, tables, layout intact) from a large corpus. The base model is decent, but after finetuning, NDCG@10 jumped from 0.888 to 0.947, beating models up to 4x its size.

Why Bother Finetuning?

General-purpose multimodal models like Qwen’s embedding model are trained on everything under the sun: image-text matching, visual QA, document understanding, you name it. That breadth comes at a cost. They’re rarely the best at any single task.

Visual Document Retrieval is a great example. Matching a text query to a document screenshot requires understanding layout, tables, charts, and how text flows across a page. That’s a completely different skill from matching a product photo to its description. Finetuning on domain-specific data teaches the model these specialized patterns, and the results speak for themselves.

The Training Pipeline

The training components are the same as text-only Sentence Transformer training:

Model: The multimodal model you want to finetune
Dataset: Your training and evaluation data
Loss Function: What guides optimization
Training Arguments: Performance and logging settings
Evaluator: Optional, but highly recommended
Trainer: Ties everything together

The trainer itself is the same SentenceTransformerTrainer used for text-only models. The only real difference is the data: your datasets include images (or other modalities) alongside text, and the model’s processor handles preprocessing automatically.

Setting Up the Model

You can either finetune an existing multimodal embedding model or start from a fresh VLM checkpoint. For the Qwen model, I used:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)

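The processor_kwargs control image preprocessing: min_pixels and max_pixels bound the resolution each image is resized to, so a higher max_pixels preserves more detail (useful for small text in tables) at the cost of more memory per image. The model_kwargs set the precision and attention implementation. Flash Attention 2 is worth using if you have compatible hardware and the flash-attn package installed.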
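To get a feel for how max_pixels drives memory, here is a rough back-of-the-envelope sketch. It assumes each visual token covers a fixed square of pixels. The 28 x 28 default is an assumption picked to match the min_pixels value above, not a confirmed property of this model, so check the processor config before relying on the numbers.

```python
def approx_visual_tokens(max_pixels: int, pixels_per_token: int = 28 * 28) -> int:
    """Rough upper bound on visual tokens for one image resized to max_pixels.

    pixels_per_token is an ASSUMED value; the real figure depends on the
    model's patch size and how it merges patches.
    """
    return max_pixels // pixels_per_token


# Compare a few candidate pixel budgets.
for side in (400, 600, 1000):
    budget = side * side
    tokens = approx_visual_tokens(budget)
    print(f"max_pixels={side}*{side}: ~{tokens} visual tokens per image")
```

Since every image in a batch adds its visual tokens to the sequence, lowering max_pixels is usually the first knob to try when you run out of memory.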

If you’re starting from a VLM that hasn’t been trained for embeddings yet (e.g., Qwen/Qwen3-VL-2B), Sentence Transformers will try to detect the architecture, infer supported modalities from the processor, and set up pooling automatically. You can check what it figured out:

print(model.modalities)
print(model.supports("image"))

If the auto-detection doesn’t work perfectly for some model, you can edit the saved sentence_bert_config.json to adjust modality settings and forward methods.

Building the Dataset

For Visual Document Retrieval, you need pairs of text queries and relevant document images. The dataset format is straightforward: each example has a text query and one or more positive document images. You can also include hard negatives for better training.

The Hugging Face datasets library handles this cleanly. You load images as PIL objects or paths, and Sentence Transformers processes them automatically. Here’s the rough structure:

{
    "query": "What was the company's Q3 revenue?",
    "positive_images": ["doc_page_42.png"],
    "negative_images": ["doc_page_13.png", "doc_page_87.png"]
}

The model’s processor handles resizing, normalization, and any modality-specific preprocessing.
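If your raw data looks like the record above, one option is to flatten it into one row per (query, positive, negative) combination before training. The sketch below is plain Python. The output column names "positive" and "negative" are my own illustrative choice, not something the library requires, and the file names are the placeholder values from the example.

```python
from itertools import product


def to_rows(record: dict) -> list[dict]:
    """Expand one record into flat training rows.

    Column names here are hypothetical; adapt them to your own dataset.
    """
    positives = record["positive_images"]
    negatives = record.get("negative_images", [])
    if not positives:
        raise ValueError("each record needs at least one positive image")
    if not negatives:
        # No hard negatives: emit plain (query, positive) pairs.
        return [{"query": record["query"], "positive": pos} for pos in positives]
    return [
        {"query": record["query"], "positive": pos, "negative": neg}
        for pos, neg in product(positives, negatives)
    ]


record = {
    "query": "What was the company's Q3 revenue?",
    "positive_images": ["doc_page_42.png"],
    "negative_images": ["doc_page_13.png", "doc_page_87.png"],
}
rows = to_rows(record)
for row in rows:
    print(row)
```

Each row could then be loaded into a Hugging Face Dataset, with the image paths opened as PIL images.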

Choosing the Loss Function

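For embedding models, CachedMultipleNegativesRankingLoss is the workhorse. It’s designed for retrieval tasks where you have a query and a set of candidates: the loss pushes the model to score the positive candidate above the negatives, which include both any hard negatives you supply and the other candidates in the same batch. The cached variant uses gradient caching so you can train with a large effective batch (and therefore more in-batch negatives) without a matching increase in GPU memory.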
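To make the objective concrete, here is a minimal pure-Python sketch of an in-batch-negatives ranking loss. It illustrates the idea only, not the library's implementation: it skips the gradient caching entirely, and the scale of 20 and the toy 2-dimensional vectors are illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(queries, docs, scale: float = 20.0) -> float:
    """Mean cross-entropy where docs[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, query in enumerate(queries):
        scores = [scale * cosine(query, doc) for doc in docs]
        top = max(scores)  # subtract the max for a stable log-sum-exp
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive points the same way as its query
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong query

loss_aligned = in_batch_ranking_loss(queries, aligned)
loss_swapped = in_batch_ranking_loss(queries, swapped)
print(f"aligned: {loss_aligned:.6f}  swapped: {loss_swapped:.6f}")
```

The loss is near zero when each query is closest to its own positive and large when it is closer to another item in the batch, which is why bigger batches (more negatives) tend to give a stronger training signal.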

I also added MatryoshkaLoss on top. This lets the model produce embeddings at multiple dimensions (e.g., 256, 512, 1024, 2048) simultaneously. The benefit is flexibility: you can use smaller embeddings for faster search when quality requirements are lower, and full embeddings when you need maximum accuracy. During training, the model learns to pack information efficiently at all target dimensions.
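At inference time, using a smaller Matryoshka dimension amounts to keeping the first N values of each embedding and renormalizing. Here is a small sketch of that step with made-up 8-dimensional vectors standing in for the real 2048-dimensional ones. The numbers are illustrative, not model output.

```python
import math


def truncate_and_normalize(embedding: list[float], dim: int) -> list[float]:
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy embeddings (hypothetical values).
full_query = [0.50, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02, 0.01]
full_doc = [0.45, 0.42, 0.28, 0.25, 0.00, 0.10, 0.00, 0.03]

for dim in (2, 4, 8):
    q = truncate_and_normalize(full_query, dim)
    d = truncate_and_normalize(full_doc, dim)
    # After normalization the dot product equals cosine similarity.
    print(f"dim={dim}: similarity={dot(q, d):.4f}")
```

In practice you would not do this by hand: Sentence Transformers exposes a truncate_dim option for this. The sketch just shows why truncated vectors from a Matryoshka-trained model remain usable for similarity search.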

Training Arguments

Standard stuff here: learning rate, batch size, warmup steps, evaluation strategy. I used a learning rate of 2e-5 with linear warmup over the first 10% of steps. Batch size depends on your GPU memory: the Qwen 2B model fits comfortably on a 24GB card with a batch size of 8 to 16.
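For intuition, this is roughly what that schedule looks like as a function of the training step. The warmup part follows the settings above. The linear decay to zero afterwards is an assumption on my part (it is a common default), and the total of 1000 steps is just an example.

```python
def learning_rate_at(step: int, total_steps: int,
                     peak_lr: float = 2e-5, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then (ASSUMED) linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {learning_rate_at(step, total):.2e}")
```

The warmup keeps early updates small while the optimizer statistics settle, which matters when you start from a pretrained model you don't want to knock off course.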

One thing worth noting: multimodal models are memory-hungry because they process images. Monitor your GPU usage and adjust max_pixels if needed.

Evaluation

The evaluator runs during training to track progress. For retrieval tasks, I used NDCG@10 and Recall@10. The evaluator computes embeddings for all queries and candidates, then measures retrieval quality. This is the same metric used in the final comparison.
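If you haven't worked with these metrics before, here is a minimal sketch of how NDCG@k and Recall@k are computed for a single query. It assumes binary relevance labels and that the ranked list contains every relevant document. The library's evaluator handles this for you, so this is for intuition only.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: hits lower in the ranking count for less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    total_relevant = sum(1 for rel in ranked_relevances if rel > 0)
    hits = sum(1 for rel in ranked_relevances[:k] if rel > 0)
    return hits / total_relevant if total_relevant else 0.0


# 1 marks the relevant page, 0 an irrelevant one, in ranked order.
perfect = [1, 0, 0, 0]
second_place = [0, 1, 0, 0]
print(ndcg_at_k(perfect), ndcg_at_k(second_place), recall_at_k(second_place))
```

NDCG rewards putting the right page at the very top, while Recall@10 only asks whether it made the top ten, so the two together tell you both whether the page is found and how high it ranks.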

Results

The finetuned model (tomaarsen/Qwen3-VL-Embedding-2B-vdr) achieved an NDCG@10 of 0.947, up from 0.888 for the base model. That’s a 6.6% relative improvement (5.9 points absolute), which is substantial for retrieval tasks, especially on top of an already strong baseline. It outperformed all existing VDR models I tested, including some up to 4x larger.
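For reference, the arithmetic behind the headline number. The 6.6% figure is the gain relative to the base score, while the absolute difference is about 5.9 points:

```python
base_ndcg = 0.888
finetuned_ndcg = 0.947

absolute_gain = finetuned_ndcg - base_ndcg   # difference in NDCG@10
relative_gain = absolute_gain / base_ndcg    # gain as a fraction of the base score

print(f"absolute gain: {absolute_gain:.3f}")   # absolute gain: 0.059
print(f"relative gain: {relative_gain:.1%}")   # relative gain: 6.6%
```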

The Matryoshka dimensions also showed interesting behavior. At the smallest dimension (256), performance was still strong, suggesting the model learned efficient representations. The full 2048-dimension embeddings gave the best results, but the 512-dimension version was close behind, offering a good speed-quality tradeoff.

Training Reranker Models

Reranker models follow a similar pattern but use a different loss function and training setup. Instead of producing embeddings, rerankers take a query-candidate pair and output a relevance score. The training pipeline is simpler in some ways because you don’t need to worry about embedding dimensions or pooling.

For multimodal rerankers, the model processes both the query and the candidate (which can be an image) together, then outputs a score. This is more expensive at inference time (you can’t precompute embeddings), but often gives better ranking quality.
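Because of that cost, a common pattern is to shortlist candidates with the embedding model first and only rerank the shortlist. The sketch below shows the control flow with toy stand-ins: the two scoring functions are hypothetical placeholders for an embedding similarity and a reranker score, and the strings stand in for document page images.

```python
def retrieve_then_rerank(query, corpus, embed_score, pair_score, shortlist_size=3):
    """Stage 1: cheap scoring over the whole corpus. Stage 2: costly scoring on the shortlist."""
    shortlist = sorted(corpus, key=lambda doc: embed_score(query, doc), reverse=True)
    shortlist = shortlist[:shortlist_size]
    return sorted(shortlist, key=lambda doc: pair_score(query, doc), reverse=True)


def cheap_score(query: str, doc: str) -> int:
    """Toy stand-in for embedding similarity: number of shared words."""
    return len(set(query.split()) & set(doc.split()))


def careful_score(query: str, doc: str) -> float:
    """Toy stand-in for a reranker: shared words relative to document length."""
    return cheap_score(query, doc) / len(doc.split())


corpus = [
    "q3 revenue table",
    "q3 headcount chart",
    "annual revenue summary",
    "office locations",
]
results = retrieve_then_rerank("q3 revenue", corpus, cheap_score, careful_score, shortlist_size=2)
print(results)
```

With real models, the first stage runs against precomputed document embeddings, so only shortlist_size query-document pairs ever go through the expensive reranker.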

Final Thoughts

The multimodal capabilities in Sentence Transformers have matured nicely. The training pipeline is clean, the integration with Hugging Face models is seamless, and the results are impressive. If you’re working on any retrieval task that involves images (document retrieval, product search, visual QA), finetuning is absolutely worth the effort.

The gap between a general-purpose model and a finetuned one is bigger than I expected. A 6.6% relative NDCG@10 improvement on an already strong model is significant, and the finetuned model beating models up to 4x its size shows that, for a focused task like this, domain-specific training can matter more than raw parameter count.

If you want to try it yourself, the code and dataset are available on GitHub. The training script is around 100 lines, and you can adapt it to your own data with minimal changes.
