Small Yet Mighty: Improve Accuracy In Multimodal Search and Visual Document Retrieval with Llama Nemotron RAG Models
How to build accurate, low-latency visual document retrieval with small Llama Nemotron models that work out-of-the-box with standard vector databases
In real applications, data is not just text. It lives in PDFs with charts, scanned contracts, tables, screenshots, and slide decks, so a text-only retrieval system will miss important information. Multimodal RAG pipelines change this by enabling retrieval and reasoning over text, images, and layouts together, leading to more accurate and actionable answers.
This post walks through two small Llama Nemotron models for multimodal retrieval over visual documents:
- llama-nemotron-embed-vl-1b-v2: a dense single-vector multimodal (image + text) embedding model for page-level retrieval and similarity search.
- llama-nemotron-rerank-vl-1b-v2: a cross-encoder reranking model for query–page relevance scoring.
Both models are:
- Small enough to run on most NVIDIA GPUs
- Compatible with standard vector databases (single dense vector per page)
- Designed to reduce hallucinations by grounding generation on better evidence, not longer prompts
We will show how they behave on realistic document benchmarks below.
Why multimodal RAG needs world-class retrieval
Multimodal RAG pipelines combine a retriever with a vision-language model (VLM) so responses are grounded in both retrieved page text and visual content, not just raw text prompts.
Embeddings control which pages are retrieved and shown to the VLM. Reranking models decide which of those pages are most relevant and should influence the answer. If either step is inaccurate, the VLM is more likely to hallucinate—often with high confidence. Using multimodal embeddings together with a multimodal reranker keeps generation grounded in the correct page images and text.
The State-of-the-Art in Commercial Multimodal Search
The llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2 models are designed for developers building multimodal question-answering and search over large corpora of PDFs and images.
The llama-nemotron-embed-vl-1b-v2 model is a single-vector (dense) embedding model that efficiently condenses visual and textual information into a single representation. This design ensures compatibility with all standard vector databases and enables millisecond-latency search at enterprise scale.
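As a concrete illustration, one dense vector per page drops straight into any standard vector index. The sketch below uses FAISS; the `embed_page` and `embed_query` helpers are hypothetical placeholders for however you serve the model, and the 2048-dimensional output is described in the architecture section later in this post.

```python
import faiss
import numpy as np

DIM = 2048  # output dimension of llama-nemotron-embed-vl-1b-v2

def embed_page(page_image: str, page_text: str | None = None) -> np.ndarray:
    # Hypothetical wrapper: run llama-nemotron-embed-vl-1b-v2 on a page image,
    # optionally with its extracted text, and return one dense vector.
    return np.random.rand(DIM).astype("float32")  # placeholder for real inference

def embed_query(text: str) -> np.ndarray:
    # Hypothetical wrapper: embed a text query with the same model.
    return np.random.rand(DIM).astype("float32")  # placeholder for real inference

pages = [("page_01.png", "extracted text"), ("page_02.png", "extracted text")]
vectors = np.stack([embed_page(img, txt) for img, txt in pages])
faiss.normalize_L2(vectors)          # normalize so inner product behaves as cosine
index = faiss.IndexFlatIP(DIM)       # one flat dense vector per page
index.add(vectors)

query = embed_query("What was Q3 revenue?")[None, :]
faiss.normalize_L2(query)
scores, page_ids = index.search(query, min(5, index.ntotal))  # top-k candidate pages
```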
llama-nemotron-rerank-vl-1b-v2 is a cross-encoder reranking model that reorders the top retrieved candidates to improve relevance, boosting downstream answer quality without changing your storage or index format.
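Because reranking operates purely on the retrieved candidates, it can be added without touching the index. A minimal sketch, assuming a hypothetical `rerank_score` wrapper around llama-nemotron-rerank-vl-1b-v2 that scores a single query–page pair:

```python
def rerank_score(query: str, page) -> float:
    # Hypothetical wrapper: cross-encode the (query, page image + text) pair
    # with llama-nemotron-rerank-vl-1b-v2 and return a relevance score.
    return 0.0  # placeholder for real model inference

def rerank(query: str, candidates: list, top_n: int = 5) -> list:
    # Score every first-stage candidate jointly with the query, then reorder.
    # No storage or index change: this only reorders already-retrieved results.
    ranked = sorted(candidates, key=lambda page: rerank_score(query, page), reverse=True)
    return ranked[:top_n]
```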
We evaluated llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2 on five visual document retrieval datasets: the popular ViDoRe V1 and V2, ViDoRe V3 (a realistic enterprise visual document retrieval benchmark composed of 8 public datasets), and two internal visual document retrieval datasets:
- DigitalCorpora-10k: A dataset of over 1,300 questions based on a corpus of 10,000 documents from DigitalCorpora, with a good mixture of text, tables, and charts.
- Earnings V2: An internal retrieval dataset of 287 questions based on 500 PDFs, mostly consisting of earnings reports from big tech companies.
Visual Document Retrieval (page retrieval) benchmarks
The table below reports the average retrieval accuracy (Recall@5) across five datasets, focusing specifically on commercially viable dense retrieval models.
We can see that llama-nemotron-embed-vl-1b-v2 provides better retrieval accuracy (Recall@5) in the image and image + text modalities than its predecessor, llama-3.2-nemoretriever-1b-vlm-embed-v1, and better text-modality accuracy than llama-nemotron-embed-1b-v2, our small text embedding model. Finally, our VLM reranker, llama-nemotron-rerank-vl-1b-v2, improves retrieval accuracy by a further 7.2%, 6.9%, and 6.0% (relative) in the text, image, and image + text modalities, respectively.
Note: Image+Text modality means that both the page image and its text (extracted using ingestion libraries like NV-Ingest) are fed as input to the embedding model for more accurate representation and retrieval.
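For reference, Recall@5 here means the fraction of queries for which at least one relevant page appears among the top five retrieved results. A minimal implementation of that metric:

```python
def recall_at_k(ranked_pages: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    # Fraction of queries with at least one relevant page in the top-k results.
    hits = sum(
        any(page in gold for page in ranked[:k])
        for ranked, gold in zip(ranked_pages, relevant)
    )
    return hits / len(ranked_pages)

# Two queries; only the first has its gold page in the top 3 -> 0.5
print(recall_at_k([["p1", "p7", "p3"], ["p9", "p2", "p4"]], [{"p3"}, {"p5"}], k=3))
```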
Visual Document Retrieval benchmarks (page retrieval) – Avg Recall@5 on DigitalCorpora-10k, Earnings V2, ViDoRe V1, V2, V3
| Model | Text | Image | Image + Text |
|---|---|---|---|
| llama-nemotron-embed-1b-v2 | 69.35% | - | - |
| llama-3.2-nemoretriever-1b-vlm-embed-v1 | 71.07% | 70.46% | 71.71% |
| llama-nemotron-embed-vl-1b-v2 | 71.04% | 71.20% | 73.24% |
| llama-nemotron-embed-vl-1b-v2 + llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64% |
The table below compares the accuracy of llama-nemotron-rerank-vl-1b-v2 with two other publicly available multimodal reranker models: jina-reranker-m0 and MonoQwen2-VL-v0.1. Although jina-reranker-m0 performs well on image-only tasks, its public weights are restricted to non-commercial use (CC-BY-NC). In contrast, llama-nemotron-rerank-vl-1b-v2 offers superior performance across the text and combined image + text modalities, and its permissive commercial license makes it an ideal choice for enterprise deployments.
| Model | Text | Image | Image + Text |
|---|---|---|---|
| llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64% |
| jina-reranker-m0 | 69.31% | 78.33% | NA |
| MonoQwen2-VL-v0.1 | 74.70% | 75.80% | 75.98% |
Architectural Highlights & Training Methodology
The llama-nemotron-embed-vl-1b-v2 embedding model is a transformer-based encoder with approximately 1.7B parameters. It is a fine-tuned version of the NVIDIA Eagle family of models, built on the Llama 3.2 1B language model and the SigLIP 2 400M vision encoder. Embedding models for retrieval are typically trained with a bi-encoder architecture that encodes query and document independently. The model applies mean pooling over the output token embeddings of the language model, producing a single embedding with 2048 dimensions. Contrastive learning trains the model to increase similarity between queries and relevant documents while decreasing similarity to negative samples.
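A minimal PyTorch sketch of those two ingredients, mean pooling over token embeddings and contrastive training with in-batch negatives, follows. It illustrates the general recipe described above rather than NVIDIA's actual training code; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, dim). Average token embeddings, ignoring padding.
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def contrastive_loss(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # q, d: (batch, dim) pooled embeddings of queries and their positive documents.
    # Every other document in the batch acts as an in-batch negative.
    q, d = F.normalize(q, dim=-1), F.normalize(d, dim=-1)
    logits = q @ d.T / temperature                     # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```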
The llama-nemotron-rerank-vl-1b-v2 model is a cross-encoder with approximately 1.7B parameters, also fine-tuned from an NVIDIA Eagle-family model. The final-layer hidden states of the language model are aggregated with mean pooling, and a binary classification head is fine-tuned for the ranking task. The model was trained with a cross-entropy loss on publicly available and synthetically generated datasets.
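The reranker's scoring path can be sketched the same way: mean-pool the final hidden states of the jointly encoded query–page sequence, then apply a binary classification head whose positive-class logit serves as the relevance score. This is an illustrative sketch of the recipe described above; the 2048 hidden size is an assumption based on the Llama 3.2 1B backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RerankerHead(nn.Module):
    # Illustrative binary relevance head over pooled cross-encoder states.
    def __init__(self, hidden_size: int = 2048):  # assumed Llama 3.2 1B hidden size
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # relevant vs. not relevant

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Same mean pooling as the embedding model, over the joint (query, page) sequence.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.classifier(pooled)  # (batch, 2) logits

# Training: cross-entropy against 0/1 relevance labels, e.g.
#   loss = F.cross_entropy(head(hidden, mask), labels)
# Inference: the positive-class logit is the relevance score, e.g.
#   score = head(hidden, mask)[:, 1]
```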
How Organizations are Using These Models
Here are three examples of how organizations are applying the new Nemotron embedding and reranking models, which you can adapt for your own systems.
Cadence: design and EDA workflows
Cadence models logic design assets such as micro-architecture and specification documents, constraints, and verification collateral as connected multimodal documents. As a result, an engineer can ask, “I want to extend the interrupt controller to support a low power state, show me which spec sections need changes,” and instantly surface the most relevant requirements. The system can then suggest a few alternative specification-update strategies, compare their tradeoffs, and generate the corresponding spec edits for the option the user selects.
IBM: domain-heavy storage and infra docs
IBM Storage treats each page of long PDFs—product guides, configuration manuals, and architecture diagrams—as a multimodal document, embeds it, and uses the reranker to prioritize pages where domain-specific terms, acronyms, and product names appear in the correct context before sending them to downstream LLMs. This improves how AI systems interpret storage concepts and reason over complex infrastructure documentation.
ServiceNow: chat over large sets of PDFs
ServiceNow uses multimodal embeddings to index pages from organizational PDFs and then applies the reranker to select the most relevant pages for each user query in its “Chat with PDF” experiences. By keeping high-scoring pages in context across turns, their agents maintain more coherent conversations and help users navigate large document collections more effectively.
Get Started
You can try the models directly:
- Run llama-nemotron-embed-vl-1b-v2 in your vector database of choice to power multimodal search over PDFs and images.
- Add llama-nemotron-rerank-vl-1b-v2 as a second-stage reranker on your top-k results to improve retrieval quality without changing your index (a sketch combining both stages follows this list).
- Download Nemotron RAG models if you want end-to-end components for agents. The models aren't limited to standalone use; they can also be integrated into ingestion pipelines.
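Putting the two stages together, a retrieve-then-rerank helper might look like the sketch below. The `embed_query` and `rerank_score` wrappers are hypothetical placeholders for your own inference code (a hosted endpoint or local serving), and the FAISS index is assumed to be built as shown earlier, with one vector per page.

```python
import faiss
import numpy as np

def embed_query(question: str) -> np.ndarray:
    return np.random.rand(2048).astype("float32")  # placeholder: llama-nemotron-embed-vl-1b-v2

def rerank_score(question: str, page) -> float:
    return 0.0  # placeholder: llama-nemotron-rerank-vl-1b-v2

def retrieve(question: str, index: faiss.Index, pages: list, k: int = 25, top_n: int = 5) -> list:
    # Stage 1: dense vector search over single-vector page embeddings (recall).
    q = embed_query(question)[None, :].astype("float32")
    faiss.normalize_L2(q)
    _, ids = index.search(q, min(k, index.ntotal))
    candidates = [pages[i] for i in ids[0] if i >= 0]
    # Stage 2: cross-encoder reranking of the candidates (precision).
    candidates.sort(key=lambda page: rerank_score(question, page), reverse=True)
    return candidates[:top_n]  # feed these pages (images + text) to your VLM
```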
Plug the new models into your existing RAG stack, or combine them with other open models on Hugging Face to build multimodal agents that understand your PDFs, not just their extracted text.
Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, YouTube and the Nemotron channel on Discord.