Abstract
GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
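The abstract does not spell out the composite grounded OCR score. One plausible component of such a protocol is greedy IoU matching between predicted and reference line boxes combined with a text-agreement check; the sketch below illustrates that idea only (the function names, the 0.5 IoU threshold, and the exact-text criterion are all assumptions, not the paper's actual metric):

```python
# Hedged sketch of one component of a grounded OCR evaluation:
# greedily match predicted (box, text) lines to reference lines.
# Threshold and scoring rule are illustrative assumptions.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounded_match_score(preds, refs, iou_thresh=0.5):
    """Fraction of reference lines matched one-to-one by a prediction
    with IoU >= iou_thresh and identical text."""
    used, hits = set(), 0
    for r_box, r_text in refs:
        best, best_iou = None, iou_thresh
        for i, (p_box, p_text) in enumerate(preds):
            if i in used or p_text != r_text:
                continue
            s = iou(p_box, r_box)
            if s >= best_iou:
                best, best_iou = i, s
        if best is not None:
            used.add(best)
            hits += 1
    return hits / len(refs) if refs else 1.0
```

A real protocol would also weight partial text matches (e.g. edit distance) rather than requiring exact agreement; this sketch keeps only the geometric matching step.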
Community
We're excited to share our first open model release, a grounded VLM for OCR applications!
We have also open-sourced our training code (for multi-GPU setups) under the Apache-2.0 license: https://github.com/Roots-Automation/GutenOCR
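To illustrate the unified prompt-based interface, a conditional grounding query could be issued through a Qwen2.5-VL-style chat payload. The prompt wording and message structure below are assumptions for illustration; consult the repository above for the model's actual interface:

```python
# Hedged sketch: building a Qwen2.5-VL-style chat message for a
# conditional "where is x?" grounding query. The prompt text is an
# illustrative assumption, not GutenOCR's documented interface.

def build_grounding_query(image_path, target_text):
    """Return a chat-format message asking where `target_text`
    appears on the page image (bounding-box grounding)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {
                    "type": "text",
                    "text": f'Where is "{target_text}"? '
                            "Answer with a bounding box.",
                },
            ],
        }
    ]

messages = build_grounding_query("page.png", "Invoice Total")
```

The same chat interface would carry full-page reading or detection requests by swapping the text prompt, which is what makes a single checkpoint serve all three tasks.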
The following similar papers were recommended by the Semantic Scholar API:
- LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR (2026)
- DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM (2025)
- NVIDIA Nemotron Parse 1.1 (2025)
- dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model (2025)
- HunyuanOCR Technical Report (2025)
- DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA (2025)
- Qwen3-VL Technical Report (2025)