Own Your Data Science and AI Workshop

LightOnOCR: Pushing the Performance-Efficiency Pareto Frontier of Open OCR Models

Date: May 5; Time: 11:20 - 11:40; Location: Open Stage

LightOnOCR is a 1B-parameter vision-language model (VLM) for OCR, designed to push the performance-efficiency Pareto frontier for real-world document understanding. In this talk, we present the motivation behind LightOnOCR, the key design choices behind an end-to-end multilingual OCR model, and the trade-offs involved in building models that are both accurate and efficient. We discuss the full training pipeline, including how data is curated, cleaned, deduplicated, and augmented, as well as the many tricks needed to make large-scale OCR training work in practice. We cover pretraining on large image datasets, with stronger coverage of scans, scientific PDFs, and LaTeX-heavy content, and show how a final RLVR (reinforcement learning with verifiable rewards) stage helps address persistent failure modes that supervised training alone cannot fully resolve, including repetition loops, formatting errors, and layout-sensitive consistency issues. At the time of its release, LightOnOCR topped OlmOCR-Bench while outperforming models up to 9x larger. Beyond benchmark results, the talk focuses on what matters in practice: achieving high throughput and low latency on realistic hardware, such as a single H100, where larger VLM-based approaches are too slow to be usable.

Speakers

Said Taghadouini, LightOn

Baptiste Aubertin, LightOn