Publications

Conference Papers


DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA

Published in ICCV (VisionDocs Workshop), 2025 | Code Repository | Demo
Oral Spotlight & Best Paper Award
Rayane Bencharef, Abderrahmane Rahiche, Mohamed Cheriet

This paper addresses the optimization of the Visual Encoder (VE) in end-to-end DocVQA architectures. DIVE-Doc replaces the foundational VE with a hierarchical architecture that is 5x smaller and trained through distillation, with a performance gap of 2.10 (ANLS) relative to its teacher, PaliGemma. This reduction also halves the latency of the visual module. Evaluating the VE on downstream tasks (document classification and layout analysis) suggests that the VE captures the global structure of documents, while semantic layout understanding appears to be achieved inside the language model decoder.