Our CoSyn-400K dataset contains 9 categories of synthetic text-rich images paired with 2.7M instruction-tuning examples.
Our models trained on synthetic data achieve competitive performance on text-rich benchmarks.
Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.
Our Code-Guided Synthetic data generation system (CoSyn) supports 20 generation pipelines built on 11 rendering tools. Given a user query, e.g., "book cover", CoSyn selects the appropriate pipelines and starts by generating diverse topics conditioned on personas, then synthesizes detailed data for code generation. The generated code renders the image and is also fed as context to an LLM to construct instruction-tuning data.
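The sketch below illustrates this loop in Python for a single example. It is a simplified approximation rather than the released CoSyn code: the helper generate_with_llm is a stand-in for whatever text-only LLM client is used, and the prompts, file names, and the choice of matplotlib as the rendering tool are illustrative assumptions.

```python
# Minimal sketch of a code-guided synthesis loop (illustrative, not the released pipeline).
import json
import subprocess
from pathlib import Path


def generate_with_llm(prompt: str) -> str:
    """Placeholder for a call to a text-only LLM; plug in your own client here."""
    raise NotImplementedError("connect an LLM API client")


def synthesize_example(query: str, persona: str, out_dir: Path) -> dict:
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1. Topic and detailed content, conditioned on a persona for diversity.
    topic = generate_with_llm(
        f"As {persona}, propose one specific topic for a '{query}' image."
    )
    data = generate_with_llm(
        f"Write the detailed content (values, labels, text) for a '{query}' about: {topic}"
    )

    # 2. Code generation: ask the LLM for a self-contained rendering script
    #    (matplotlib here; the real system chooses among its 11 rendering tools).
    code = generate_with_llm(
        "Write a self-contained Python matplotlib script that renders this data "
        f"as a '{query}' and saves it to 'image.png':\n{data}"
    )
    script = out_dir / "render.py"
    script.write_text(code)
    subprocess.run(["python", str(script)], cwd=out_dir, check=True, timeout=60)

    # 3. Instruction-tuning data: the code, not the pixels, serves as the textual
    #    representation of the image, so a text-only LLM can write grounded QA.
    qa = generate_with_llm(
        "Given the rendering code below, write question-answer pairs about the "
        f"image it produces, as a JSON list:\n{code}"
    )
    return {"image": str(out_dir / "image.png"), "code": code, "qa": json.loads(qa)}
```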
CoSyn uses 11 rendering tools to build 20 pipelines that support the generation of charts, documents, diagrams, vector graphics, and more.
Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2 V 11B, and surpass proprietary models such as GPT-4V, Gemini 1.5 Flash and Claude-3 Opus.
We identify a key limitation of open-source VLMs: they struggle to generalize to out-of-domain tasks they were not trained on. CoSyn, however, enables controllable data generation, allowing task-specific fine-tuning that achieves strong generalization with significantly less data. Download NutritionQA.
We synthesize pointing data by prompting an LLM to generate pointing questions and to edit the code so that it draws the points explicitly. Extracting the pixel values of these points gives their exact coordinates. We show that a VLM trained on our synthetic pointing data generalizes to real GUI agentic tasks.
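The coordinate-extraction step can be approximated as below. This is a minimal sketch under the assumption that the LLM-edited rendering code draws each queried point in a reserved marker color; the specific color and helper name are illustrative choices, not the paper's exact implementation.

```python
# Recover the exact pixel coordinates of a point drawn in a reserved marker color.
import numpy as np
from PIL import Image

MARKER_RGB = (255, 0, 255)  # assumed color used by the edited rendering code


def extract_point(image_path: str) -> tuple[float, float]:
    """Return the (x, y) pixel center of the marker-colored region."""
    rgb = np.asarray(Image.open(image_path).convert("RGB"))
    mask = np.all(rgb == MARKER_RGB, axis=-1)
    if not mask.any():
        raise ValueError("no marker-colored pixels found")
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())


# The recovered coordinates become the answer to the LLM-generated pointing
# question (e.g., "Point to the 'Submit' button."), after which the marker is
# removed from the final rendered image.
```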
Our models achieve state-of-the-art performance on the ScreenSpot click prediction task.
@article{yang2025scaling,
title={Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation},
author={Yue Yang and Ajay Patel and Matt Deitke and Tanmay Gupta and Luca Weihs and Andrew Head and Mark Yatskar and Chris Callison-Burch and Ranjay Krishna and Aniruddha Kembhavi and Christopher Clark},
journal={arXiv preprint arXiv:2502.14846},
year={2025}
}