Scaling Text-Rich Image Understanding
via Code-Guided Synthetic
Multimodal Data Generation

1University of Pennsylvania,   2Allen Institute for Artificial Intelligence
*Equal Contribution

Our CoSyn-400K dataset contains 9 categories of synthetic text-rich images paired with 2.7M instruction-tuning examples.


[Average performance on 7 text-rich benchmarks: ChartQA, DocVQA, InfoVQA, TableVQA, AI2D, TextVQA, ScreenQA.
Our zero-shot model is not exposed to any training instances from the evaluation benchmarks.]

Our models trained on synthetic data achieve competitive performance on text-rich benchmarks.

Abstract

Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.

Code-Guided Synthetic Data Generation

Our Code-Guided Synthetic data generation system (CoSyn) supports 20 generation pipelines built on 11 rendering tools. Given a user query, e.g., "book cover", CoSyn selects the appropriate pipelines and starts by generating diverse topics conditioned on personas, then synthesizes detailed data for code generation. The generated code renders the image and also serves as context for an LLM to construct instruction-tuning data, as sketched below.
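As an illustration, here is a minimal Python sketch of this loop. The helper call_llm, the prompts, and the choice of matplotlib as the rendering backend are placeholders for whichever LLM client and rendering tool a pipeline uses (the actual system supports 11 rendering tools and 20 pipelines); this is a sketch of the idea, not the released implementation.

# Minimal sketch of the CoSyn generation loop (hypothetical helper names;
# the real pipelines, prompts, and tool set are described in the paper).
import subprocess
import tempfile
from pathlib import Path

def call_llm(prompt: str) -> str:
    """Placeholder for any text-only LLM API client."""
    raise NotImplementedError("plug in your LLM client here")

def generate_example(query: str, persona: str) -> dict:
    # 1. Sample a concrete topic conditioned on a persona to diversify content.
    topic = call_llm(f"As {persona}, propose a specific topic for a '{query}' image.")

    # 2. Synthesize the detailed content (numbers, labels, text) for that topic.
    data = call_llm(f"Write realistic content for a '{query}' about: {topic}")

    # 3. Ask the LLM for rendering code (Python, HTML, LaTeX, ...); here we assume
    #    a matplotlib script that saves 'figure.png' in the working directory.
    code = call_llm("Write a matplotlib script that renders this content as an image "
                    f"and saves it to figure.png:\n{data}")

    # 4. Execute the generated code to render the synthetic image.
    workdir = Path(tempfile.mkdtemp())
    (workdir / "render.py").write_text(code)
    subprocess.run(["python", "render.py"], cwd=workdir, check=True, timeout=60)

    # 5. Because the code is a faithful textual representation of the image,
    #    a text-only LLM can write grounded instruction-tuning Q&A from it.
    qa = call_llm(f"Given the code below, write question-answer pairs about the "
                  f"rendered image:\n{code}")

    return {"image": workdir / "figure.png", "code": code, "qa": qa}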

CoSyn can Generate Diverse Text-rich Images

CoSyn uses 11 rendering tools to build 20 pipelines that support the generation of charts, documents, diagrams, vector graphics, and many other image types.


[Example image types: Geographic Plot, Waterfall Chart, Sunburst Plot, Treemap, Gauge Chart, Ternary Plot, Rose Chart, Volcano Plot, Isotype Chart, Ridgeline Plot, Menu, Business Card, Table, Math Problem, Receipt, GitHub Page, Résumé, Shipping Label, Infographic, Book Cover, Booking Website, Phone Screen, Music Sheet, Chemical Structure, Solid Geometry, Electrical Circuit, Quadrant Chart, Terminal, IQ Test, Flow Chart, Sankey Diagram]

Model Trained on CoSyn-400K Achieves SOTA Performance

Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2 V 11B, and surpass proprietary models such as GPT-4V, Gemini 1.5 Flash and Claude-3 Opus.


[The best-performing open-source result is bold, and the second best is underlined. Models marked with † have open data and code for multimodal training. Models marked with * are zero-shot models, i.e., not trained on instances from any of the evaluation datasets.]

CoSyn can Generate Targeted Data for Domain Generalization

We identify a key limitation of open-source VLMs: they struggle to generalize to out-of-domain tasks they were not trained on. CoSyn enables controllable data generation, so task-specific fine-tuning achieves strong generalization with significantly less data (a usage sketch follows the figure caption below). Download NutritionQA.

[Zero-shot performance on NutritionQA. The x-axis denotes the number of training examples used for the instruction-tuning stage. Models toward the upper left are more data-efficient.]
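As an illustration, the hypothetical generate_example helper from the earlier sketch could be pointed at a new domain to build a small, targeted fine-tuning set; the persona list and example counts below are illustrative only, not the configuration used in the paper.

# Hypothetical targeted generation for a new domain (nutrition fact labels).
personas = ["a registered dietitian", "a grocery shopper", "a food-science student"]
targeted_set = [
    generate_example("nutrition facts label", persona)
    for persona in personas
    for _ in range(100)  # a small, targeted set for task-specific fine-tuning
]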

Synthetic Pointing Data for Agentic Tasks

We synthesize pointing data by prompting an LLM to generate pointing questions and to edit the rendering code so that the queried points are drawn explicitly. By locating these drawn points in the rendered image, we obtain their exact pixel coordinates (a minimal sketch of this step follows). We show that a VLM trained on our synthetic pointing data generalizes to real GUI agentic tasks.
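A minimal sketch of the coordinate-extraction step, assuming the LLM-edited code draws each queried point in a sentinel color; the marker color, helper name, and file names are illustrative assumptions, not the paper's exact implementation.

# Recover exact point coordinates from a rendered image in which the edited
# code has drawn the queried point in a unique sentinel color.
import numpy as np
from PIL import Image

MARKER_RGB = (255, 0, 255)  # sentinel color assumed not to appear elsewhere

def extract_point(image_path: str) -> tuple[float, float]:
    """Return the (x, y) pixel coordinates of the sentinel-colored marker."""
    pixels = np.asarray(Image.open(image_path).convert("RGB"))
    ys, xs = np.where(np.all(pixels == MARKER_RGB, axis=-1))
    if len(xs) == 0:
        raise ValueError("marker color not found in the rendered image")
    # Average over the marker's pixels to get its center.
    return float(xs.mean()), float(ys.mean())

# The unedited code renders the training image, while the extracted coordinates
# become the answer to the pointing question, e.g.
# "Point to the 'Submit' button" -> extract_point("marked.png")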



Our models achieve SOTA performance on the ScreenSpot click prediction task.

[Click accuracy on ScreenSpot. "Human" stands for training on the human-annotated data from PixMo-point.]

BibTeX

@article{yang2025scaling,
      title={Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation}, 
      author={Yue Yang and Ajay Patel and Matt Deitke and Tanmay Gupta and Luca Weihs and Andrew Head and Mark Yatskar and Chris Callison-Burch and Ranjay Krishna and Aniruddha Kembhavi and Christopher Clark},
      journal={arXiv preprint arXiv:2502.14846},
      year={2025}
}