Our CoSyn-400K dataset contains 9 categories of synthetic text-rich images paired with 2.7M instruction-tuning examples.
Our models trained on synthetic data achieve competitive performance on text-rich benchmarks.
Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.
Our Code-Guided Synthetic data generation system (CoSyn) supports 20 generation pipelines built on 11 rendering tools. Given a user query, e.g., "book cover", CoSyn selects the appropriate pipelines and starts by generating diverse topics conditioned on personas, then synthesizes detailed data for code generation. The generated code renders the image and is also fed as context to an LLM to construct instruction-tuning data.
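The sketch below illustrates this loop in Python for a single example. It is a simplified approximation rather than the released CoSyn code: the helper generate_with_llm is a stand-in for whatever text-only LLM client is used, and the prompts, file names, and the choice of matplotlib as the rendering tool are illustrative assumptions.

```python
# Minimal sketch of a code-guided synthesis loop (illustrative, not the released pipeline).
import json
import subprocess
from pathlib import Path


def generate_with_llm(prompt: str) -> str:
    """Placeholder for a call to a text-only LLM; plug in your own client here."""
    raise NotImplementedError("connect an LLM API client")


def synthesize_example(query: str, persona: str, out_dir: Path) -> dict:
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1. Topic and detailed content, conditioned on a persona for diversity.
    topic = generate_with_llm(
        f"As {persona}, propose one specific topic for a '{query}' image."
    )
    data = generate_with_llm(
        f"Write the detailed content (values, labels, text) for a '{query}' about: {topic}"
    )

    # 2. Code generation: ask the LLM for a self-contained rendering script
    #    (matplotlib here; the real system chooses among its 11 rendering tools).
    code = generate_with_llm(
        "Write a self-contained Python matplotlib script that renders this data "
        f"as a '{query}' and saves it to 'image.png':\n{data}"
    )
    script = out_dir / "render.py"
    script.write_text(code)
    subprocess.run(["python", str(script)], cwd=out_dir, check=True, timeout=60)

    # 3. Instruction-tuning data: the code, not the pixels, serves as the textual
    #    representation of the image, so a text-only LLM can write grounded QA.
    qa = generate_with_llm(
        "Given the rendering code below, write question-answer pairs about the "
        f"image it produces, as a JSON list:\n{code}"
    )
    return {"image": str(out_dir / "image.png"), "code": code, "qa": json.loads(qa)}
```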
CoSyn uses 11 rendering tools to build 20 pipelines that support the generation of charts, documents, diagrams, vector graphics, and more.
Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2 V 11B, and surpass proprietary models such as GPT-4V, Gemini 1.5 Flash and Claude-3 Opus.
We identify a key limitation of open-source VLMs: they struggle to generalize to out-of-domain tasks they were not trained on. CoSyn, however, enables controllable data generation, allowing task-specific fine-tuning that achieves strong generalization with significantly less data. Download NutritionQA.
We synthesize pointing data by prompting an LLM to generate pointing questions and to edit the code so that it draws the points explicitly. Extracting the pixel values of these points gives their exact coordinates. We show that a VLM trained on our synthetic pointing data generalizes to real GUI agentic tasks.
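The coordinate-extraction step can be approximated as below. This is a minimal sketch under the assumption that the LLM-edited rendering code draws each queried point in a reserved marker color; the specific color and helper name are illustrative choices, not the paper's exact implementation.

```python
# Recover the exact pixel coordinates of a point drawn in a reserved marker color.
import numpy as np
from PIL import Image

MARKER_RGB = (255, 0, 255)  # assumed color used by the edited rendering code


def extract_point(image_path: str) -> tuple[float, float]:
    """Return the (x, y) pixel center of the marker-colored region."""
    rgb = np.asarray(Image.open(image_path).convert("RGB"))
    mask = np.all(rgb == MARKER_RGB, axis=-1)
    if not mask.any():
        raise ValueError("no marker-colored pixels found")
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())


# The recovered coordinates become the answer to the LLM-generated pointing
# question (e.g., "Point to the 'Submit' button."), after which the marker is
# removed from the final rendered image.
```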
Our models achieve state-of-the-art performance on the ScreenSpot click prediction task.
@article{yang2025scaling,
title={Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation},
author={Yue Yang and Ajay Patel and Matt Deitke and Tanmay Gupta and Luca Weihs and Andrew Head and Mark Yatskar and Chris Callison-Burch and Ranjay Krishna and Aniruddha Kembhavi and Christopher Clark},
journal={arXiv preprint arXiv:2502.14846},
year={2025}
}