We introduce ScImage V2, a benchmark dataset for evaluating scientific image generation across four major domains: biology, mathematics, computer science, and physics. ScImage V2 features an expanded terminology set, diverse mathematical functions, and a broad range of chart types. It includes over 2,000 high-quality, template-based text-to-image pairs designed to evaluate fine-grained scientific image generation capabilities. The dataset supports the development and assessment of more capable and reliable multimodal LLMs for scientific applications.
- Chart_Types: JSON files specifying all chart types used
- Filled_Templates: CSVs of all 10 filled template batches (prompt + output)
- Groupings: Grouped terms used to fill templates
- Human_Evals: Human annotations for template evaluation, correction, and filtering
- Plots: PDFs of all chart visuals used in the accompanying paper
- Python_Code_1000: Python scripts for the first 1000 template examples
- Python_Images_291: Rendered Python-generated images (291 samples)
- TikZ_Code_1000: TikZ code for the first 1000 template examples
- TikZ_Images_291: Rendered TikZ-generated images (291 samples)
- ScImage_V1: Prompts and templates from ScImage V1 for baseline comparison
- Scripts: Code for extracting chart types, generating plots, and filling templates
- Templates: Human-curated templates (domain terms, math functions, charts)
- ScImage_V2_Presentation: ScImage V2 Dataset Presentation
- ScImage_V2_Paper: ScImage V2 Dataset Paper
- Understanding_Reasoning_Types: The types of reasoning the template involves: Attribute, Spatial, Numerical, or any combination of these.
- Reasoning: Indicates whether the template requires reasoning to be completed correctly.
- Difficulty: An integer from 1 (easy) to 3 (hard), reflecting the complexity of the template.
- Template_Type: The category of the template. Options include domain_term (terms from DaTikZ V3), math_function, or chart.
- Group: High-level domain category of the term, such as Computer Science, Math, Biology, or Physics.
- Subgroup: A more specific classification within the selected group, such as Computational Geometry within Computer Science.
- Template: The original template with placeholders, used for inserting selected terms.
- Chosen_Terms: The specific terms selected by the LLM (GPT-4o) to fill into the template.
- Filled_Template: The initial version of the template after term insertion, generated by GPT-4o.
- Corrected_Template: A revised and improved version of the filled template, also generated by GPT-4o.
- Evaluated_Template: Binary evaluation indicating whether the corrected template is acceptable (1 = good, 0 = still problematic).
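
The documented value ranges of these fields can be checked mechanically. The following is a minimal sketch, assuming each row has been loaded as a dictionary of strings keyed by the field names listed above. The helper name `check_row` and the sample row are hypothetical, and the on-disk encoding of the values (for example, that Evaluated_Template is stored as `0` or `1`) is an assumption rather than something the dataset guarantees.

```python
# Hypothetical validator for one metadata row; field names follow the list above.
ALLOWED_TEMPLATE_TYPES = {"domain_term", "math_function", "chart"}


def check_row(row):
    """Return a list of human-readable problems found in one row (empty if none)."""
    problems = []
    # Difficulty is documented as an integer from 1 (easy) to 3 (hard).
    try:
        difficulty = int(row["Difficulty"])
    except (KeyError, ValueError):
        problems.append("Difficulty is missing or not an integer")
    else:
        if difficulty not in (1, 2, 3):
            problems.append(f"Difficulty out of range: {difficulty}")
    # Template_Type is documented as one of three categories.
    if row.get("Template_Type") not in ALLOWED_TEMPLATE_TYPES:
        problems.append(f"Unexpected Template_Type: {row.get('Template_Type')!r}")
    # Evaluated_Template is documented as binary (assumed stored as "0" or "1").
    if str(row.get("Evaluated_Template")).strip() not in {"0", "1"}:
        problems.append("Evaluated_Template is not 0 or 1")
    return problems


# Made-up sample row, for illustration only.
sample = {"Difficulty": "2", "Template_Type": "chart", "Evaluated_Template": "1"}
print(check_row(sample))
```

A check like this could be run over every row before use, so that unexpected values surface early instead of silently skewing a difficulty or template-type breakdown.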
For downstream use, prefer the Corrected_Template over the Filled_Template. Filtering on Evaluated_Template = 1 makes it more likely that the corrected templates you work with are visualizable.
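
As an illustration of this recommendation, the sketch below keeps only rows with Evaluated_Template equal to 1 and returns their Corrected_Template text. It assumes the CSVs carry a header row with the field names listed above and that the evaluation flag is stored as the string `1`. The function name `select_prompts` and the in-memory sample data are hypothetical stand-ins, not actual dataset content.

```python
import csv
import io

# Made-up two-row sample standing in for one Filled_Templates CSV.
SAMPLE_CSV = """Filled_Template,Corrected_Template,Evaluated_Template
Draw a sine wave,Plot y = sin(x) from x = 0 to x = 2*pi,1
Draw a thing,Draw a thing,0
"""


def select_prompts(csv_file):
    """Yield Corrected_Template for rows marked acceptable (Evaluated_Template == 1)."""
    for row in csv.DictReader(csv_file):
        if row["Evaluated_Template"].strip() == "1":
            yield row["Corrected_Template"]


prompts = list(select_prompts(io.StringIO(SAMPLE_CSV)))
print(prompts)
```

To apply the same filter to a real file, pass an open file handle (for example, `open(path, newline="", encoding="utf-8")`) instead of the `io.StringIO` sample.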