🧮 LaTeX-Math-Render-HQ (Sample Dataset)
Welcome to the LaTeX-Math-Render-HQ sample dataset!
This dataset provides high-quality, rendered images of complex mathematical expressions (formulas, theorems, and equations) paired directly with their underlying LaTeX source code and heavily enriched metadata (including AST complexity and exact bounding boxes).
🚀 Seeking the Full ~100k Dataset?
This dataset is a small sample (760 records) of a much larger, rigorously generated dataset containing over 100,000 highly diverse mathematical expressions spanning Calculus, Linear Algebra, Probability, Statistics, and more.If you are interested in acquiring the paid, commercially-licensed full dataset (~100k records) for training Vision-Language Models (VLMs), OCR engines, or specialized math AI, please reach out!
📧 Contact for Full Dataset: [email protected]
(Send an email and we will notify you immediately once the full dataset is ready for distribution.)
👁️ Visual Proof & Data at a Glance
Unlike noisy OCR datasets, every image in this dataset is rendered synthetically with exact bounding box (bboxes) coordinates for each mathematical symbol, enabling advanced object-detection and logical layout training.
| Clean 300 DPI Render | Token Bounding Boxes |
|---|---|
![]() | ![]() |
Although the sample uses highly compressed .parquet files for efficiency, a typical record in our full JSON-Lines format looks like this:
{
"latex": "F_n = F_{n-1} + F_{n-2}, \\quad F_0=0, F_1=1",
"complexity": 3.0,
"engine": "pdflatex",
"dpi": 300,
"font_profile": "default",
"dimensions": {"width": 816, "height": 68},
"bboxes": "[{\"id\": 0, \"bbox\": [10, 20, 30, 45]}, ...]"
}
📊 Dataset Structure & Columns
Each record in this Parquet dataset represents a single rendered formula under a specific configuration (varying engines and DPIs to ensure model robustness).
Here is a detailed breakdown of the tabular columns provided:
| Column Name | Data Type | Description |
|---|---|---|
image | bytes | The raw PNG image bytes of the rendered mathematical expression. You can load this directly using PIL.Image.open(io.BytesIO(row['image'])). |
latex | string | The exact ground-truth LaTeX source code used to generate the image. |
latext_tokens | list[string] | A robustly tokenized version of the LaTeX string. Expressions are split into meaningful sequences (e.g., separating commands like \alpha from standalone characters). Useful for sequence-to-sequence training. |
engine | string | The rendering engine used to generate the image. This sample contains renders from both pdflatex and mathjax. |
font_profile | string | The font profile applied during rendering (e.g., default). The full dataset contains massive font variations to prevent model overfitting. |
dpi | integer | The resolution (Dots Per Inch) of the rendered image. This sample includes 100 and 300 DPI variations for scale invariance testing. |
width | integer | The width of the generated image in pixels. |
height | integer | The height of the generated image in pixels. |
complexity | float | A purely objective, algorithmically calculated Abstract Syntax Tree (AST) complexity score. Higher scores represent deeper structural nesting and more advanced operators (e.g. nested integrals vs simple arithmetic). |
bboxes | string | Stringified JSON array containing bounding boxes for individual equations/elements within the image (if generated by the engine metadata). |
🌟 Why Use This Dataset?
- Perfect Ground Truth: No OCR noise. Every image is synthetically generated directly from its LaTeX pair.
- Unmatched Diversity:
- Font Engines: Generates output using
mathjaxandpdflatex(with support for LaTeX fonts like Latin Modern, Times, Helvetica, etc.) to prevent model overfitting. - Symbol Coverage: Synthesizes expressions using over 500+ standard mathematical operators and symbols, spanning deep linear algebra matrices to complex vector calculus.
- Font Engines: Generates output using
- Structural Complexity: The embedded
complexityscore (Ast Depth) allows you to easily implement Curriculum Learning (training models on easy formulas first, then progressively harder ones).
💡 Quick Start
You can load and visualize this sample dataset easily using Python, Pandas, and Pillow:
import pandas as pd
from PIL import Image
import io
# Load the dataset
df = pd.read_parquet("dataset.parquet")
# View the first record
sample = df.iloc[0]
print(f"LaTeX: {sample['latex']}")
print(f"Engine: {sample['engine']} | DPI: {sample['dpi']}")
print(f"Complexity Score: {sample['complexity']}")
# Display the image
image = Image.open(io.BytesIO(sample['image']))
image.show()
🔮 Upcoming Feature (Full Dataset Only)
Real-world Augmentation (Shadows, Paper Textures, Motion Blur) is coming to the full 100k dataset.
For developers building “Photo-to-Math” or “Homework-Solver” applications, models trained purely on clean digital renders often fail in production. The upcoming pipeline injects mathematically principled camera distortions, lighting gradients, and paper grain to simulate real-world mobile photography.
Interested in taking your mathematical AI to the next level? Don’t forget to email [email protected] to get notified about the 100k+ commercial dataset!

