🧮 LaTeX-Math-Render-HQ (Sample Dataset)

Welcome to the LaTeX-Math-Render-HQ sample dataset!

This dataset provides high-quality, rendered images of complex mathematical expressions (formulas, theorems, and equations) paired directly with their underlying LaTeX source code and heavily enriched metadata (including AST complexity and exact bounding boxes).

🚀 Seeking the Full ~100k Dataset?
This dataset is a small sample (760 records) of a much larger, rigorously generated dataset containing over 100,000 highly diverse mathematical expressions spanning Calculus, Linear Algebra, Probability, Statistics, and more.

If you are interested in acquiring the paid, commercially-licensed full dataset (~100k records) for training Vision-Language Models (VLMs), OCR engines, or specialized math AI, please reach out!

📧 Contact for Full Dataset: [email protected]
(Send an email and we will notify you immediately once the full dataset is ready for distribution.)

👁️ Visual Proof & Data at a Glance

Unlike noisy OCR datasets, every image in this dataset is rendered synthetically with exact bounding box (bboxes) coordinates for each mathematical symbol, enabling advanced object-detection and logical layout training.

Clean 300 DPI Render	Token Bounding Boxes

Although the sample uses highly compressed .parquet files for efficiency, a typical record in our full JSON-Lines format looks like this:

{
  "latex": "F_n = F_{n-1} + F_{n-2}, \\quad F_0=0, F_1=1",
  "complexity": 3.0,
  "engine": "pdflatex",
  "dpi": 300,
  "font_profile": "default",
  "dimensions": {"width": 816, "height": 68},
  "bboxes": "[{\"id\": 0, \"bbox\": [10, 20, 30, 45]}, ...]"
}

📊 Dataset Structure & Columns

Each record in this Parquet dataset represents a single rendered formula under a specific configuration (varying engines and DPIs to ensure model robustness).

Here is a detailed breakdown of the tabular columns provided:

Column Name	Data Type	Description
`image`	`bytes`	The raw PNG image bytes of the rendered mathematical expression. You can load this directly using `PIL.Image.open(io.BytesIO(row['image']))`.
`latex`	`string`	The exact ground-truth LaTeX source code used to generate the image.
`latext_tokens`	`list[string]`	A robustly tokenized version of the LaTeX string. Expressions are split into meaningful sequences (e.g., separating commands like `\alpha` from standalone characters). Useful for sequence-to-sequence training.
`engine`	`string`	The rendering engine used to generate the image. This sample contains renders from both `pdflatex` and `mathjax`.
`font_profile`	`string`	The font profile applied during rendering (e.g., `default`). The full dataset contains massive font variations to prevent model overfitting.
`dpi`	`integer`	The resolution (Dots Per Inch) of the rendered image. This sample includes `100` and `300` DPI variations for scale invariance testing.
`width`	`integer`	The width of the generated image in pixels.
`height`	`integer`	The height of the generated image in pixels.
`complexity`	`float`	A purely objective, algorithmically calculated Abstract Syntax Tree (AST) complexity score. Higher scores represent deeper structural nesting and more advanced operators (e.g. nested integrals vs simple arithmetic).
`bboxes`	`string`	Stringified JSON array containing bounding boxes for individual equations/elements within the image (if generated by the engine metadata).

🌟 Why Use This Dataset?

Perfect Ground Truth: No OCR noise. Every image is synthetically generated directly from its LaTeX pair.
Unmatched Diversity:
- Font Engines: Generates output using mathjax and pdflatex (with support for LaTeX fonts like Latin Modern, Times, Helvetica, etc.) to prevent model overfitting.
- Symbol Coverage: Synthesizes expressions using over 500+ standard mathematical operators and symbols, spanning deep linear algebra matrices to complex vector calculus.
Structural Complexity: The embedded complexity score (Ast Depth) allows you to easily implement Curriculum Learning (training models on easy formulas first, then progressively harder ones).

💡 Quick Start

You can load and visualize this sample dataset easily using Python, Pandas, and Pillow:

import pandas as pd
from PIL import Image
import io

# Load the dataset
df = pd.read_parquet("dataset.parquet")

# View the first record
sample = df.iloc[0]

print(f"LaTeX: {sample['latex']}")
print(f"Engine: {sample['engine']} | DPI: {sample['dpi']}")
print(f"Complexity Score: {sample['complexity']}")

# Display the image
image = Image.open(io.BytesIO(sample['image']))
image.show()

🔮 Upcoming Feature (Full Dataset Only)

Real-world Augmentation (Shadows, Paper Textures, Motion Blur) is coming to the full 100k dataset.

For developers building “Photo-to-Math” or “Homework-Solver” applications, models trained purely on clean digital renders often fail in production. The upcoming pipeline injects mathematically principled camera distortions, lighting gradients, and paper grain to simulate real-world mobile photography.

Interested in taking your mathematical AI to the next level? Don’t forget to email [email protected] to get notified about the 100k+ commercial dataset!