🧮 LaTeX-Math-Render-HQ (Sample Dataset)

Welcome to the LaTeX-Math-Render-HQ sample dataset!

This dataset provides high-quality, rendered images of complex mathematical expressions (formulas, theorems, and equations) paired directly with their underlying LaTeX source code and heavily enriched metadata (including AST complexity and exact bounding boxes).

🚀 Seeking the Full ~100k Dataset?
This dataset is a small sample (760 records) of a much larger, rigorously generated dataset containing over 100,000 highly diverse mathematical expressions spanning Calculus, Linear Algebra, Probability, Statistics, and more.

If you are interested in acquiring the paid, commercially-licensed full dataset (~100k records) for training Vision-Language Models (VLMs), OCR engines, or specialized math AI, please reach out!

📧 Contact for Full Dataset: [email protected]
(Send an email and we will notify you immediately once the full dataset is ready for distribution.)


👁️ Visual Proof & Data at a Glance

Unlike noisy OCR datasets, every image in this dataset is rendered synthetically with exact bounding box (bboxes) coordinates for each mathematical symbol, enabling advanced object-detection and logical layout training.

Clean 300 DPI RenderToken Bounding Boxes
Clean ImageBoxed Image

Although the sample uses highly compressed .parquet files for efficiency, a typical record in our full JSON-Lines format looks like this:

{
  "latex": "F_n = F_{n-1} + F_{n-2}, \\quad F_0=0, F_1=1",
  "complexity": 3.0,
  "engine": "pdflatex",
  "dpi": 300,
  "font_profile": "default",
  "dimensions": {"width": 816, "height": 68},
  "bboxes": "[{\"id\": 0, \"bbox\": [10, 20, 30, 45]}, ...]"
}

📊 Dataset Structure & Columns

Each record in this Parquet dataset represents a single rendered formula under a specific configuration (varying engines and DPIs to ensure model robustness).

Here is a detailed breakdown of the tabular columns provided:

Column NameData TypeDescription
imagebytesThe raw PNG image bytes of the rendered mathematical expression. You can load this directly using PIL.Image.open(io.BytesIO(row['image'])).
latexstringThe exact ground-truth LaTeX source code used to generate the image.
latext_tokenslist[string]A robustly tokenized version of the LaTeX string. Expressions are split into meaningful sequences (e.g., separating commands like \alpha from standalone characters). Useful for sequence-to-sequence training.
enginestringThe rendering engine used to generate the image. This sample contains renders from both pdflatex and mathjax.
font_profilestringThe font profile applied during rendering (e.g., default). The full dataset contains massive font variations to prevent model overfitting.
dpiintegerThe resolution (Dots Per Inch) of the rendered image. This sample includes 100 and 300 DPI variations for scale invariance testing.
widthintegerThe width of the generated image in pixels.
heightintegerThe height of the generated image in pixels.
complexityfloatA purely objective, algorithmically calculated Abstract Syntax Tree (AST) complexity score. Higher scores represent deeper structural nesting and more advanced operators (e.g. nested integrals vs simple arithmetic).
bboxesstringStringified JSON array containing bounding boxes for individual equations/elements within the image (if generated by the engine metadata).

🌟 Why Use This Dataset?

💡 Quick Start

You can load and visualize this sample dataset easily using Python, Pandas, and Pillow:

import pandas as pd
from PIL import Image
import io

# Load the dataset
df = pd.read_parquet("dataset.parquet")

# View the first record
sample = df.iloc[0]

print(f"LaTeX: {sample['latex']}")
print(f"Engine: {sample['engine']} | DPI: {sample['dpi']}")
print(f"Complexity Score: {sample['complexity']}")

# Display the image
image = Image.open(io.BytesIO(sample['image']))
image.show()

🔮 Upcoming Feature (Full Dataset Only)

Real-world Augmentation (Shadows, Paper Textures, Motion Blur) is coming to the full 100k dataset.

For developers building “Photo-to-Math” or “Homework-Solver” applications, models trained purely on clean digital renders often fail in production. The upcoming pipeline injects mathematically principled camera distortions, lighting gradients, and paper grain to simulate real-world mobile photography.

Interested in taking your mathematical AI to the next level? Don’t forget to email [email protected] to get notified about the 100k+ commercial dataset!