A Math Formula Extraction and Evaluation Framework for PDF Documents

Published in International Conference on Document Analysis and Recognition (ICDAR), 2021

Recommended citation: A. K. Shah, A. Dey, and R. Zanibbi, “A Math Formula Extraction and Evaluation Framework for PDF Documents,” in Document Analysis and Recognition – ICDAR 2021, Cham, 2021, pp. 19–34. doi: 10.1007/978-3-030-86331-9_2.



We present a processing pipeline for math formula extraction in PDF documents that takes advantage of character information in born-digital PDFs (e.g., created using LaTeX or Word). Our pipeline is designed for indexing math in technical document collections to support math-aware search engines capable of processing queries containing keywords and formulas. The system includes user-friendly tools for visualizing recognition results in HTML pages. Our pipeline comprises a new state-of-the-art PDF character extractor that identifies precise bounding boxes for non-Latin symbols, a novel Single Shot Detector-based formula detector, and an existing graph-based formula parser (QD-GGA) for recognizing formula structure. To simplify analyzing structure recognition errors, we have extended the LgEval library (from the CROHME competitions) to allow viewing all instances of specific errors by clicking on HTML links. Our source code is publicly available.
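As a rough illustration of how the stages described above could fit together, the sketch below groups extracted characters by the detected formula region that contains them, so that each group could then be handed to a structure parser. It is a minimal sketch, not the authors' implementation: the `Box` and `Char` types, the `group_chars_by_formula` function, and the center-point containment rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def center(self) -> Tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    def contains_point(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class Char:
    """A character from a born-digital PDF with its bounding box (hypothetical)."""
    symbol: str
    box: Box


def group_chars_by_formula(chars: List[Char], regions: List[Box]) -> Dict[int, List[Char]]:
    """Assign each character to the first formula region containing its center.

    Assumption: center-point containment is a simplification chosen for this
    sketch; a real system may use a different assignment rule.
    """
    groups: Dict[int, List[Char]] = {i: [] for i in range(len(regions))}
    for ch in chars:
        cx, cy = ch.box.center()
        for i, region in enumerate(regions):
            if region.contains_point(cx, cy):
                groups[i].append(ch)
                break  # a character belongs to at most one region
    return groups


# Toy page: one text character and the two symbols of "x^2".
chars = [
    Char("T", Box(10, 10, 18, 22)),  # ordinary text, outside the formula region
    Char("x", Box(50, 10, 58, 22)),
    Char("2", Box(59, 6, 63, 13)),   # superscript
]
regions = [Box(48, 4, 66, 24)]       # one detected formula region

groups = group_chars_by_formula(chars, regions)
symbols = [c.symbol for c in groups[0]]
```

Here `symbols` holds the characters inside the single formula region, in extraction order. Working from character boxes rather than page pixels is what lets a pipeline of this kind exploit the information already present in born-digital PDFs.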


author="Shah, Ayush Kumar and Dey, Abhisek and Zanibbi, Richard",
editor="Llad{\'o}s, Josep and Lopresti, Daniel and Uchida, Seiichi",
title="A Math Formula Extraction and Evaluation Framework for PDF Documents",
booktitle="Document Analysis and Recognition -- ICDAR 2021",
publisher="Springer International Publishing",
