Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

12-2025

Abstract

Spreadsheets are widely used for data analysis and reporting, yet their complex structure and formula logic pose significant challenges for AI systems. We introduce Sheetpedia, a large-scale corpus of over 290,000 diverse spreadsheets (from 324,000+ workbooks) compiled from enterprise email archives and online forums. We detail a rigorous collection and preprocessing pipeline (integrating the Enron email spreadsheet archive and the Fuse web corpus, plus a new crawl of Excel forums) to standardize formats, filter languages, and remove duplicates. Sheetpedia provides extensive coverage of real formulas and annotations – addressing a gap left by prior table datasets (e.g. web tables used in TURL or Text-to-SQL in Spider) which often lack formula semantics. We present comprehensive corpus statistics, highlighting rich formula diversity and a majority (78%+) of English content. To demonstrate the corpus’s utility, we fine-tune large language models on Sheetpedia for two novel spreadsheet understanding tasks: Natural Language to Semantic Range (NL2SR) and Natural Language to Formula (NL2Formula). Using a rejection-sampling data generation strategy, our fine-tuned models achieve up to 97.5% accuracy on NL2SR and 71.7% on NL2Formula – substantially outperforming baseline approaches. Sheetpedia (to be released publicly) fills a crucial need for a large, high-quality spreadsheet benchmark, enabling more effective spreadsheet intelligence and natural language interfaces for spreadsheet tools.

Discipline

Artificial Intelligence and Robotics

Areas of Excellence

Digital transformation

Publication

Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, December 2-7

First Page

1

Last Page

31

City or Country

USA

Additional URL

https://openreview.net/forum?id=4vLYwlA3X5

Share

COinS