Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
12-2025
Abstract
Spreadsheets are widely used for data analysis and reporting, yet their complex structure and formula logic pose significant challenges for AI systems. We introduce Sheetpedia, a large-scale corpus of over 290,000 diverse spreadsheets (from 324,000+ workbooks) compiled from enterprise email archives and online forums. We detail a rigorous collection and preprocessing pipeline (integrating the Enron email spreadsheet archive and the Fuse web corpus, plus a new crawl of Excel forums) to standardize formats, filter languages, and remove duplicates. Sheetpedia provides extensive coverage of real formulas and annotations – addressing a gap left by prior table datasets (e.g. web tables used in TURL or Text-to-SQL in Spider) which often lack formula semantics. We present comprehensive corpus statistics, highlighting rich formula diversity and a majority (78%+) of English content. To demonstrate the corpus’s utility, we fine-tune large language models on Sheetpedia for two novel spreadsheet understanding tasks: Natural Language to Semantic Range (NL2SR) and Natural Language to Formula (NL2Formula). Using a rejection-sampling data generation strategy, our fine-tuned models achieve up to 97.5% accuracy on NL2SR and 71.7% on NL2Formula – substantially outperforming baseline approaches. Sheetpedia (to be released publicly) fills a crucial need for a large, high-quality spreadsheet benchmark, enabling more effective spreadsheet intelligence and natural language interfaces for spreadsheet tools.
Discipline
Artificial Intelligence and Robotics
Areas of Excellence
Digital transformation
Publication
Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, December 2-7
First Page
1
Last Page
31
City or Country
USA
Citation
TIAN, Zailong; HAN, Zhuoheng; WANG, Houfeng; and LIAO, Lizi.
Sheetpedia: A 300K-spreadsheet corpus for spreadsheet intelligence and LLM fine-tuning. (2025). Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, December 2-7. 1-31.
Available at: https://ink.library.smu.edu.sg/sis_research/10751
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://openreview.net/forum?id=4vLYwlA3X5