Parsing Excel formulas: A grammar and its application on 4 large datasets

Efthimia Aivaloglou; David Hoepelman; Felienne Hermans

doi:10.1002/smr.1895

Software Engineering

Research output: Contribution to journal › Special issue › Scientific › peer-review

Abstract

Spreadsheets are popular end user programming tools, especially in the industrial world. This makes them interesting research targets. However, there does not exist a reliable grammar that is concise enough to facilitate formula parsing and analysis and to support research on spreadsheet codebases. This paper presents a grammar for spreadsheet formulas that can successfully parse 99.99% of more than 8 million unique formulas extracted from 4 spreadsheet datasets. Our grammar is compatible with the spreadsheet formula language, recognizes the spreadsheet formula elements that are required for supporting spreadsheets research, and produces parse trees aimed at further manipulation and analysis. Additionally, we use the grammar to analyze the characteristics of the formulas of the 4 datasets in 3 different dimensions: complexity, functionality, and data utilization. Our results show that (1) most Excel formulas are simple, however formulas with more than 50 functions or operations exist, (2) almost all formulas use data from other cells, which is often not local, and (3) a surprising number of referring mechanisms are used by less than 1% of the formulas.

Original language	English
Pages (from-to)	1-19
Number of pages	19
Journal	Journal of Software: Evolution and Process
Volume	29
Issue number	12
DOIs	https://doi.org/10.1002/smr.1895
Publication status	Published - 1 Dec 2017
Event	SCAM 2015, Bremen, Germany - Piscataway; Duration: 27 Sept 2015 → 28 Sept 2015

Bibliographical note

Special Issue on Source Code Analysis and Manipulation (SCAM 2015)

Keywords

formula grammar
spreadsheets
syntax

Access to Document

10.1002/smr.1895

Cite this

@article{782ebf329a724ea49b7bc12de6d50e5b,
  title = "Parsing Excel formulas: A grammar and its application on 4 large datasets",
  abstract = "Spreadsheets are popular end user programming tools, especially in the industrial world. This makes them interesting research targets. However, there does not exist a reliable grammar that is concise enough to facilitate formula parsing and analysis and to support research on spreadsheet codebases. This paper presents a grammar for spreadsheet formulas that can successfully parse 99.99% of more than 8 million unique formulas extracted from 4 spreadsheet datasets. Our grammar is compatible with the spreadsheet formula language, recognizes the spreadsheet formula elements that are required for supporting spreadsheets research, and produces parse trees aimed at further manipulation and analysis. Additionally, we use the grammar to analyze the characteristics of the formulas of the 4 datasets in 3 different dimensions: complexity, functionality, and data utilization. Our results show that (1) most Excel formulas are simple, however formulas with more than 50 functions or operations exist, (2) almost all formulas use data from other cells, which is often not local, and (3) a surprising number of referring mechanisms are used by less than 1% of the formulas.",
  keywords = "formula grammar, spreadsheets, syntax",
  author = "Efthimia Aivaloglou and David Hoepelman and Felienne Hermans",
  note = "Special Issue on Source Code Analysis and Manipulation (SCAM 2015); SCAM 2015, Bremen, Germany; Conference date: 27-09-2015 through 28-09-2015",
  year = "2017",
  month = dec,
  day = "1",
  doi = "10.1002/smr.1895",
  language = "English",
  volume = "29",
  pages = "1--19",
  journal = "Journal of Software: Evolution and Process",
  issn = "2047-7473",
  number = "12",
}

TY - JOUR

T1 - Parsing Excel formulas

T2 - SCAM 2015, Bremen, Germany

AU - Aivaloglou, Efthimia

AU - Hoepelman, David

AU - Hermans, Felienne

N1 - Special Issue on Source Code Analysis and Manipulation (SCAM 2015)

PY - 2017/12/1

Y1 - 2017/12/1

AB - Spreadsheets are popular end user programming tools, especially in the industrial world. This makes them interesting research targets. However, there does not exist a reliable grammar that is concise enough to facilitate formula parsing and analysis and to support research on spreadsheet codebases. This paper presents a grammar for spreadsheet formulas that can successfully parse 99.99% of more than 8 million unique formulas extracted from 4 spreadsheet datasets. Our grammar is compatible with the spreadsheet formula language, recognizes the spreadsheet formula elements that are required for supporting spreadsheets research, and produces parse trees aimed at further manipulation and analysis. Additionally, we use the grammar to analyze the characteristics of the formulas of the 4 datasets in 3 different dimensions: complexity, functionality, and data utilization. Our results show that (1) most Excel formulas are simple, however formulas with more than 50 functions or operations exist, (2) almost all formulas use data from other cells, which is often not local, and (3) a surprising number of referring mechanisms are used by less than 1% of the formulas.

KW - formula grammar

KW - spreadsheets

KW - syntax

U2 - 10.1002/smr.1895

DO - 10.1002/smr.1895

M3 - Special issue

SN - 2047-7473

VL - 29

SP - 1

EP - 19

JO - Journal of Software: Evolution and Process

JF - Journal of Software: Evolution and Process

IS - 12

Y2 - 27 September 2015 through 28 September 2015

ER -
