Exploring the Feasibility of Crowd-Powered Decomposition of Complex User Questions in Text-to-SQL Tasks

Sara Salimzadeh; Ujwal Gadiraju; Claudia Hauff; Arie Van Deursen

doi:10.1145/3511095.3531282

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Natural Language Interfaces to Databases (NLIDBs), also known as Text-to-SQL models, enable users with different levels of knowledge of Structured Query Language (SQL) to access relational databases without any programming effort. By translating natural language into SQL queries, NLIDBs not only reduce the burden of memorizing database schemas and writing complex SQL queries, but also allow non-experts to retrieve information from databases in natural language. However, existing NLIDBs largely fail to translate natural language questions into SQL when the questions are complex, which prevents them from being deployed in real-world scenarios and from generalizing across unseen complex databases. In this paper, we explored the feasibility of decomposing complex user questions into multiple sub-questions, each with reduced complexity, as a means to circumvent the problem of complex SQL generation. We investigated whether complex user questions can be decomposed such that each sub-question is simple enough for existing NLIDBs to generate correct SQL queries, comparing decompositions by non-expert crowd workers with those by SQL experts. Through an empirical study on an NLIDB benchmark dataset, we found that crowd-powered decomposition of complex user questions raised the accuracy of an existing Text-to-SQL pipeline from 30% to 59% (a 96% relative improvement). Similarly, decomposition by SQL experts raised the accuracy to 76% (a 153% relative improvement). Our findings suggest that crowd-powered decomposition can be a scalable alternative for producing the training data necessary to build machine learning models that automatically decompose complex user questions, thereby improving Text-to-SQL pipelines.
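The two percentages given in parentheses in the abstract are relative gains over the 30% baseline, not percentage-point differences. The sketch below recomputes them from the rounded accuracies stated in the abstract. The function name `relative_gain` is a hypothetical helper introduced here for illustration. With the rounded inputs the results come out at roughly 96.7% and 153.3%; the abstract reports 96% and 153%, presumably because the authors computed them from unrounded accuracies (an assumption, since the abstract does not give those values).

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Rounded accuracies (in percent) as stated in the abstract.
baseline = 30.0  # existing Text-to-SQL pipeline on the original questions
crowd = 59.0     # with crowd-powered decomposition
expert = 76.0    # with decomposition by SQL experts

crowd_gain = relative_gain(baseline, crowd)
expert_gain = relative_gain(baseline, expert)

# Report both the absolute gain (percentage points) and the relative gain.
print(f"crowd:  +{crowd - baseline:.0f} points, {crowd_gain:.1f}% relative")
print(f"expert: +{expert - baseline:.0f} points, {expert_gain:.1f}% relative")
```

In absolute terms the gains are 29 and 46 percentage points respectively; keeping the two measures apart avoids reading "96%" as a final accuracy.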

Original language	English
Title of host publication	HT 2022
Subtitle of host publication	33rd ACM Conference on Hypertext and Social Media - Co-located with ACM WebSci 2022 and ACM UMAP 2022
Publisher	Association for Computing Machinery (ACM)
Pages	154-165
Number of pages	12
ISBN (Electronic)	978-1-4503-9233-4
DOIs	https://doi.org/10.1145/3511095.3531282
Publication status	Published - 2022
Event	33rd ACM Conference on Hypertext and Social Media, HT 2022 - Co-located with ACM WebSci 2022 and ACM UMAP 2022 - Virtual, Online, Spain
Duration	28 Jun 2022 → 1 Jul 2022

Keywords

Corpus Annotation
Crowdsourcing
Human Computation
Natural Language Interface to Databases
Semantic Parsing
Text-to-SQL

Access to Document

10.1145/3511095.3531282

Final published version, 1 MB, Licence: CC BY

Cite this

Salimzadeh, S., Gadiraju, U., Hauff, C., & Van Deursen, A. (2022). Exploring the Feasibility of Crowd-Powered Decomposition of Complex User Questions in Text-to-SQL Tasks. In HT 2022: 33rd ACM Conference on Hypertext and Social Media - Co-located with ACM WebSci 2022 and ACM UMAP 2022 (pp. 154-165). (HT 2022: 33rd ACM Conference on Hypertext and Social Media - Co-located with ACM WebSci 2022 and ACM UMAP 2022). Association for Computing Machinery (ACM). https://doi.org/10.1145/3511095.3531282

@inproceedings{966a5972ca66414a9556a2031a5582a2,

title = "Exploring the Feasibility of Crowd-Powered Decomposition of Complex User Questions in Text-to-SQL Tasks",

keywords = "Corpus Annotation, Crowdsourcing, Human Computation, Natural Language Interface to Databases, Semantic Parsing, Text-to-SQL",

author = "Sara Salimzadeh and Ujwal Gadiraju and Claudia Hauff and {Van Deursen}, Arie",

year = "2022",

doi = "10.1145/3511095.3531282",

language = "English",

series = "HT 2022: 33rd ACM Conference on Hypertext and Social Media - Co-located with ACM WebSci 2022 and ACM UMAP 2022",

publisher = "Association for Computing Machinery (ACM)",

pages = "154--165",

booktitle = "HT 2022",

address = "United States",

note = "33rd ACM Conference on Hypertext and Social Media, HT 2022 - Co-located with ACM WebSci 2022 and ACM UMAP 2022 ; Conference date: 28-06-2022 Through 01-07-2022",

}

TY - GEN

T1 - Exploring the Feasibility of Crowd-Powered Decomposition of Complex User Questions in Text-to-SQL Tasks

AU - Salimzadeh, Sara

AU - Gadiraju, Ujwal

AU - Hauff, Claudia

AU - Van Deursen, Arie

PY - 2022

Y1 - 2022

KW - Corpus Annotation

KW - Crowdsourcing

KW - Human Computation

KW - Natural Language Interface to Databases

KW - Semantic Parsing

KW - Text-to-SQL

UR - http://www.scopus.com/inward/record.url?scp=85134212201&partnerID=8YFLogxK

U2 - 10.1145/3511095.3531282

DO - 10.1145/3511095.3531282

M3 - Conference contribution

AN - SCOPUS:85134212201

T3 - HT 2022: 33rd ACM Conference on Hypertext and Social Media - Co-located with ACM WebSci 2022 and ACM UMAP 2022

SP - 154

EP - 165

BT - HT 2022

PB - Association for Computing Machinery (ACM)

T2 - 33rd ACM Conference on Hypertext and Social Media, HT 2022 - Co-located with ACM WebSci 2022 and ACM UMAP 2022

Y2 - 28 June 2022 through 1 July 2022

ER -
