Improving mathematics assessment readability: Do large language models help?

Nirmal Patel; Pooja Nagpal; Tirth Shah; Aditya Sharma; Shrey Malvi; Derek Lomas

doi:10.1111/jcal.12776

Corresponding author: Nirmal Patel

Design Aesthetics

Research output: Contribution to journal › Article › Scientific › peer-review

3 Citations (Scopus)

27 Downloads (Pure)

Abstract

Background: Readability metrics provide us with an objective and efficient way to assess the quality of educational texts. We can use the readability measures for finding assessment items that are difficult to read for a given grade level. Hard-to-read math word problems can put some students at a disadvantage if they are behind in their literacy learning. Despite their math abilities, these students can perform poorly on difficult-to-read word problems because of their poor reading skills. Less readable math tests can create equity issues for students who are relatively new to the language of assessment. Less readable test items can also affect the assessment's construct validity by partially measuring reading comprehension.

Objectives: This study shows how large language models help us improve the readability of math assessment items.

Methods: We analysed 250 test items from grades 3 to 5 of EngageNY, an open-source curriculum. We used the GPT-3 AI system to simplify the text of these math word problems. We used text prompts and the few-shot learning method for the simplification task.

Results and Conclusions: On average, GPT-3 AI produced output passages that showed improvements in readability metrics, but the outputs had a large amount of noise and were often unrelated to the input. We used thresholds over text similarity metrics and changes in readability measures to filter out the noise. We found meaningful simplifications that can be given to item authors as suggestions for improvement.

Takeaways: GPT-3 AI is capable of simplifying hard-to-read math word problems. The model generates noisy simplifications using text prompts or few-shot learning methods. The noise can be filtered using text similarity and readability measures. The meaningful simplifications AI produces are sound but not ready to be used as a direct replacement for the original items. To improve test quality, simplifications can be suggested to item authors at the time of digital question authoring.
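
The abstract describes filtering noisy model outputs with thresholds over text similarity and changes in readability, but it does not name the specific measures or cut-offs. The following Python snippet is only a minimal sketch of that idea under stated assumptions: it uses the Flesch-Kincaid grade level with a rough vowel-group syllable heuristic as the readability measure, a word-level `difflib` ratio as the similarity measure, and hypothetical thresholds (`min_similarity`, `min_grade_drop`). None of these choices are taken from the paper.

```python
import re
from difflib import SequenceMatcher


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level; higher means harder to read.

    Sentences are split on ., ! and ?, so decimals such as 3.5 would be
    miscounted. This is acceptable for a sketch only.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def word_similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; a stand-in for the paper's metrics."""
    tokens_a = re.findall(r"[a-z0-9']+", a.lower())
    tokens_b = re.findall(r"[a-z0-9']+", b.lower())
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()


def keep_simplification(original: str, candidate: str,
                        min_similarity: float = 0.3,   # hypothetical threshold
                        min_grade_drop: float = 0.5) -> bool:  # hypothetical threshold
    """Keep a candidate only if it stays related to the original item
    and lowers the estimated reading grade level by enough."""
    grade_drop = flesch_kincaid_grade(original) - flesch_kincaid_grade(candidate)
    return (word_similarity(original, candidate) >= min_similarity
            and grade_drop >= min_grade_drop)


original = "Calculate the remaining quantity of apples after Maria distributes five apples."
candidate = "How many apples are left after Maria gives away five apples?"
unrelated = "It is sunny today."

print(keep_simplification(original, candidate))  # related and easier to read
print(keep_simplification(original, unrelated))  # easier to read but unrelated
```

In a workflow like the one the abstract proposes, candidates that pass such a filter would still go to item authors as suggestions, not as direct replacements for the original items.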

Original language	English
Pages (from-to)	804-822
Number of pages	19
Journal	Journal of Computer Assisted Learning
Volume	39
Issue number	3
DOIs	https://doi.org/10.1111/jcal.12776
Publication status	Published - 2023

Bibliographical note

Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Keywords

GPT-3
mathematics assessment
readability
text simplification

Access to Document

10.1111/jcal.12776

Computer Assisted Learning - 2023 - Patel - Improving mathematics assessment readability Do large language models help
Final published version, 2.62 MB

Cite this

@article{2c99d72359034f39b2bffeb40daf3ec7,
title = "Improving mathematics assessment readability: Do large language models help?",
abstract = "Background: Readability metrics provide us with an objective and efficient way to assess the quality of educational texts. We can use the readability measures for finding assessment items that are difficult to read for a given grade level. Hard-to-read math word problems can put some students at a disadvantage if they are behind in their literacy learning. Despite their math abilities, these students can perform poorly on difficult-to-read word problems because of their poor reading skills. Less readable math tests can create equity issues for students who are relatively new to the language of assessment. Less readable test items can also affect the assessment's construct validity by partially measuring reading comprehension. Objectives: This study shows how large language models help us improve the readability of math assessment items. Methods: We analysed 250 test items from grades 3 to 5 of EngageNY, an open-source curriculum. We used the GPT-3 AI system to simplify the text of these math word problems. We used text prompts and the few-shot learning method for the simplification task. Results and Conclusions: On average, GPT-3 AI produced output passages that showed improvements in readability metrics, but the outputs had a large amount of noise and were often unrelated to the input. We used thresholds over text similarity metrics and changes in readability measures to filter out the noise. We found meaningful simplifications that can be given to item authors as suggestions for improvement. Takeaways: GPT-3 AI is capable of simplifying hard-to-read math word problems. The model generates noisy simplifications using text prompts or few-shot learning methods. The noise can be filtered using text similarity and readability measures. The meaningful simplifications AI produces are sound but not ready to be used as a direct replacement for the original items. 
To improve test quality, simplifications can be suggested to item authors at the time of digital question authoring.",
keywords = "GPT-3, mathematics assessment, readability, text simplification",
author = "Nirmal Patel and Pooja Nagpal and Tirth Shah and Aditya Sharma and Shrey Malvi and Derek Lomas",
note = "Green Open Access added to TU Delft Institutional Repository {\textquoteleft}You share, we take care!{\textquoteright} – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.",
year = "2023",
doi = "10.1111/jcal.12776",
language = "English",
volume = "39",
pages = "804--822",
journal = "Journal of Computer Assisted Learning",
issn = "0266-4909",
publisher = "Wiley-Blackwell",
number = "3",
}

TY - JOUR
T1 - Improving mathematics assessment readability
T2 - Do large language models help?
AU - Patel, Nirmal
AU - Nagpal, Pooja
AU - Shah, Tirth
AU - Sharma, Aditya
AU - Malvi, Shrey
AU - Lomas, Derek
N1 - Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.
PY - 2023
Y1 - 2023
AB - Background: Readability metrics provide us with an objective and efficient way to assess the quality of educational texts. We can use the readability measures for finding assessment items that are difficult to read for a given grade level. Hard-to-read math word problems can put some students at a disadvantage if they are behind in their literacy learning. Despite their math abilities, these students can perform poorly on difficult-to-read word problems because of their poor reading skills. Less readable math tests can create equity issues for students who are relatively new to the language of assessment. Less readable test items can also affect the assessment's construct validity by partially measuring reading comprehension. Objectives: This study shows how large language models help us improve the readability of math assessment items. Methods: We analysed 250 test items from grades 3 to 5 of EngageNY, an open-source curriculum. We used the GPT-3 AI system to simplify the text of these math word problems. We used text prompts and the few-shot learning method for the simplification task. Results and Conclusions: On average, GPT-3 AI produced output passages that showed improvements in readability metrics, but the outputs had a large amount of noise and were often unrelated to the input. We used thresholds over text similarity metrics and changes in readability measures to filter out the noise. We found meaningful simplifications that can be given to item authors as suggestions for improvement. Takeaways: GPT-3 AI is capable of simplifying hard-to-read math word problems. The model generates noisy simplifications using text prompts or few-shot learning methods. The noise can be filtered using text similarity and readability measures. The meaningful simplifications AI produces are sound but not ready to be used as a direct replacement for the original items. 
To improve test quality, simplifications can be suggested to item authors at the time of digital question authoring.
KW - GPT-3
KW - mathematics assessment
KW - readability
KW - text simplification
UR - http://www.scopus.com/inward/record.url?scp=85146646547&partnerID=8YFLogxK
U2 - 10.1111/jcal.12776
DO - 10.1111/jcal.12776
M3 - Article
AN - SCOPUS:85146646547
SN - 0266-4909
VL - 39
SP - 804
EP - 822
JO - Journal of Computer Assisted Learning
JF - Journal of Computer Assisted Learning
IS - 3
ER -
