Test Smells 20 Years Later: Detectability, Validity, and Reliability

A. Panichella; Sebastiano Panichella; Gordon Fraser; Anand Ashok Sawant; Vincent Hellendoorn

doi:10.1007/s10664-022-10207-5

Software Engineering

Research output: Contribution to journal › Article › Scientific › peer-review

7 Citations (Scopus)

175 Downloads (Pure)

Abstract

Test smells aim to capture design issues in test code that reduce its maintainability. They have been extensively studied and are generally found to be quite prevalent in both human-written and automatically generated test cases. However, most evidence of prevalence is based on specific static detection rules. Although those rules are based on the original, conceptual definitions of the various test smells, recent empirical studies indicate that developers perceive warnings raised by detection tools as overly strict and non-representative of the maintainability and quality of test suites.
This leads us to re-assess the accuracy of test smell detection tools and to investigate the prevalence and detectability of test smells more broadly.
Specifically, we construct a hand-annotated dataset spanning hundreds of test suites, both written by developers and generated by two test generation tools (EvoSuite and JTExpert), and perform a multi-stage, cross-validated manual analysis to identify the presence of six types of test smells in them. We then use this manual labeling to benchmark the performance and external validity of two test smell detection tools: one widely used in prior work and one recently introduced with the express goal of matching developer perceptions of test smells.
Our results primarily show that the current vocabulary of test smells is highly mismatched to real concerns: multiple smells were ubiquitous in developer-written tests but virtually never correlated with semantic or maintainability flaws; machine-generated tests often scored better but, in reality, suffered from a host of problems not well captured by current test smells. Current test smell detection strategies poorly characterized the issues in these automatically generated test suites; in particular, the older tool's detection strategies misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as-yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts.
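
The abstract does not state how the misclassification figures were computed. As a rough illustration only, the following Python sketch shows one way a detection tool's output could be tallied against manual labels for a single smell type, separating false positives from false negatives. The `Confusion` class, the `compare` function, and the toy data are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # tool flags a smell the annotators confirmed
    fp: int  # tool flags a smell the annotators rejected (false positive)
    fn: int  # annotators found a smell the tool missed (false negative)
    tn: int  # both agree the test is smell-free

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        actual = self.tp + self.fn
        return self.tp / actual if actual else 0.0


def compare(manual: dict[str, bool], tool: dict[str, bool]) -> Confusion:
    """Tally agreement between manual labels and tool output for one smell type."""
    tp = fp = fn = tn = 0
    for test_id, is_smelly in manual.items():
        # Tests the tool did not report are treated as unflagged (an assumption).
        flagged = tool.get(test_id, False)
        if flagged and is_smelly:
            tp += 1
        elif flagged:
            fp += 1
        elif is_smelly:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


# Toy data, invented for illustration only (not results from the paper).
manual_labels = {"t1": True, "t2": False, "t3": False, "t4": True, "t5": False}
tool_flags = {"t1": True, "t2": True, "t3": True, "t4": False, "t5": False}

result = compare(manual_labels, tool_flags)
print(result, f"precision={result.precision:.2f}", f"recall={result.recall:.2f}")
```

Keeping false positives and false negatives as separate counts, rather than a single accuracy number, mirrors the distinction the abstract draws between missed smells and smell-free tests marked as smelly.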

Original language	English
Article number	170
Journal	Empirical Software Engineering
Volume	27
Issue number	7
DOIs	https://doi.org/10.1007/s10664-022-10207-5
Publication status	Published - 2022

Keywords

Test Case Generation
Test Smells
Software Testing
software maintenance
Software Quality

Access to Document

10.1007/s10664-022-10207-5

s10664-022-10207-5: Final published version, 2.88 MB. Licence: CC BY

Cite this

@article{7891403ee7bf4f8c8ab62687fcaac106,

title = "Test Smells 20 Years Later: Detectability, Validity, and Reliability",

abstract = "Test smells aim to capture design issues in test code that reduce its maintainability. They have been extensively studied and are generally found to be quite prevalent in both human-written and automatically generated test cases. However, most evidence of prevalence is based on specific static detection rules. Although those rules are based on the original, conceptual definitions of the various test smells, recent empirical studies indicate that developers perceive warnings raised by detection tools as overly strict and non-representative of the maintainability and quality of test suites. This leads us to re-assess the accuracy of test smell detection tools and to investigate the prevalence and detectability of test smells more broadly. Specifically, we construct a hand-annotated dataset spanning hundreds of test suites, both written by developers and generated by two test generation tools (EvoSuite and JTExpert), and perform a multi-stage, cross-validated manual analysis to identify the presence of six types of test smells in them. We then use this manual labeling to benchmark the performance and external validity of two test smell detection tools: one widely used in prior work and one recently introduced with the express goal of matching developer perceptions of test smells. Our results primarily show that the current vocabulary of test smells is highly mismatched to real concerns: multiple smells were ubiquitous in developer-written tests but virtually never correlated with semantic or maintainability flaws; machine-generated tests often scored better but, in reality, suffered from a host of problems not well captured by current test smells. Current test smell detection strategies poorly characterized the issues in these automatically generated test suites; in particular, the older tool's detection strategies misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as-yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts.",

keywords = "Test Case Generation, Test Smells, Software Testing, software maintenance, Software Quality",

author = "A. Panichella and Sebastiano Panichella and Gordon Fraser and Sawant, {Anand Ashok} and Vincent Hellendoorn",

year = "2022",

doi = "10.1007/s10664-022-10207-5",

language = "English",

volume = "27",

journal = "Empirical Software Engineering",

issn = "1382-3256",

publisher = "Springer",

number = "7",

}

TY - JOUR

T1 - Test Smells 20 Years Later: Detectability, Validity, and Reliability

AU - Panichella, A.

AU - Panichella, Sebastiano

AU - Fraser, Gordon

AU - Sawant, Anand Ashok

AU - Hellendoorn, Vincent

PY - 2022

Y1 - 2022

N2 - Test smells aim to capture design issues in test code that reduce its maintainability. They have been extensively studied and are generally found to be quite prevalent in both human-written and automatically generated test cases. However, most evidence of prevalence is based on specific static detection rules. Although those rules are based on the original, conceptual definitions of the various test smells, recent empirical studies indicate that developers perceive warnings raised by detection tools as overly strict and non-representative of the maintainability and quality of test suites. This leads us to re-assess the accuracy of test smell detection tools and to investigate the prevalence and detectability of test smells more broadly. Specifically, we construct a hand-annotated dataset spanning hundreds of test suites, both written by developers and generated by two test generation tools (EvoSuite and JTExpert), and perform a multi-stage, cross-validated manual analysis to identify the presence of six types of test smells in them. We then use this manual labeling to benchmark the performance and external validity of two test smell detection tools: one widely used in prior work and one recently introduced with the express goal of matching developer perceptions of test smells. Our results primarily show that the current vocabulary of test smells is highly mismatched to real concerns: multiple smells were ubiquitous in developer-written tests but virtually never correlated with semantic or maintainability flaws; machine-generated tests often scored better but, in reality, suffered from a host of problems not well captured by current test smells. Current test smell detection strategies poorly characterized the issues in these automatically generated test suites; in particular, the older tool's detection strategies misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as-yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts.

KW - Test Case Generation

KW - Test Smells

KW - Software Testing

KW - software maintenance

KW - Software Quality

UR - http://www.scopus.com/inward/record.url?scp=85137672681&partnerID=8YFLogxK

U2 - 10.1007/s10664-022-10207-5

DO - 10.1007/s10664-022-10207-5

M3 - Article

SN - 1382-3256

VL - 27

JO - Empirical Software Engineering

JF - Empirical Software Engineering

IS - 7

M1 - 170

ER -
