TY - JOUR
T1 - Rethinking and recomputing the value of machine learning models
AU - Sayin, Burcu
AU - Yang, Jie
AU - Chen, Xinyue
AU - Passerini, Andrea
AU - Casati, Fabio
PY - 2025
AB - In this paper, we argue that the prevailing approach to training and evaluating machine learning models often fails to consider their real-world application within organizational or societal contexts, where they are intended to create beneficial value for people. We propose a shift in perspective, redefining model assessment and selection to emphasize integration into workflows that combine machine predictions with human expertise, particularly in scenarios requiring human intervention for low-confidence predictions. Traditional metrics such as accuracy and the F-score fail to capture the beneficial value of models in such hybrid settings. To address this, we introduce a simple yet theoretically sound “value” metric that incorporates task-specific costs for correct predictions, errors, and rejections, offering a practical framework for real-world evaluation. Through extensive experiments, we show that existing metrics fail to capture real-world needs and often lead to suboptimal choices in terms of value when used to rank classifiers. Furthermore, we emphasize the critical role of calibration in determining model value, showing that simple, well-calibrated models can often outperform more complex models that are challenging to calibrate.
KW - Cost-sensitive learning
KW - Hybrid intelligence
KW - Machine learning
KW - Selective classification
UR - http://www.scopus.com/inward/record.url?scp=105004425830&partnerID=8YFLogxK
DO - 10.1007/s10462-025-11242-6
M3 - Article
AN - SCOPUS:105004425830
SN - 0269-2821
VL - 58
JO - Artificial Intelligence Review
JF - Artificial Intelligence Review
IS - 8
M1 - 238
ER -