TY - JOUR
T1 - Evaluating the generalizability and transferability of water distribution deterioration models
AU - Daulat, Shamsuddin
AU - Rokstad, Marius Møller
AU - Bruaset, Stian
AU - Langeveld, Jeroen
AU - Tscheikner-Gratl, Franz
PY - 2023
Y1 - 2023
N2 - Small utilities often lack the amount of data required to train machine learning-based models to predict pipe failures, and hence are unable to harness the predictive power of machine learning. This study evaluates the generalizability and transferability of a machine learning model to see if small utilities can benefit from the data and models of other utilities. Using nine Norwegian utilities’ datasets, we trained nine global models (by merging multiple datasets) and nine local models (by utilizing each utility's dataset) using random survival forests. Several pre-processing techniques, including ones that address left-truncated break data and break data scarcity, are also presented. The global models and three of the local models were tested on predicting pipe failures at utilities that were not included in their training datasets. The results indicate that the global models can predict other utilities' pipe failures with sufficient accuracy, while local models have some limitations. However, if a representative utility with a sufficiently large (and information-rich) dataset is selected, its model can predict another utility's pipe breaks as accurately as the global models. Furthermore, survival curves for defined cohorts, used as proxies for uncertainty, and variable importance show that pipes with and without previous breaks behave very differently. With an understanding of the models’ generalizability and transferability, small utilities can benefit from the data and models of other utilities.
AB - Small utilities often lack the amount of data required to train machine learning-based models to predict pipe failures, and hence are unable to harness the predictive power of machine learning. This study evaluates the generalizability and transferability of a machine learning model to see if small utilities can benefit from the data and models of other utilities. Using nine Norwegian utilities’ datasets, we trained nine global models (by merging multiple datasets) and nine local models (by utilizing each utility's dataset) using random survival forests. Several pre-processing techniques, including ones that address left-truncated break data and break data scarcity, are also presented. The global models and three of the local models were tested on predicting pipe failures at utilities that were not included in their training datasets. The results indicate that the global models can predict other utilities' pipe failures with sufficient accuracy, while local models have some limitations. However, if a representative utility with a sufficiently large (and information-rich) dataset is selected, its model can predict another utility's pipe breaks as accurately as the global models. Furthermore, survival curves for defined cohorts, used as proxies for uncertainty, and variable importance show that pipes with and without previous breaks behave very differently. With an understanding of the models’ generalizability and transferability, small utilities can benefit from the data and models of other utilities.
KW - Data preprocessing
KW - Random survival forests
KW - Survival functions
KW - Uncertainties
KW - Variable importance
UR - http://www.scopus.com/inward/record.url?scp=85170414328&partnerID=8YFLogxK
U2 - 10.1016/j.ress.2023.109611
DO - 10.1016/j.ress.2023.109611
M3 - Article
AN - SCOPUS:85170414328
SN - 0951-8320
VL - 241
JO - Reliability Engineering and System Safety
JF - Reliability Engineering and System Safety
M1 - 109611
ER -