TY - JOUR
T1 - Predicting and analyzing injury severity
T2 - A machine learning-based approach using class-imbalanced proactive and reactive data
AU - Sarkar, Sobhan
AU - Pramanik, Anima
AU - Maiti, J.
AU - Reniers, Genserik
PY - 2020
Y1 - 2020
AB - Although the utility of machine learning (ML) techniques is established in the occupational accident domain using reactive data, their use in predicting injury severity from both reactive and proactive data is new. This necessitates investigating the significance of both types of data in predicting injury severity with ML techniques. In addition, unstructured text and class imbalance in the data often make analysis difficult. To address these issues, two types of data collected from a steel plant are used in this study: investigation reports (i.e., reactive data) and inspection reports (i.e., proactive data). The two datasets are merged to generate a mixed dataset. Topic modeling is used to handle the unstructured text. Four oversampling algorithms, namely Synthetic Minority Over-sampling Technique (SMOTE), borderline SMOTE (BLSMOTE), Majority Weighted Minority Oversampling Technique (MWMOTE), and k-means SMOTE (KMSMOTE), are applied separately to handle the class imbalance. Thereafter, six prediction algorithms, namely support vector machine, artificial neural network, Naïve Bayes, k-nearest neighbour, classification and regression tree analysis, and random forest, are applied to the reactive and mixed datasets separately for injury severity prediction. The results reveal that KMSMOTE balances the datasets better than the other algorithms and therefore helps achieve higher prediction performance in terms of average recall, F1-score, and geometric mean. It is also shown statistically that injury severity prediction performance is significantly higher with the mixed dataset than with the reactive dataset alone. Finally, a set of 19 crisp safety decision rules is generated using the tolerance rough set approach (TRSA); these rules can explain the factors responsible for the injury severity outcomes, namely ‘Fatal’, ‘Medical case’, and ‘First-aid’.
KW - Class-imbalance
KW - Classification algorithms
KW - Injury severity prediction
KW - Oversampling techniques
KW - Reactive and proactive data
KW - Tolerance rough set approach (TRSA)
UR - http://www.scopus.com/inward/record.url?scp=85079128051&partnerID=8YFLogxK
U2 - 10.1016/j.ssci.2020.104616
DO - 10.1016/j.ssci.2020.104616
M3 - Article
AN - SCOPUS:85079128051
SN - 0925-7535
VL - 125
JO - Safety Science
JF - Safety Science
M1 - 104616
ER -