Predicting and analyzing injury severity: A machine learning-based approach using class-imbalanced proactive and reactive data

Sobhan Sarkar; Anima Pramanik; J. Maiti; Genserik Reniers

doi:10.1016/j.ssci.2020.104616

Sobhan Sarkar^*, Anima Pramanik, J. Maiti, Genserik Reniers

^*Corresponding author for this work

Safety and Security Science

Research output: Contribution to journal › Article › Scientific › peer-review

71 Citations (Scopus)

Abstract

Although the utility of the machine learning (ML) techniques is established in occupational accident domain using reactive data, its exploration in predicting injury severity using both reactive and proactive data is new. This necessitates the investigation of the significance of both types of data in prediction of injury severity using ML techniques. In addition, the unstructured texts, and class-imbalance in data often create difficulty in analysis. Therefore, to address the above-mentioned issues, two types of data, namely investigation report (i.e., reactive data) and inspection report (i.e., proactive data), collected from a steel plant, are used in this study. The datasets are merged together for generating mixed dataset. Topic modeling is used to handle the unstructured texts. A total of four oversampling algorithms, namely Synthetic Minority Over-sampling Technique (SMOTE), borderline SMOTE (BLSMOTE), Majority Weighted Minority Oversampling Technique (MWMOTE), and k-means SMOTE (KMSMOTE) have been used separately to handle the class imbalance issue. Thereafter, a set of six prediction algorithms, namely support vector machine, artificial neural network, Naïve Bayes, k-nearest neighbour, classification and regression tree analysis, and random forest have been used on reactive and mixed datasets separately for injury severity prediction. The results reveal that KMSMOTE performs better than others in balancing datasets and therefore, helps in achieving higher prediction in terms of average recall, F1-score and geometric mean. In addition, it is also statistically shown that prediction of injury severity is significantly higher using mixed dataset than reactive dataset only. Finally, a set of 19 crisp safety decision rules are generated using tolerance rough set approach (TRSA), which can explain the factors responsible for injury severity outcomes, namely ‘Fatal’, ‘Medical case’, and ‘First-aid’.
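
As a rough illustration of two ideas named in the abstract, the sketch below shows the interpolation step at the core of SMOTE-style oversampling, and the average recall and geometric mean computed from per-class recall. It is not the authors' implementation: the function names, the toy coordinates, and the toy label vectors are all invented for illustration, and the geometric mean is assumed here to be the geometric mean of per-class recalls, which may differ from the definition used in the paper.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> nb
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic


def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted / actual members."""
    recalls = {}
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = hits / total
    return recalls


def average_recall(recalls):
    """Unweighted (macro) mean of the per-class recalls."""
    return sum(recalls.values()) / len(recalls)


def geometric_mean(recalls):
    """Geometric mean of per-class recalls (assumed definition)."""
    return math.prod(recalls.values()) ** (1 / len(recalls))


# Toy minority-class points (hypothetical feature vectors).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority_points, n_new=5)

# Toy labels (hypothetical, not results from the study).
y_true = ["First-aid"] * 4 + ["Medical case"] * 4 + ["Fatal"] * 2
y_pred = ["First-aid"] * 4 + ["Medical case"] * 2 + ["First-aid"] * 2 + ["Fatal", "Medical case"]

recalls = per_class_recall(y_true, y_pred)
print(recalls, average_recall(recalls), geometric_mean(recalls))
```

Because the geometric mean multiplies the per-class recalls, a classifier that misses the rare ‘Fatal’ class entirely scores zero regardless of how well it handles the majority class, which is one reason this metric is commonly reported for class-imbalanced data.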

Original language	English
Article number	104616
Journal	Safety Science
Volume	125
DOIs	https://doi.org/10.1016/j.ssci.2020.104616
Publication status	Published - 2020

Keywords

Class-imbalance
Classification algorithms
Injury severity prediction
Oversampling techniques
Reactive and proactive data
Tolerance rough set approach (TRSA)

Cite this

@article{1643aade66514223b4155dbdb2ab476d,
  title = "Predicting and analyzing injury severity: A machine learning-based approach using class-imbalanced proactive and reactive data",
  abstract = "Although the utility of the machine learning (ML) techniques is established in occupational accident domain using reactive data, its exploration in predicting injury severity using both reactive and proactive data is new. This necessitates the investigation of the significance of both types of data in prediction of injury severity using ML techniques. In addition, the unstructured texts, and class-imbalance in data often create difficulty in analysis. Therefore, to address the above-mentioned issues, two types of data, namely investigation report (i.e., reactive data) and inspection report (i.e., proactive data), collected from a steel plant, are used in this study. The datasets are merged together for generating mixed dataset. Topic modeling is used to handle the unstructured texts. A total of four oversampling algorithms, namely Synthetic Minority Over-sampling Technique (SMOTE), borderline SMOTE (BLSMOTE), Majority Weighted Minority Oversampling Technique (MWMOTE), and k-means SMOTE (KMSMOTE) have been used separately to handle the class imbalance issue. Thereafter, a set of six prediction algorithms, namely support vector machine, artificial neural network, Na{\"i}ve Bayes, k-nearest neighbour, classification and regression tree analysis, and random forest have been used on reactive and mixed datasets separately for injury severity prediction. The results reveal that KMSMOTE performs better than others in balancing datasets and therefore, helps in achieving higher prediction in terms of average recall, F1-score and geometric mean. In addition, it is also statistically shown that prediction of injury severity is significantly higher using mixed dataset than reactive dataset only. Finally, a set of 19 crisp safety decision rules are generated using tolerance rough set approach (TRSA), which can explain the factors responsible for injury severity outcomes, namely {\textquoteleft}Fatal{\textquoteright}, {\textquoteleft}Medical case{\textquoteright}, and {\textquoteleft}First-aid{\textquoteright}.",
  keywords = "Class-imbalance, Classification algorithms, Injury severity prediction, Oversampling techniques, Reactive and proactive data, Tolerance rough set approach (TRSA)",
  author = "Sobhan Sarkar and Anima Pramanik and J. Maiti and Genserik Reniers",
  year = "2020",
  doi = "10.1016/j.ssci.2020.104616",
  language = "English",
  volume = "125",
  journal = "Safety Science",
  issn = "0925-7535",
  publisher = "Elsevier",
}

TY  - JOUR
T1  - Predicting and analyzing injury severity
T2  - A machine learning-based approach using class-imbalanced proactive and reactive data
AU  - Sarkar, Sobhan
AU  - Pramanik, Anima
AU  - Maiti, J.
AU  - Reniers, Genserik
PY  - 2020
Y1  - 2020
AB  - Although the utility of the machine learning (ML) techniques is established in occupational accident domain using reactive data, its exploration in predicting injury severity using both reactive and proactive data is new. This necessitates the investigation of the significance of both types of data in prediction of injury severity using ML techniques. In addition, the unstructured texts, and class-imbalance in data often create difficulty in analysis. Therefore, to address the above-mentioned issues, two types of data, namely investigation report (i.e., reactive data) and inspection report (i.e., proactive data), collected from a steel plant, are used in this study. The datasets are merged together for generating mixed dataset. Topic modeling is used to handle the unstructured texts. A total of four oversampling algorithms, namely Synthetic Minority Over-sampling Technique (SMOTE), borderline SMOTE (BLSMOTE), Majority Weighted Minority Oversampling Technique (MWMOTE), and k-means SMOTE (KMSMOTE) have been used separately to handle the class imbalance issue. Thereafter, a set of six prediction algorithms, namely support vector machine, artificial neural network, Naïve Bayes, k-nearest neighbour, classification and regression tree analysis, and random forest have been used on reactive and mixed datasets separately for injury severity prediction. The results reveal that KMSMOTE performs better than others in balancing datasets and therefore, helps in achieving higher prediction in terms of average recall, F1-score and geometric mean. In addition, it is also statistically shown that prediction of injury severity is significantly higher using mixed dataset than reactive dataset only. Finally, a set of 19 crisp safety decision rules are generated using tolerance rough set approach (TRSA), which can explain the factors responsible for injury severity outcomes, namely ‘Fatal’, ‘Medical case’, and ‘First-aid’.
KW  - Class-imbalance
KW  - Classification algorithms
KW  - Injury severity prediction
KW  - Oversampling techniques
KW  - Reactive and proactive data
KW  - Tolerance rough set approach (TRSA)
UR  - http://www.scopus.com/inward/record.url?scp=85079128051&partnerID=8YFLogxK
U2  - 10.1016/j.ssci.2020.104616
DO  - 10.1016/j.ssci.2020.104616
M3  - Article
AN  - SCOPUS:85079128051
SN  - 0925-7535
VL  - 125
JO  - Safety Science
JF  - Safety Science
M1  - 104616
ER  - 
