Predicting and analyzing injury severity: A machine learning-based approach using class-imbalanced proactive and reactive data

Sobhan Sarkar, Anima Pramanik, J. Maiti, Genserik Reniers

Research output: Contribution to journal › Article › Scientific › peer-review

13 Citations (Scopus)


Although the utility of machine learning (ML) techniques is established in the occupational accident domain using reactive data, their use in predicting injury severity from both reactive and proactive data is new. This necessitates an investigation of the significance of both types of data in predicting injury severity with ML techniques. In addition, unstructured text and class imbalance in the data often make analysis difficult. To address these issues, two types of data collected from a steel plant are used in this study: investigation reports (i.e., reactive data) and inspection reports (i.e., proactive data). The two datasets are merged to generate a mixed dataset. Topic modeling is used to handle the unstructured text. Four oversampling algorithms, namely Synthetic Minority Over-sampling Technique (SMOTE), borderline SMOTE (BLSMOTE), Majority Weighted Minority Oversampling Technique (MWMOTE), and k-means SMOTE (KMSMOTE), are used separately to handle the class imbalance. Thereafter, six prediction algorithms, namely support vector machine, artificial neural network, Naïve Bayes, k-nearest neighbour, classification and regression tree analysis, and random forest, are applied to the reactive and mixed datasets separately for injury severity prediction. The results reveal that KMSMOTE performs better than the other algorithms in balancing the datasets and therefore helps achieve higher prediction performance in terms of average recall, F1-score, and geometric mean. It is also shown statistically that injury severity prediction performance is significantly higher with the mixed dataset than with the reactive dataset alone. Finally, a set of 19 crisp safety decision rules is generated using the tolerance rough set approach (TRSA); these rules can explain the factors responsible for the injury severity outcomes, namely ‘Fatal’, ‘Medical case’, and ‘First-aid’.
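As a rough illustration of two ideas named in the abstract, the sketch below shows the basic SMOTE interpolation step and a geometric mean computed over per-class recalls. It is a minimal standard-library sketch, not the authors' pipeline: the function names, the toy coordinates, and the toy label lists are hypothetical, and defining the geometric mean as the product of per-class recalls raised to 1/(number of classes) is an assumption about how the metric extends to three classes.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points (basic SMOTE idea).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted / actual members."""
    recalls = {}
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(
            1 for t, p in zip(y_true, y_pred) if t == label and p == label
        )
        recalls[label] = hits / total
    return recalls


def geometric_mean(values):
    """Geometric mean; drops to 0 if any class is never recalled."""
    values = list(values)
    return math.prod(values) ** (1.0 / len(values))


# Toy data only (hypothetical, not from the study).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_sketch(minority_points, n_new=5)

y_true = ["Fatal", "Fatal", "Medical case", "Medical case",
          "First-aid", "First-aid"]
y_pred = ["Fatal", "Medical case", "Medical case", "Medical case",
          "First-aid", "First-aid"]
recalls = per_class_recall(y_true, y_pred)
g_mean = geometric_mean(recalls.values())
```

Unlike overall accuracy, the geometric mean penalises a classifier that ignores a rare class such as ‘Fatal’, which is one reason it is commonly reported alongside recall and F1-score for imbalanced data.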

Original language: English
Article number: 104616
Journal: Safety Science
Publication status: Published - 2020


  • Class-imbalance
  • Classification algorithms
  • Injury severity prediction
  • Oversampling techniques
  • Reactive and proactive data
  • Tolerance rough set approach (TRSA)

