Diabetes Prediction Using the SMOTE-CART Framework Model for Imbalanced Data Case

Farah Najidah Noorizan; Nur Anida Jumadi; Muhamad Amir Irfan Roslan; Li Mun Ng; Manveer Pal Singh; Yukihiro Ishida

Authors

Farah Najidah Noorizan Universiti Tun Hussein Onn Malaysia
Nur Anida Jumadi Universiti Tun Hussein Onn Malaysia
Muhamad Amir Irfan Roslan Universiti Tun Hussein Onn Malaysia
Ng Li Mun Universiti Tun Hussein Onn Malaysia
Manveer Pal Singh Putra Specialist Hospital Batu Pahat
Yukihiro Ishida SECOND HEART Inc.

Keywords:

Diabetes Mellitus, Synthetic Minority Oversampling Technique, Classification and Regression Tree, Hyperparameter Tuning, Evaluation Metrics

Abstract

Diabetes mellitus (DM) is characterized by chronically high blood glucose levels, which can result in long-term damage, dysfunction, and organ failure. Owing to technological advancements, many researchers now employ machine learning to predict diabetes, collecting patients’ demographic and health information and organizing it into a dataset. However, in most real-world data, non-diabetic cases outnumber diabetic cases, which biases a model toward the majority class and results in poor prediction of diabetic cases. Therefore, the Synthetic Minority Oversampling Technique (SMOTE) is applied to the dataset samples before training the Classification and Regression Tree (CART) model, with the aim of improving the prediction of diabetic cases. The proposed framework comprises a preprocessing step (SMOTE and categorical conversion), CART training, hyperparameter tuning, and evaluation. With a setting of 8 for the number of leaves per node, a maximum of 10 splits, and deviance as the split criterion, the model achieved an overall accuracy of 98.72%, a precision of 98.94%, a sensitivity of 98.44%, and an F1-score of 98.67%. In conclusion, the proposed SMOTE-CART framework can effectively address class imbalance in a diabetes dataset and improve the accuracy of diabetes prediction.
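
To make the two ends of the framework more concrete, the following is a minimal pure-Python sketch of SMOTE-style oversampling and of the four reported evaluation metrics. It is not the authors' implementation: the function names `smote` and `binary_metrics`, the neighbour count `k`, and the two-feature sample values are all hypothetical choices made for illustration, and CART training and hyperparameter tuning are left out.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (illustrative sketch).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, sensitivity (recall) and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Hypothetical minority-class samples: [glucose, BMI] (made-up values).
minority = [[148.0, 33.6], [183.0, 23.3], [168.0, 38.0], [197.0, 30.5]]
synthetic = smote(minority, n_new=6, k=3)

# Hypothetical labels and predictions (1 = diabetic, 0 = non-diabetic).
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
```

In this sketch the synthetic samples are generated only from the minority (diabetic) class, so appending them to the training set moves the class ratio toward balance before a tree is fitted. Oversampling is normally restricted to the training split so that the evaluation data contains only real cases.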