Diabetes type 2 classification using machine learning algorithms with up-sampling technique

Recently, the prevalence of chronic diabetes has increased considerably. Diabetes raises blood sugar and causes other problems such as blurred vision, kidney failure, nerve problems, and stroke. Researchers have constructed various models for predicting diabetes. In this paper, gradient boosting classifier, AdaBoost classifier, decision tree classifier, and extra trees classifier (ETC) machine learning models are utilized for identifying chronic diabetes disease. The models analyze the PIMA Indian Diabetes (PIMA) and Behavioral Risk Factor Surveillance System (BRFSS) diabetes datasets to classify patients as having a positive or negative diagnosis. 80% of each dataset is used as training data and 20% as testing data. The extra trees classifier, with an area under the curve of 0.96 for PIMA and 0.99 for BRFSS, outperformed the other models. Therefore, it is suggested that healthcare providers can use the ETC model to predict chronic disease.


Introduction
Diabetes is a widespread disease that occurs in patients who lack sufficient insulin, the hormone that controls human blood sugar [1,2]. Blood sugar that stays elevated over time without control leads to serious health problems such as lower limb amputation, blindness, and heart attacks [1][2][3]. For 2019, [3] estimated 1.9 million deaths due to diabetes, and it is a leading cause of death worldwide. In early diagnosis, doctors assess diabetes based on their own knowledge, which can sometimes be inaccurate. Healthcare providers collect large amounts of data that cannot readily be used to make effective decisions about diabetes [4]. Therefore, predicting and measuring the risk of diabetes using computer-based models can substantially reduce healthcare costs [5].
Numerous studies have been devoted to modeling different diseases, including diabetes. Most of them trained their models on features such as pregnancies, gender, age, and BMI [6][7][8].
Lu et al. [5] utilized support vector machine, logistic regression, K-nearest neighbors, Naïve Bayes, decision tree, random forest (RF), and XGBoost machine learning models, together with an artificial neural network deep learning model, for predicting diabetes. They stated that RF was the best model for predicting type 2 diabetes, with an accuracy of 91%. Various machine learning techniques were evaluated by [6] for classifying diabetes using the PIMA diabetes dataset; linear discriminant analysis, with an accuracy of 77%, was selected as the best model among them. Artificial neural networks, ontology classifiers, K-nearest neighbors, support vector machine, Naïve Bayes, decision tree, and logistic regression were utilized to classify diabetes in [9], where the ontology classifier, with an accuracy of 77.5%, was nominated as the best classification model. Farajollahi et al. [10] compared the performance of XGBoost, decision tree, random forest, AdaBoost, support vector machine, and logistic regression for diabetes diagnosis. They reported that AdaBoost had the highest accuracy, 83%, among these models. Support vector machine (SVM) and artificial neural network (ANN) models were used by [11] for predicting the diagnosis of diabetes. Their model's accuracy was 94.87%, higher than that of the other published works. RF and SVM algorithms were compared by [12] for diabetes prediction using feature selection and dimensionality reduction, with accuracies of 81.4% and 83% for the models used. Other diabetes datasets have also been utilized for classifying diabetes, such as the Behavioral Risk Factor Surveillance System (BRFSS) dataset [13], which is considered by [14][15][16]. Overall, the research above applied a wide range of machine learning and deep learning models to diabetes prediction, with differing accuracies.
Hama Saeed Journal of Electrical Systems and Inf Technol (2023) 10:8
The primary intent of this paper is to select the best machine learning model among four models that are rarely or never used in the literature for diabetes classification. The PIMA and BRFSS datasets are studied in this paper using the decision tree (DTC), AdaBoost, gradient boosting (GBC), and extra trees (ETC) machine learning classifiers, and the classifiers are evaluated systematically. As a result of the work, the extra trees classifier provides a superior area under the ROC curve of 0.96 for PIMA and 0.99 for BRFSS compared with the other algorithms used for diabetes classification. The rest of this paper is organized as follows: materials and methods are explained in part two, results and discussion are presented in part three, and part four summarizes the significant conclusions.

Materials and methods
The proposed framework is illustrated in Fig. 1 and is divided into five phases. Jupyter Notebook with Python is used for the entire implementation of the model. The PIMA dataset has been analyzed using the scikit-learn (Sklearn), Matplotlib, Pandas, and NumPy packages.

PIMA dataset
PIMA is a popular dataset in diabetes disease classification. It comes from the NIDDK Institute [17]. It includes the medical predictors blood pressure, pregnancies, BMI, skin thickness, diabetes pedigree function, insulin, age, and glucose, together with an outcome label. The label indicates whether a patient has a positive or negative diagnosis of diabetes. The dataset initially includes 769 imbalanced samples, and 15 rows of the PIMA dataset are presented in Table 1.

BRFSS dataset
The diabetes dataset from the Behavioral Risk Factor Surveillance System (BRFSS) is gathered by the Centers for Disease Control and Prevention (CDC) [13]. It includes 253,680 samples with 21 feature variables such as smoker, stroke, heart disease or attack, physical activity, fruits, sex, education, and income. Fifteen rows of the dataset are presented in Table 2.

Preprocessing
The datasets should be preprocessed before they are passed to the classifiers. The outliers are removed from the datasets. The outcome and diabetes labels of the PIMA and BRFSS datasets are not balanced, and imbalanced data decrease the accuracy of the classifiers. To mitigate this, the up-sampling technique [18] has been used to balance both datasets. After that, the datasets are randomly split into 80% training data and 20% testing data using the train_test_split function.
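The balancing and splitting steps described above can be sketched as follows. This is a minimal illustration rather than the paper's actual code: the toy data frame, the helper name `upsample_minority`, and the random seeds are assumptions, and the outlier-removal step is omitted because the paper does not specify its method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def upsample_minority(df, label_col, random_state=0):
    """Hypothetical helper: resample each smaller class, with replacement,
    until it matches the size of the largest class."""
    counts = df[label_col].value_counts()
    majority_size = counts.max()
    parts = []
    for label, count in counts.items():
        group = df[df[label_col] == label]
        if count < majority_size:
            group = resample(group, replace=True, n_samples=majority_size,
                             random_state=random_state)
        parts.append(group)
    # Shuffle so the classes are interleaved again.
    balanced = pd.concat(parts).sample(frac=1.0, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy stand-in for an imbalanced dataset (values are synthetic, not PIMA data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 100),
    "BMI": rng.normal(32, 6, 100),
    "Outcome": [1] * 35 + [0] * 65,
})

balanced = upsample_minority(toy, "Outcome")
X = balanced.drop(columns="Outcome")
y = balanced["Outcome"]

# 80% training data, 20% testing data, chosen at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(balanced["Outcome"].value_counts().to_dict(), len(X_train), len(X_test))
```

After up-sampling, both classes in the toy data contain 65 rows, and the split yields 104 training rows and 26 testing rows.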

Machine learning classification
In this work, the gradient boosting classifier (GBC), decision tree classifier (DTC), extra trees classifier (ETC), and AdaBoost classifier (ABC) machine learning algorithms for classification problems are examined. These models have been selected because they have recorded the highest accuracies and need less computing power than other machine and deep learning models. The extra trees classifier is chosen because it predicted diabetes well, with an area under the ROC curve (AUC) of 0.96 for PIMA and 0.99 for BRFSS, compared to the DTC, GBC, and ABC.
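A comparison of the four classifiers might be set up as in the sketch below. It is only an illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not PIMA or BRFSS), and the classifiers use scikit-learn's default hyperparameters because the paper does not report its settings, so the printed scores will not match the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a diabetes dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The four classifiers examined in the paper, with default settings (assumed).
models = {
    "DTC": DecisionTreeClassifier(random_state=0),
    "ABC": AdaBoostClassifier(random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
    "ETC": ExtraTreesClassifier(random_state=0),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class is used to compute the ROC AUC.
    positive_prob = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, positive_prob)
    print(f"{name}: ROC AUC = {auc_scores[name]:.3f}")
```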

Decision tree classifier (DTC)
The decision tree is a powerful and versatile machine learning algorithm that can perform regression, classification, and multioutput tasks. It serves as the basic building block of ensemble learning models and can fit large and complex datasets. It uses a tree structure and evaluates feature values along a path from the root to the last node [5,19,20]. This algorithm is used in this paper, and it recorded a ROC AUC of 0.78 for the PIMA and 0.92 for the BRFSS datasets.

AdaBoost classifier (ABC)
AdaBoost is a popular learning algorithm used to enhance the performance of machine learning models. Its first base classifier is trained with an initial weight of Xi = 1/n for every training sample, where n is the number of training instances and Xi denotes the weight of the i-th training sample [21]. Each subsequent classifier depends on the performance of the former classifier. The final classifier is produced after the training of the base classifiers [21]. This model recorded a ROC AUC of 0.83 for the PIMA and 0.82 for the BRFSS datasets.
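The role of the initial weight 1/n and of the dependence on the former classifier can be illustrated with one boosting round. The sketch below assumes the standard discrete AdaBoost update, which the paper does not spell out, and uses synthetic data and a depth-1 decision tree as the base classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, random_state=0
)
n = len(y)

# Every training sample starts with the same weight 1/n.
weights = np.full(n, 1.0 / n)

# First base classifier: a decision stump trained with the current weights.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=weights)
missed = stump.predict(X) != y

# Weighted error of the stump and its resulting vote strength (alpha).
error = np.sum(weights[missed]) / np.sum(weights)
error = np.clip(error, 1e-10, 1 - 1e-10)  # guard against log(0)
alpha = np.log((1.0 - error) / error)

# Misclassified samples gain weight, so the next classifier focuses on them.
new_weights = weights * np.exp(alpha * missed)
new_weights /= new_weights.sum()

print(f"weighted error = {error:.3f}, alpha = {alpha:.3f}")
```

Repeating this round with `new_weights` and combining the stumps by their alpha values gives the final classifier.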

Gradient boosting classifier (GBC)
The gradient boosting classifier, like AdaBoost, is an ensemble learning algorithm. It adds predictors sequentially to the ensemble, with each one correcting its predecessor [21]. It fits each new predictor to the residual errors made by the previous predictor. It can be used in many areas such as ecology and Web search ranking [21][22][23]. This model recorded a ROC AUC of 0.90 for the PIMA dataset, second only to the extra trees classifier, and 0.82 for the BRFSS dataset.
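The idea of fitting each new predictor to the residual errors of the previous one can be shown with a small hand-built ensemble. The sketch below is an assumption-laden illustration: it uses a synthetic regression problem and three shallow regression trees because residuals are easiest to see there, whereas a gradient boosting classifier applies the same sequential idea to the gradient of a classification loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data (not from the paper).
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

# First predictor fits the targets directly.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Second predictor fits the residual errors left by the first.
residual1 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual1)

# Third predictor fits the residual errors left by the first two.
residual2 = residual1 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual2)


def ensemble_predict(X_new):
    """Sum the predictions of all sequentially trained trees."""
    return sum(tree.predict(X_new) for tree in (tree1, tree2, tree3))


mse_first = np.mean((y - tree1.predict(X)) ** 2)
mse_ensemble = np.mean((y - ensemble_predict(X)) ** 2)
print(f"training MSE, first tree: {mse_first:.5f}")
print(f"training MSE, three trees: {mse_ensemble:.5f}")
```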

Extra trees classifier (ETC)
The extra trees classifier is a bagging-style ensemble machine learning algorithm in which randomized trees are built from samples of the training dataset [20,22]. In machine learning, the extra trees classifier and extra trees regressor construct many such trees, which mitigates overfitting and improves classification accuracy [20,24]. Lastly, this model is proposed by this paper since it predicted diabetes well, with an area under the ROC curve of 0.96 for PIMA and 0.99 for BRFSS, compared to the published works in the literature.

Evaluation metric
The models are evaluated using different evaluation metrics, including the receiver operating characteristic (ROC) curve, accuracy, F1-Score, precision, and recall [20,23], which are presented in Figs. 2, 3, and Table 3, respectively. Accuracy, precision, recall, and F1-Score are computed as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where
True positive (TP) is correctly predicted by the classifier as a positive class
True negative (TN) is correctly predicted by the classifier as a negative class
False positive (FP) is incorrectly predicted by the classifier as a positive class
False negative (FN) is incorrectly predicted by the classifier as a negative class

Accuracy is the proportion of observations predicted correctly. The results of precision, accuracy, recall, and F1-Score for the classifiers are shown in Table 3.
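The four counts defined above determine accuracy, precision, recall, and F1-Score. The following sketch computes them by hand for a small set of hypothetical labels and predictions (invented for illustration, not taken from the paper's experiments) and checks the values against scikit-learn's metric functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")

# The hand-computed values agree with the library implementations.
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```

For these labels there are four true positives, four true negatives, one false positive, and one false negative, so all four metrics equal 0.80.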
The receiver operating characteristic (ROC) curve compares false and true positives at each threshold. It plots the true positive rate against the false positive rate, and the curves for all classifiers are shown in Figs. 2 and 3.
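How the ROC curve is traced across thresholds can be seen in the short sketch below. The labels and scores are hypothetical values chosen for illustration; `roc_curve` returns one (false positive rate, true positive rate) point per threshold, and the area under those points is the ROC AUC.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted positive-class scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])

# One (FPR, TPR) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")

# Area under the curve, computed two equivalent ways.
area_from_points = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, y_score)
print(f"ROC AUC = {area_direct:.3f}")  # prints "ROC AUC = 0.875"
```

Plotting `fpr` against `tpr` with Matplotlib would give a curve of the kind shown in the figures.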
Higher values for the mentioned five metrics demonstrate better performance for the classifier.

Results and discussion
Figures 2 and 3 show the ROC results of the four machine learning models using testing/validation data for the datasets. The extra trees classifier has the highest ROC AUC of 0.96 for PIMA and 0.99 for BRFSS, whereas the decision tree and AdaBoost classifiers have the lowest ROC AUC of 0.7888 and 0.8247 for the PIMA and the BRFSS datasets, respectively. Table 3 shows the classification report for the chosen models and datasets, including accuracy, recall, precision, F1-Score, weighted, and unweighted accuracy. It is worth noting that the BRFSS dataset is better recognized than the PIMA dataset by the extra trees classifier. For the PIMA dataset, the positive and negative classes were detected with the highest recognition rates of 92% and 87% by the extra trees classifier, whereas the AdaBoost and decision tree classifiers identified the positive and negative classes with the lowest recognition rates of 78% and 71%, respectively. For the BRFSS dataset, the extra trees and decision tree classifiers have the greatest recognition rates of 99% for the positive class and 94% for the negative class, respectively. However, the gradient boosting and AdaBoost classifiers had the lowest recognition rates, with 71% for the negative class and 77% for the positive class. Table 4 compares the four utilized models with the state-of-the-art research for predicting diabetes disease using the PIMA and BRFSS datasets. Among the earlier works, the largest accuracy for the PIMA dataset, 84.95%, was obtained by Lu et al., and an accuracy of 95% was obtained by Dinh et al. for the BRFSS dataset. In comparison, the proposed model achieves accuracies of 89% for the PIMA and 96% for the BRFSS datasets, the highest among the papers mentioned.
Finally, I observed that using the up-sampling strategy to balance the imbalanced datasets described above, combined with the extra trees classifier for diabetes detection, yielded the greatest recognition rates among published works in the literature.

Conclusion
Machine learning techniques are considered crucial for disease prediction. In this paper, four machine learning models have been proposed for the classification of type 2 diabetes. The PIMA and BRFSS datasets have been used, with the up-sampling technique applied to balance them. The extra trees classifier, with an area under the curve of 0.96 for PIMA and 0.99 for BRFSS, outperformed the other models. The findings of this research suggest that healthcare providers can use the ETC model for predicting chronic diseases. Deep learning models can be utilized in future work to predict other diseases, and data fusion and hybrid models will also be studied.