Healthcare predictive analytics using machine learning and deep learning techniques: a survey

Badawy, Mohammed; Ramadan, Nagy; Hefny, Hesham Ahmed

doi:10.1186/s43067-023-00108-y

Journal of Electrical Systems and Information Technology

Table 5 A comprehensive comparative study of the previous works


Disease	Study	Methodology	Dataset	Approaches	Findings
Diabetes	97	A framework to develop and evaluate ML classification models for diabetes patient prediction	PIDD	Logistic regression, KNN, SVM, and RF	Logistic regression achieved the highest accuracy (83%). Other factors, such as family history of diabetes, smoking habits, and physical inactivity, should be considered for diabetes prediction
	98	A diagnosis system created to predict diabetes	Frankfurt Hospital in Germany and PIDD provided by the UCI ML repository	RF, SVM, NB, and DT	SVM achieved the highest accuracy (83.1%). Using a DL approach to predict diabetes may lead to better results
	100	A new model to predict type 2 diabetes	The Australian CBHS health funds dataset: 18,700,000 hospital admission records from 1995 to 2018 for 124,000 de-identified patients	Logistic regression, SVM, NB, KNN, DT, RF, and ANN	RF achieved the highest accuracy (84.95%) among the models. This study relies only on a dataset of hospital admission and discharge summaries from one insurance company
	102	An ML model to predict type 2 diabetes (T2D) occurrence in the following year (Y + 1) using variables in the current year (Y)	Dataset collected at a private medical institute as electronic health records from 2013 to 2018	Logistic regression, RF, SVM, and ensemble machine learning	RF achieved the highest accuracy (73%). Additional data sources should be used to verify the models developed in this study. Tests in addition to fasting plasma glucose (FPG) should be taken into consideration
	103	ML algorithms to predict diabetes	PIDD taken from the UCI Repository	SVM and NB algorithms	NB achieved the highest accuracy (91%). The authors acknowledge that they need to extend the work to the latest dataset, which will contain additional attributes and rows
	105	A predictive model for the classification of diabetes	Participants' demographic information, medical history, and contact information, collected via a questionnaire-style data collection form	Logistic regression	Logistic regression achieved an accuracy of 92%. The authors did not compare the model with other diabetes prediction algorithms
	117	A chronic disease risk prediction framework to predict type 2 diabetes risk	Data from private healthcare funds based in Australia (749,000 patients), covering a span of 6 years between September 2009 and March 2015	Regression, parameter optimization, and tree classification	The binary tree classification achieved the highest accuracy (86.22%). The source of the dataset is hospital admission and discharge summaries; therefore, it does not contain general physician (GP) visit information and subsequent diagnoses
	120	A cuckoo search-based deep LSTM classifier for diabetes prediction	PIMA dataset	Deep convLSTM	The model achieved a maximum accuracy of 97.59%. The authors noted that more datasets are needed, as well as new approaches to improve the classifier's effectiveness
	127	A method developed based on the RNN algorithm for predicting blood glucose levels for diabetics during a period of one hour	Ohio T1DM dataset for blood glucose level prediction	RNN	The authors point out that they can only evaluate prediction goals with enough glucose level history; thus, they cannot anticipate the beginning levels after a gap, which does not improve the prediction's quality
	130	The authors designed a DL approach for delivering 30-min predictions about future glucose levels	Electronic health records datasets: OhioT1DM from clinical trials and the in silico dataset from the UVA-Padova simulator	NNs, SVR, and ARX	The DRNN model achieved the best performance, with the smallest RMSE, MARD, and time lag. Clinical datasets are limited in number and often restricted. Because certain data fields are manually entered, they are occasionally incorrect
	132	The authors proposed GluNet, an approach to glucose forecasting	OhioT1DM datasets	CNN	The authors point out that the model does not consider physiological knowledge, and that they need to test GluNet with larger prediction horizons and use it to predict overnight hypoglycemia
	133	A short-term blood glucose prediction model (VMD-IPSO-LSTM)	The data of 56 participants were chosen as experimental data from among 451 diabetes mellitus patients	LSTM	The experiments revealed that it improved prediction accuracy at 30 min, 45 min, and 60 min. The time it takes to estimate glucose levels in the short term will be reduced
	136	A new DL method to increase the reliability and precision of type 1 diabetes predictions	Dataset from 759 people with type 1 diabetes who visited Sheffield Teaching Hospitals between 2013 and 2015	CNNs	The authors point out that in the presence of insufficient data and certain physiological specificities, prediction accuracy deteriorates
	137	The authors constructed a framework for predicting and diagnosing diabetes	PIMA dataset	ANN, NB, DT, and DL	DL is regarded as the most effective method for analyzing diabetes, with a 98.07% accuracy rate. The technique uses a variety of classifiers to accurately predict the disease, but it failed to diagnose it at an early stage
COVID-19	99	An ML method for predicting COVID-19	OpenData Resources from Mexico and Brazil	Logistic regression, DT, and boosted RF	The Mexico model achieved 93% accuracy with an F1 score of 79%, and the Brazil model achieved 69% accuracy with an F1 score of 75%. The authors should address authentication and privacy management for the created data
	119	A DL approach that uses chest radiography images to differentiate between patients with mild, pneumonia, and COVID-19 infections	COV-PEN dataset	DNNs (ResNet-50)	The authors emphasized that tests using a vast and hard dataset encompassing several COVID-19 cases are necessary to establish the efficacy of the suggested system
	121	A wavelet-based CNN to handle data limitations during the fast emergence of COVID-19	Two open-source datasets from the National Institute of Health, North America	CNN	The authors hope to investigate the effects of other wavelet functions besides the Haar wavelet
	122	A CNN framework for COVID-19 identification	Public CT dataset of 2482 CT images from patients of both classifications	CNN	CNN achieved an accuracy of 96.16% and a recall of 95.41%. The authors stated that the use of the framework should be extended to multimodal medical images in the future
	126	Detecting diseases in people whose X-rays had been selected as potential COVID-19 candidates	657 chest X-ray images for the diagnosis of COVID-19	CNN and RNN	The VGG19 model was the most successful, with an accuracy rate of 95%. According to the authors, the success rate can be improved by improving data collection. In addition to chest radiography, lung tomography can be used. The success rate and performance can be enhanced by creating numerous DL models
	128	A new deep anomaly detection model for fast, reliable screening of COVID-19	X-ray dataset, which contains 100 images from 70 COVID-19 persons and 1431 images from 1008 non-COVID-19 pneumonia subjects	CNN	Sensitivity of 90.00% with a specificity of 87.84%, or sensitivity of 96.00% with a specificity of 70.65%. The authors noted that the model still has certain flaws, such as missing 4% of COVID-19 cases and having a 30% false positive rate. In addition, more clinical data are required to confirm and improve the model's usefulness
	129	COVIDX-Net framework to diagnose COVID-19 in X-ray images automatically	Small dataset of 50 images	MobileNetV2, ResNetV2, VGG19, DenseNet201, InceptionV3, Inception, and Xception	The F1 scores for the VGG19 and DenseNet models were 89% and 91%, respectively. With an F1 score of 67%, the InceptionV3 model had the weakest classification performance
	134	A new paradigm for primary COVID-19 detection based on a radiology review of chest radiography or chest X-ray	X-rays from verified COVID-19 patients (408 images), confirmed pneumonia patients (4273 images), and healthy people (1590 images) to perform a three-class image classification. There are 6271 people in total in the dataset	CNN	Accuracy ranged from 93.90% to 93.90%. The authors will face a restriction, particularly when it comes to adopting such a model on a large scale for practical usage
	135	DL models for predicting the number of COVID-19-positive cases in Indian states	The Ministry of Health and Family Welfare dataset contains 32 individual time series of confirmed COVID-19 cases, one for each of the states (28) and union territories (4), since March 14, 2020	RNN-based LSTMs	Bidirectional LSTM produced the best performance in terms of prediction errors, while convolutional LSTM produced the worst performance. Daily and weekly forecasts were calculated, and bi-LSTM produced accurate results (error less than 3%) for short-term prediction (1–3 days)
Heart disease	104	A framework for detecting heart disease in its earliest stages	UCI heart disease dataset	K-means clustering	K-means clustering achieved an accuracy of 94.06%. The authors should apply the proposed technique using more than one algorithm and more than one dataset
	107	A decision-making system that assists with automated predictions about the condition of the patient’s heart	Cleveland Heart Disease dataset	KNN, RF, DT, and NB	KNN achieved the highest accuracy (94%). The authors should extend the presented technique to leverage more than one dataset and forecast different diseases
	108	A model for predicting heart disease in the earliest stage	Cleveland dataset	NB, SVM, KNN, and DT	KNN achieved the highest accuracy (90.79%)
	109	An ML model to predict heart disease	Cardiovascular dataset	DT, NB, logistic regression, RF, SVM, and KNN	DT achieved the highest accuracy (73%). The authors highlighted that ensemble ML techniques employing the CVD dataset can generate a better illness prediction model
	110	A framework to improve prediction accuracy for heart disease	Cardiovascular study on residents of the town of Framingham, Massachusetts. Contains different variables like age, gender, sex, chest pain, slope, and target	Logistic regression algorithm, Scikit-Learn in ML	Accuracy of 87%. The time complexity of the models used needs to be optimized
	112	Predicting the risk factors that cause heart disease	Cleveland heart disease dataset	K-means clustering algorithm	Age, maximum heart rate, and chest pain type play a vital role in predicting heart disease. The dataset is too small
	113	A prediction model for heart disease survivability using various ML techniques	Cleveland heart disease dataset	DT, RF, logistic regression, SVM, and NB	RF achieved the highest accuracy (87%). RF gives better accuracy on low-dimensional datasets. The model could be extended to a distributed environment such as MapReduce, Apache Mahout, and HBase
	114	A single hybrid model that combines several algorithms to predict heart disease	Cleveland heart disease dataset	NB, SVM, KNN, NN, J4.8, RF, and GA. NB and SVM	The proposed model achieved an accuracy of 89.2%. The authors noted that the dataset is small; hence, the system could not be trained adequately, so the accuracy of the method was poor
	116	Predicting coronary heart disease based on ML techniques	Sample of males in a heart disease high-risk region of the Western Cape in South Africa (462 instances)	NB, SVM, and DT	SVM and DT J48 outperformed NB with a specificity rate of 82% but proved to have an unacceptable sensitivity rate of less than 50%
	118	A system for predicting patients with the more common chronic diseases	Indian chronic kidney disease dataset	CNN, KNN, NB, DT, and logistic regression	CNN and KNN achieved the highest accuracy (95%). The proposed technique should be applied using more than one dataset
Liver disease	115	A framework concentrated on the utilization of clinical data for liver disease prediction	Northeast of Andhra Pradesh, India	Logistic regression, KNN, DT, SVM, NB, and RF	Based on the F1 measure: logistic regression 75%, naive Bayes 53%. Other models need to be adopted to achieve higher accuracy
Multiple Disease Detection	101	A healthcare management system used by patients to schedule appointments with doctors and verify prescriptions	Datasets of diabetes, heart disease, chronic kidney disease, and liver disease	DT, RF, logistic regression, and NB	Logistic regression achieved the highest accuracy (98.5%) on the heart dataset. Image datasets should be included to allow image processing of reports and the deployment of DL to detect diseases
	106	A prediction model that analyzes the user's symptoms and predicts the disease at an early stage	A total of 41 disorders were included as a dependent variable	DT, NB	All algorithms achieved the same accuracy score of 95.12%. The authors noticed that overfitting occurred when all 132 symptoms from the original dataset were assessed instead of 95 symptoms. That is, the tree appears to remember the dataset provided and thus fails to classify new data. As a result, just 95 symptoms were assessed during the data-cleansing process, with the best ones being chosen
	111	A reliable prediction model for predicting lung cancer	Heart disease dataset and lung cancer dataset	SVM, genetic algorithms	Lung cancer accuracy of 81.8182%. Primitive tools were used. The size, type, and source of the data used are not mentioned
	123	An LSTM approach to multi-disease prediction for intelligent clinical decision support, predicting future disease diagnoses	A large clinical record dataset (over 5 million records) collected from a hospital in Southeast China	LSTM	The F1 score rises from 78.9% and 86.4% with the state-of-the-art conventional and DL models, respectively, to 88.0% with the LSTM approach. The authors stated that the model prediction performance may be enhanced further by including new input variables and that, to reduce computational complexity, the method only uses one data source
	124	An approach to predicting diabetes by creating a supervised ANN structure based on subnets instead of layers	Iris and diabetes dataset	Multilayer perceptrons (MLPs) as well as LSTM	The proposed deep learning model achieved 97% accuracy. The model's usefulness is limited because it was not implemented on large textual and image datasets
	125	A novel AI and Internet of Things (IoT) convergence-based disease detection model for a smart healthcare system	Diabetes and heart disease	CSO-LSTM	CSO-LSTM achieved an accuracy of 96.16%. This method offered a greater prediction accuracy for heart disease and diabetes diagnosis, but there was no feature selection mechanism; hence, it requires extensive computation
	134	A DNN model to predict stroke death based on medical history and human behaviors utilizing large-scale electronic health information	Korean National Hospital Discharge In-depth Injury Survey (KNHDS) data from 2013 to 2016	DNN	The sensitivity, specificity, and AUC values were 64.32%, 85.56%, and 83.48%, respectively
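
Several rows in the table report sensitivity and specificity, and the findings for study 128 restate them as a miss rate and a false positive rate. The minimal Python sketch below shows that arithmetic for the two operating points listed for that study. The function name `complement_rates` is hypothetical and introduced only for illustration; it does not come from any of the surveyed works.

```python
def complement_rates(sensitivity_pct, specificity_pct):
    """Return (miss rate, false positive rate) in percent.

    Miss rate (false negative rate) is the complement of sensitivity;
    false positive rate is the complement of specificity.
    """
    miss_rate = 100.0 - sensitivity_pct
    false_positive_rate = 100.0 - specificity_pct
    return miss_rate, false_positive_rate


# The two (sensitivity, specificity) pairs reported for study 128.
operating_points = [(90.00, 87.84), (96.00, 70.65)]

for sens, spec in operating_points:
    miss, fpr = complement_rates(sens, spec)
    print(
        f"sensitivity={sens:.2f}% specificity={spec:.2f}% "
        f"-> miss rate={miss:.2f}%, false positive rate={fpr:.2f}%"
    )
```

At the second operating point this gives a miss rate of 4% and a false positive rate of about 29.35%, which is consistent with the "missing 4%" and "30% false positive rate" quoted in the findings.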
