Front Public Health

Early-Stage Alzheimer's Disease Prediction Using Machine Learning Models

Vinodhini Mani 1, S. R. Srividhya 1, Osamah Ibrahim Khalaf 2, Carlos Andrés Tavera Romero 3

1 Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, India

2 Al-Nahrain Nanorenewable Energy Research Center, Al-Nahrain University, Baghdad, Iraq

3 COMBA R&D Laboratory, Faculty of Engineering, Universidad Santiago de Cali, Cali, Colombia

Associated Data

Publicly available datasets were analyzed in this study. The data can be found at: https://www.kaggle.com/jboysen/mri-and-alzheimers?select=oasis_cross-sectional.csv

Alzheimer's disease (AD) is the leading cause of dementia in older adults. There is currently considerable interest in applying machine learning to detect diseases such as Alzheimer's and diabetes, which affect large populations around the world and whose incidence rates are rising at an alarming rate every year. In Alzheimer's disease, the brain undergoes neurodegenerative changes. As the population ages, more and more individuals, their families, and healthcare systems will experience diseases that affect memory and functioning, with profound social, financial, and economic effects. In its early stages, Alzheimer's disease is hard to predict, yet treatment given at an early stage of AD is more effective and causes less damage than treatment begun at a later stage. Several techniques, including Decision Tree, Random Forest, Support Vector Machine, Gradient Boosting, and Voting classifiers, have been employed to identify the best parameters for Alzheimer's disease prediction. Predictions are based on the Open Access Series of Imaging Studies (OASIS) data, and the performance of the ML models is measured with precision, recall, accuracy, and F1-score. The proposed classification scheme can be used by clinicians to diagnose the disease, and early diagnosis with these ML algorithms can help lower the annual mortality rate of Alzheimer's disease. The proposed work shows better results, with a best validation average accuracy of 83% on the AD test data; this test accuracy is significantly higher than in existing works.

Introduction

Alzheimer's disease (AD) is a progressive neurological condition that leads to short-term memory loss, paranoia, and delusional ideas that are often mistaken for the effects of stress or aging. In the United States, the disease affects about 5.1 million people. AD has no definitive medical treatment; continuous medication is necessary to keep it under control. AD (1) is chronic, so it can last for years or for the rest of a patient's life. It is therefore most important to prescribe medication at the appropriate stage, before the brain is damaged to a great extent. Early detection of the disease is a tedious and costly process, since it requires collecting a great deal of data, using sophisticated prediction tools, and involving an experienced doctor. Automated systems can be more accurate than human assessment and can be used in medical decision support systems because they are not subject to human error. In previous research on AD, researchers have applied images (MRI scans), biomarkers (chemicals, blood flow), and numerical data extracted from MRI scans to study the disease and determine whether a person is demented. In addition to shortening diagnosis time, automating Alzheimer's diagnosis reduces human interaction, lowers overall costs, and provides more accurate results. For example, we can predict whether a patient is demented by analyzing MRI scans and applying prediction techniques; a person with early-stage Alzheimer's disease is considered demented.

When a person has Alzheimer's disease in the early stages, they can usually function without assistance. In some cases, the person can still work, drive, and take part in social activities. Even so, the person may feel uneasy or suffer from memory lapses, such as not remembering familiar words and locations, and people close to the individual may notice that they have difficulty remembering names. Through a detailed medical interview, a doctor may identify problems with memory and concentration. Common challenges in the early stage of Alzheimer's disease include:

  • Difficulty remembering the right word or name.
  • Difficulty remembering names when meeting new people.
  • Difficulty coping with everyday tasks in social or workplace settings.
  • Forgetting material that has just been read.
  • Losing or misplacing valuable objects.
  • Increasing trouble with planning or organizing tasks and activities.

Alzheimer's symptoms become more persistent as the disease progresses. People with advanced dementia lose the ability to communicate, adapt to their environment, and eventually move, and it becomes much more difficult for them to express pain through words or phrases. Individuals may need substantial assistance with daily activities as their memory and cognitive skills continue to decline. At this stage, individuals may:

  • Require around-the-clock assistance with personal care and daily activities.
  • Lose awareness of their surroundings and of recent experiences.
  • Experience changes in physical abilities, including walking, sitting, and eventually swallowing.
  • Find communication with others increasingly difficult.
  • Become more vulnerable to infections, especially pneumonia.

Under current conditions, human intuition and standard measurements do not always coincide. To solve this problem, we need to leverage innovative, computationally intensive, non-traditional approaches such as machine learning. Machine learning techniques are increasingly used in disease prediction and visualization to offer predictive and personalized medicine. In addition to improving patients' quality of life, this trend aids physicians in making treatment decisions and health economists in their analyses. When viewing medical reports, radiologists may miss other disease conditions, since a reading considers only a few causes and conditions. The goal here is to identify the knowledge gaps and potential opportunities associated with ML frameworks and EHR-derived data.

Contribution

In our research work, people affected by Alzheimer's disease are identified, and we aim to find individuals who potentially have Alzheimer's at an early stage. The Alzheimer's disease datasets available on OASIS and Kaggle are used to train various machine learning algorithms, such as SVM, Random Forest, Decision Tree, XGBoost, and Voting classifiers, on all patients' data to distinguish affected individuals with a high degree of efficiency and speed. Finally, an overview of how the disease has affected the population according to various criteria is analyzed.

Organization

Our work is organized as follows: Section Related Works addresses recent papers on detecting Alzheimer's disease using machine learning and deep learning models. Section Materials and Methods discusses the exploratory data analysis and the different machine learning classifier models. Section Results and Discussion addresses the performance measures of the different machine learning models. Finally, Section Conclusion concludes the work and discusses future work.

Related Works

Alzheimer's disease has been predicted with ML algorithms by applying feature selection and extraction techniques, with classification conducted on the OASIS longitudinal dataset. A brief overview of the different techniques (2) involved in analyzing brain images for diagnosing brain diseases has been provided. That article discusses several major issues relating to machine learning and deep learning-based brain disease diagnostics based on the reviewed articles, identifies the most accurate methods of detecting brain disorders, and can be used to improve future techniques. Using machine learning and deep learning platforms, the study combines recent research on four brain diseases: Alzheimer's disease, brain tumors, epilepsy, and Parkinson's disease. By drawing on the 22 brain disease databases used most often in the reviewed work, the authors determine the most accurate diagnostic methods.

Martinez-Murcia et al. (3) use deep convolutional autoencoders for exploratory data analysis of AD. A data-driven decomposition of MRI images allows the extraction of MRI features that represent an individual's cognitive symptoms as well as the underlying neurodegeneration process. Regression and classification analyses are then performed to examine the distribution of the extracted features in a wide variety of combinations, and the influence of each coordinate of the autoencoder manifold on the brain is calculated. MMSE or ADAS11 scores, along with imaging-derived markers, can predict AD diagnosis with over 80% accuracy.

A deep neural network with fully connected layers (4, 5) has been used to perform binary classification, with a different activation function at each hidden layer; the best-performing model is chosen after k-fold validation. Researchers at the Lancet Commission found that about 35% of Alzheimer's risk factors can be modified. These risks include a lack of education, hypertension, obesity, hearing loss, depression, diabetes, lack of physical activity, smoking, and social isolation, and it is beneficial to eliminate them regardless of the stage of life at which they act. Studies have suggested (6) that early intervention and treatment of modifiable Alzheimer's risk factors can prevent or delay 30% of cases of Alzheimer's (7). According to the Innovative Midlife Intervention for Dementia Deterrence (In-MINDD) project (8), one way to calculate Alzheimer's risk from risk factors is the Lifestyle for Brain Health (LIBRA) index (9–12). According to the National Academy of Medicine (13, 14), cognitive training, hypertension management, and increased physical activity are the three main categories of dementia intervention. The most common type of dementia is Alzheimer's disease (AD); vascular dementia (VaD) is the second most common, followed by dementia with Lewy bodies. A few other types of dementia are associated with brain injuries, infections, and alcohol abuse. Tariq and Barber (15) suggested in their study that dementia can be prevented by targeting modifiable vascular risk factors, because AD and VaD often co-exist in the brain and share some modifiable risk factors. Williams et al. (16) obtained predictions of cognitive functioning from neuropsychological and demographic data using four different models: SVM, Decision Tree, NN, and Naïve Bayes. In that work, average values were substituted for missing values, and the accuracy of Naïve Bayes was the highest. Data from the ADNI study, analyzed with ten-fold cross-validation (17, 18), show high correlations between genetic, imaging, biomarker, and neuropsychological outcomes. MRI images from the OASIS dataset (19, 20) have been analyzed using voxel-based morphometry. Table 1 summarizes recent work on the prediction of Alzheimer's disease.

Table 1. Summary of recent work related to AD.

Materials and Methods

The proposed approach consists of three basic steps. First, the Alzheimer's disease dataset (24–26) was loaded into pandas for preprocessing. This study utilized a longitudinal dataset, so a timeline of the study was necessary to gain further insight into the data. Our first step was to determine how cross-sectional the data appear to be, either at baseline or at a particular time point. A complete analysis of the data was then conducted, including a comparison of the main study parts and the corresponding data collected during each visit. In this work, longitudinal MRI data are our primary data source. MRI data from 150 right-handed patients aged 60 to 96 were included in the study, and each patient was scanned at least once. Throughout the study, 72 of the patients were classified as "non-demented". At the time of their initial visits, 64 patients were classified as "demented", and they remained in this category throughout the study; the remaining 14 patients were classified as non-demented at their initial visit and as demented later on. Table 2 shows the dataset description of the MRI data.

Table 2. Dataset description.

Machine learning techniques (26, 27) were applied to the Alzheimer's disease datasets to bring a new dimension to predicting the disease at an early stage. The raw Alzheimer's disease datasets are inconsistent and redundant, which affects the accuracy of the algorithms (28, 29), so before evaluating machine learning algorithms the data must be effectively prepared by removing unwanted attributes, missing values, and redundant records. Building a machine learning model requires splitting the data into training and testing sets. In the data preparation step that follows, the training data were used to create a model, which was then applied to the test data to predict Alzheimer's disease (28, 30, 31). The model was trained on the training set and evaluated on the unseen test set. Cross-validation was carried out by dividing the dataset into three subsets: predictions are made on one subset (the test data), while model fitting and performance evaluation use the other subsets (training and validation). After preprocessing, we randomly divided the data in an 80:20 ratio, with 80% going to training and 20% to testing. Figure 1 describes the system workflow for predicting Alzheimer's disease at an early stage (32, 33).
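The 80:20 split described above can be sketched as follows (a minimal, stdlib-only illustration; the helper name and seed are hypothetical):

```python
import random

def train_test_split_indices(n_rows, train_frac=0.8, seed=42):
    """Randomly partition row indices into train and test sets."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = round(train_frac * n_rows)  # 80% of the rows go to training
    return indices[:cut], indices[cut:]

# 150 patients, as in the OASIS longitudinal subset used here
train_idx, test_idx = train_test_split_indices(150)
print(len(train_idx), len(test_idx))  # 120 30
```

In practice the same split is usually obtained with scikit-learn's `train_test_split`; the sketch just makes the index bookkeeping explicit.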

Figure 1. Proposed workflow.

Data Preparation

Various data-mining techniques were used to clean and preprocess the data in this phase, including handling missing values and extracting and transforming features. In the SES column, we identified 9 rows with missing values (34, 35). This issue can be addressed in two ways. The simplest solution is to drop the rows with missing values; the other is imputation (21), which replaces missing values with estimated ones. Since we have only 140 measurements, the model should perform better if we impute. In this work, the 9 rows with missing values in the SES attribute are removed in one setting, and in the other the median value is used for imputation.
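Median imputation of a column such as SES can be sketched as follows (stdlib-only; the sample values are hypothetical):

```python
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

ses = [2, 3, None, 1, 4, None, 2]  # hypothetical SES values with gaps
print(impute_median(ses))  # [2, 3, 2, 1, 4, 2, 2]
```

With pandas, the equivalent one-liner is typically `df["SES"].fillna(df["SES"].median())`.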

Data Analysis

We discuss the relationships between each feature of an MRI test and dementia in this section. To make the relationships in the data explicit through graphs, we conducted this exploratory data analysis (36, 37) to estimate the correlations before extracting or analyzing the data. This information can later be used to interpret the nature of the data and to determine which analysis method to use. Table 3 shows the min, max, and median values of each attribute.

Table 3. Min, max, and median values of each attribute.

Feature Selection

Feature selection is very important in machine learning. In this work, feature selection is applied to the clinical Alzheimer's disease data, where we have thousands of samples. Feature selection (22) has three families of methods: filter methods, wrapper methods, and embedded methods. Filter methods are commonly used in the preprocessing stage. Wrapper methods score candidate feature subsets using a model. Finally, embedded methods combine the filter and wrapper approaches by performing selection during model training.

The feature selection methods chosen in this work are among the most common and popular: the correlation coefficient, information gain, and chi-square.

Correlation Coefficient

The covariance between two variables X and Y is defined as cov(X, Y) = E[(X − E[X])(Y − E[Y])].

The covariance between two variables measures the linear relationship between them, and the correlation coefficient normalizes it by the standard deviations of X and Y. Using correlation coefficients, it is easy to find correlations between the various stages of Alzheimer's. The drawback of this method is that the data are collected from a broad range of sources, so it is very sensitive to outliers.
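A stdlib-only sketch of the Pearson correlation coefficient, built directly from the covariance and the standard deviations (the sample values are hypothetical):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: cov(X, Y) / (std(X) * std(Y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# e.g. MMSE scores against a hypothetical severity rating: strongly negative
print(pearson([30, 28, 24, 20, 15], [0.0, 0.5, 1.0, 1.5, 2.0]))
```

In practice `pandas.DataFrame.corr()` computes the same quantity for all attribute pairs at once.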

Information Gain

When attribute D is selected for a split, the information gain is obtained by subtracting the weighted entropy of the child nodes from the entropy of the parent node: Gain(S, D) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v).
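A stdlib-only sketch of this computation (the labels below are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - weighted

# A perfect split on a hypothetical attribute separates the classes entirely,
# so the gain equals the parent entropy (1 bit for a balanced binary parent).
labels = ["demented"] * 4 + ["nondemented"] * 4
print(information_gain(labels, [labels[:4], labels[4:]]))  # 1.0
```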

Chi-Square

Using this method, we can examine relationships between categorical variables, such as the relationship between food and obesity.
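The chi-square statistic compares observed counts with the counts expected under independence, χ² = Σ (O − E)² / E. A stdlib-only sketch for a two-way contingency table (the counts are hypothetical):

```python
def chi_square(table):
    """Chi-square statistic for a 2D contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# hypothetical counts: rows = demented / non-demented, cols = low / high education
print(chi_square([[30, 10], [15, 25]]))
```

A large statistic suggests the feature and the class label are not independent, which is why chi-square is usable for ranking categorical features.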

Preparation and Splitting the Data

Figure 2 shows a schematic representation of the data splitting stage.

Figure 2. Representation of data splitting.

  • Select data: M.F, Age, EDUC, SES, MMSE, eTIV, nWBV, ASF, CDR
  • Train_Data <- round(0.8 * nrow(data)) # select 80% of the rows for training
  • TrainData_indices <- sample(1:nrow(data), Train_Data) # vector of random indices
  • TrainML <- data[TrainData_indices, ] # training dataset is generated
  • SplitFormula <- CDR ~ M.F + Age + EDUC + SES + MMSE + eTIV + nWBV
  • Split <- nWayCrossValidation(nrow(data), 5) # 5-fold cross-validation is generated
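A Python equivalent of the fold generation above can be sketched as follows (stdlib-only; the function name and seed are hypothetical):

```python
import random

def k_fold_indices(n_rows, k=5, seed=1):
    """Shuffle row indices and partition them into k near-equal folds."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(150, k=5)
print([len(f) for f in folds])  # [30, 30, 30, 30, 30]
```

Each fold then serves once as the held-out set while the model trains on the remaining folds.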

Classifier Models

Decision Tree (DT)

A decision tree is a tree-based model that divides the data repeatedly based on cutoff values of the features. Each split separates instances into subsets: the intermediate subsets are referred to as internal nodes, and the terminal ones as leaf nodes. A decision tree is most useful when there is significant interaction between the features and the target.

Random Forest (RF)

A random forest model typically performs better than a single decision tree because it reduces the problem of overfitting. A random forest consists of various decision trees, each slightly different from the others. Using majority voting over trees trained on bootstrap samples (bagging), the ensemble combines the predictions of the individual decision tree models. As a result, overfitting is reduced while the predictive ability of each tree is maintained.
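The majority vote over the individual trees can be sketched as follows (stdlib-only; the per-tree predictions are hypothetical):

```python
from collections import Counter

def bagged_predict(tree_predictions):
    """Combine per-tree class predictions by majority vote."""
    votes_per_sample = zip(*tree_predictions)  # transpose: one vote tuple per sample
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]

# three hypothetical trees, each predicting labels for four patients
trees = [
    ["demented", "nondemented", "demented", "nondemented"],
    ["demented", "demented", "demented", "nondemented"],
    ["nondemented", "nondemented", "demented", "demented"],
]
print(bagged_predict(trees))  # ['demented', 'nondemented', 'demented', 'nondemented']
```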

Support Vector Machine (SVM)

This method determines the class of data points using appropriate hyperplanes in a multidimensional space. With SVM (25), we aim to find a hyperplane that separates the cases of two categories, which occupy neighboring clusters of vectors, one on each side of the hyperplane. The vectors closest to the hyperplane are the support vectors. SVM uses training and test data: the training data are broken into target values and attributes, and SVM produces a model that predicts the target values of the test data.

XGBoost

XGBoost stands for eXtreme Gradient Boosting. It is an implementation of gradient-boosted decision trees designed for maximum speed and performance. Because model training is sequential, gradient boosting machines are generally slow to train and not very scalable; XGBoost focuses on improving both speed and performance.

Voting Classifier

Voting is one of the simplest ways of combining the predictions of multiple learning algorithms. A voting classifier is not actually a classifier itself but a wrapper around several classifiers that are trained and evaluated in parallel in order to benefit from their individual characteristics. We can train the data set with different algorithms and then ensemble them to predict the final output. There are two ways to reach a majority vote on a prediction:

Hard voting: The simplest form of majority voting. The class with the most votes (Nc) is chosen; the prediction follows the majority vote of the individual classifiers.

Soft voting: This involves adding up the probability vectors for each predicted class (over all classifiers) and choosing the class with the highest total (recommended only when the classifiers are well calibrated).
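Both schemes can be sketched as follows (stdlib-only; the classifier outputs are hypothetical, with class 0 = non-demented and class 1 = demented):

```python
from collections import Counter

def hard_vote(class_predictions):
    """Majority vote over the predicted class labels."""
    return Counter(class_predictions).most_common(1)[0][0]

def soft_vote(probability_vectors):
    """Sum the per-class probability vectors and pick the class with the largest total."""
    totals = [sum(p) for p in zip(*probability_vectors)]
    return totals.index(max(totals))

print(hard_vote([1, 0, 1]))  # 1
print(soft_vote([[0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]))  # class totals [1.3, 1.7] -> 1
```

scikit-learn's `VotingClassifier` exposes the same choice through its `voting='hard'`/`voting='soft'` parameter.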

Model Validation

Model validation reduces the overfitting problem. Cross-validation is used to train the ML model and to estimate its accuracy, since it is challenging to keep the model free of noise. In this research work, cross-validation divides the whole dataset into n folds of equal size; in every iteration the model is trained on n − 1 folds and tested on the remaining one, and performance is reported as the mean over all n folds. In this work, the model was trained and tested 10 times by applying ten-fold cross-validation.

Results and Discussions

We evaluate performance metrics including accuracy, precision, recall, and F1 score. To determine the best parameters for each model (Decision Tree, SVM, Random Forest, XGBoost, and Voting), we perform 5-fold cross-validation, and finally we compare the accuracy of the models. Several metrics and techniques were used to identify overfitting and parameter-tuning issues after the models were developed. Performance evaluation can be binary or multiclass and is described using the confusion matrix. A machine learning classifier was developed and validated to predict and separate people truly affected by Alzheimer's disease from a given population. The following evaluation measures were calculated: precision, recall, accuracy, and F-score. In this study, recall (sensitivity) is the proportion of people with Alzheimer's who are accurately classified as having it, while precision is the proportion of people predicted to have Alzheimer's who truly have the disease. F1 is the weighted average of recall and precision, and accuracy is the proportion of all people correctly classified. Based on the results, the patient receives a report indicating the stage of Alzheimer's disease he or she is currently in. Detecting the stage is very important because the stages are based on the patients' responses, and knowing the stage helps doctors better understand how the disease is affecting them. This research used the following environments, tools, and libraries for its experiments and analysis:

  • a) Environments Used: Python 3
  • b) Scikit-learn libraries for machine learning

Figure 3 indicates that men are more likely than women to have dementia. Figure 4 shows that the non-demented group had much higher MMSE (Mini-Mental State Examination) scores than the demented group.

Figure 3. Analysis of demented and non-demented rate based on gender (Female = 0, Male = 1).

Figure 4. Analysis of MMSE scores for the demented and non-demented groups of patients.

Figures 5A–C show the analyzed values of ASF, eTIV, and nWBV for the demented and non-demented groups of people. As the graphs in Figure 5 indicate, the non-demented group has a higher brain volume ratio than the demented group, because the disease affects the brain tissues and causes them to shrink. Figure 6 shows the analyzed results of EDUC (years of education) for demented and non-demented people.

Figure 5. (A–C) Analysis of ASF, eTIV, and nWBV for the demented and non-demented groups.

Figure 6. Analysis of years of education.

Figure 7 analyzes the age attribute to find the percentage of people affected in the demented and non-demented groups. A higher percentage of demented patients are 70-80 years old than of non-demented patients. People with this disease likely have a low survival rate; only a few are over 90 years old.

Figure 7. Analysis of people in the demented and non-demented groups based on age.

From the analyses of the attributes above, the intermediate results are summarized as follows:

  • Men are more likely than women to be demented, i.e., to have Alzheimer's disease.
  • In terms of years of education, demented patients were less educated.
  • Brain volume in the non-demented group is greater than in the demented group.
  • The demented group contains a higher concentration of 70-80-year-olds than the non-demented group.

Table 4 shows the performance comparison of accuracy, precision, recall, and F1 score for the different ML models. The performance measures are defined as follows.

Table 4. Performance comparison of different ML models.

Accuracy: the proportion of correctly classified instances out of all instances, Accuracy = (TP + TN) / (TP + TN + FP + FN).

Precision: the number of correctly predicted positives divided by the total number of predicted positives, Precision = TP / (TP + FP). A precision of 1 indicates a good classifier.

Recall: the true positive rate, Recall = TP / (TP + FN). A recall of 1 indicates a good classifier.

F1 Score: a measure that considers both recall and precision, F1 = 2 · Precision · Recall / (Precision + Recall). The F1 score becomes 1 only when both precision and recall are 1.

These metrics are computed from the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts. Figures 8–13 show the confusion matrices for the Decision Tree, Random Forest, SVM, XGBoost, soft voting, and hard voting classifier models.
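The four measures follow directly from these counts; a minimal sketch (the counts below are hypothetical, not the paper's results):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```

scikit-learn's `classification_report` produces the same quantities directly from predicted and true labels.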

Figure 8. Confusion matrix for the Decision Tree classifier.

Figure 9. Confusion matrix for the Random Forest classifier.

Figure 10. Confusion matrix for SVM.

Figure 11. Confusion matrix for XGBoost.

Figure 12. Confusion matrix for the soft voting classifier.

Figure 13. Confusion matrix for the hard voting classifier.

A comparison of training and testing accuracy was conducted for each model to check for overfitting. For each model, precision, recall, accuracy, and F1-score are shown in Table 4. Based on this analysis, the best-performing techniques are Random Forest and XGBoost, with the accuracy of the Voting classifier close behind. All the experimental results (the average accuracy, precision, recall, and F-measure of each model) were collected for further analysis. Comparative analyses of all the machine learning models in terms of accuracy, precision, recall, and F1 score are presented graphically in Figures 14–17, respectively.

Figure 14. Comparison of accuracy.

Figure 15. Comparison of precision.

Figure 16. Comparison of recall.

Figure 17. Comparison of F1 score.

Conclusions

Alzheimer's is a major health concern, and in the absence of a cure it is most important to reduce risk, provide early intervention, and diagnose symptoms early and accurately. As seen in the literature survey, many efforts have been made to detect Alzheimer's disease with different machine learning algorithms and micro-simulation methods; however, identifying relevant attributes that can detect Alzheimer's very early remains a challenging task. Future work will focus on extracting and analyzing new features that are more likely to aid in the detection of Alzheimer's disease, and on eliminating redundant and irrelevant features from existing feature sets to improve the accuracy of detection techniques. By adding metrics such as MMSE and education to our model, we will be able to train it to distinguish between healthy adults and those with Alzheimer's.

Data Availability Statement

Author Contributions

CK: research concept and methodology and writing—original draft preparation. VM: review and editing. SS: supervision. OK and CT: validation. All authors contributed to the article and approved the submitted version.

Funding

This research has been funded by Dirección General de Investigaciones of Universidad Santiago de Cali under call No. 01-2021.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.


Popular deep learning algorithms for disease prediction: a review

  • Published: 13 September 2022
  • Volume 26, pages 1231–1251 (2023)


  • Zengchen Yu (ORCID: orcid.org/0000-0003-1931-0810) 1
  • Ke Wang 2
  • Zhibo Wan 1
  • Shuxuan Xie 1
  • Zhihan Lv 3


Due to its automatic feature learning ability and high performance, deep learning has gradually become the mainstream of artificial intelligence in recent years, playing a role in many fields. In the medical field especially, the accuracy of deep learning can even exceed that of doctors. This paper introduces several deep learning algorithms: the Artificial Neural Network (ANN), FM-Deep Learning, the Convolutional Neural Network, and the Recurrent Neural Network, and expounds their theory, development history, and applications in disease prediction. We analyze the defects in the current disease prediction field and give some current solutions, and we expound two major trends in the future of disease prediction and medicine: integrating Digital Twins and promoting precision medicine. This study can inspire relevant researchers, helping them understand related disease prediction algorithms and then conduct better related research.


1 Introduction

In recent years, with the development of medical detection technology, a large amount of health data has been generated, which requires corresponding big data analysis methods to process these data and generate valuable information that is helpful for disease diagnosis, personalized medicine and other medical activities. Artificial intelligence (AI) and machine learning can be used to identify, analyze, predict and classify medical data [ 1 ], so in the past 10 years various AI algorithms have been effectively applied to process data generated in healthcare [ 2 , 3 ], such as applying logistic regression to heart disease prediction to achieve early detection of heart disease [ 4 ]. However, when the data reaches a certain scale, the efficiency of traditional machine learning algorithms is significantly reduced; that is, these algorithms lack big data analysis capabilities. Deep learning algorithms, namely deep neural networks (DNNs), can solve this problem. A DNN simulates the conduction of the human brain's neural network (NN) and defines the input and output through compositions of complex layers; each layer includes corresponding neurons and nonlinear functions (activation functions) [ 5 ]. Compared with traditional machine learning, the advantage of deep learning is that it can learn from raw data and has multiple hidden layers. It can learn abstract information from the input, process massive data and obtain high accuracy and performance. Therefore, it has been applied to the medical field by many scholars.

This article divides deep learning algorithms into two types according to the data they process: structured data algorithms and unstructured data algorithms. Structured data algorithms include the Artificial Neural Network (ANN) and Factorization Machine-Deep Learning (FM-Deep Learning), which can play a better role in processing structured medical record data. Combining FM with a DNN can solve many problems that an ordinary DNN cannot solve. FM is developed from matrix factorization algorithms. Singular Value Decomposition (SVD), non-negative matrix factorization and probabilistic matrix factorization are traditional matrix factorization methods. They can decompose high-dimensional matrices into two or more low-dimensional matrices, which makes it convenient to study the properties of high-dimensional data in a low-dimensional space. These matrix factorization methods are widely used in prediction, recommendation and other fields because of their high scalability and good performance. However, traditional matrix factorization methods lack the effective use of context information. In this context, the FM model was proposed and popularized. FM was proposed by Rendle [ 6 ]. It is a supervised learning model [ 7 ] that combines the advantages of matrix factorization and the Support Vector Machine (SVM). It is similar to SVM; the difference is that FM models pairwise feature interactions as the inner product of hidden vectors between features through matrix factorization, so as to better mine feature interaction information, reduce complexity, handle sparsity and improve performance. FM was first applied to Click-through Rate (CTR) prediction to mine the information behind users' click behavior. But real-life data are often highly non-linear, so capturing high-order feature interaction information can significantly improve performance.
Although FM can theoretically model high-order feature interactions, doing so causes parameter explosion and a huge amount of computation, resulting in a significant increase in time complexity and storage consumption. Therefore, usually only second-order feature interaction modeling is considered. If high-order feature combination is performed manually, there are the following disadvantages: (1) experts in related fields need to spend a lot of time studying the correlations between features, which is time-consuming and laborious; (2) for large-scale prediction systems, the amount of data is huge, and it is unrealistic to extract features manually; (3) it is impossible to generalize to feature interactions that are not in the training set. Deep learning can automatically perform various combinations and nonlinear transformations on the input features, so as to learn high-order feature interaction information. Therefore, the combination of deep learning and FM can capture low-order to high-order features, and can better predict whether patients have diseases and which disease types.

Unstructured data algorithms include Convolutional NNs (CNN) and Recurrent NNs (RNN), etc. This article will only explore the development of CNN and RNN and their applications in the medical field. A CNN [ 8 ] is a DNN structure that includes convolutional computation, which has the ability of representation learning and can realize translation-invariant classification of input information according to its hierarchical structure. CNNs generally include convolutional layers, batch-normalization layers, pooling layers, fully connected layers, etc., the core of which is the convolutional layer. The function of the convolutional layer is to perform feature extraction on the input image. The convolutional layer contains multiple convolution kernels. Each element of a convolution kernel has a corresponding weight coefficient and bias value, similar to the neurons of a feed-forward NN. Convolution computation means that the convolution kernel slides over the image, and its elements are multiplied with the covered image features and summed. This process achieves the effect of extracting local features and reducing parameters. Because a CNN can extract local features and reduce parameters (through weight sharing), it is particularly suitable for the field of image processing. Because there is a lot of image data in the medical field, the application range of CNN in the medical field exceeds that of other models. A CNN can handle the spatial dimension but cannot process data in the time dimension. The RNN [ 9 ] came into being to address this; it consists of neurons and feedback loops. RNN has unique advantages for scenarios where the previous input and the next input have dependencies.
Specifically, the network will remember the previous information and apply it to the current output calculation, that is, the nodes between the hidden layers are connected, and the input of the hidden layer includes not only the output of the input layer, but also the output of the hidden layer at the previous time. RNN can process time series data well, and is widely used in natural language processing, machine translation, speech recognition, image description generation, text similarity calculation and other fields.

This paper will explore the theories, development and disease application cases of these algorithms. Specifically, the contributions and characteristics of this paper are as follows:

According to the type of main processing data, the algorithm is divided into structured data algorithm and unstructured data algorithm.

CNN and RNN papers account for a high proportion of the deep learning field, while papers on structured data processing methods are rare. Therefore, readers can learn about the processing algorithms for structured data in detail through this article.

Different from the summary of classification according to disease types, this paper is classified according to the characteristics of algorithms. For example, in CNN’s disease application section, some paragraphs focus on transfer learning, some paragraphs focus on combinatorial algorithms, and some paragraphs focus on combining attention mechanism.

This paper probes into the problems existing in the current research of disease prediction, such as poor interpretability, unbalanced data, poor data quality and few samples in some cases, and gives the current feasible solutions.

The two major trends in future medical care, integrating Digital Twins and promoting precision medicine, are analyzed, indicating that deep learning disease prediction has a bright future.

This paper will help relevant researchers to understand the characteristics and development trends of related disease prediction algorithms, and ensure that they can purposefully select the most appropriate algorithm in the process of doing research.

Section  2 of this paper will introduce the theories, development and disease application cases of two kinds of structured data algorithms, ANN and FM-Deep Learning. Section  3 will introduce the theories, development and disease application cases of CNN and RNN. Section  4 will respectively introduce the current defects in the field of disease prediction algorithms and the coping strategies. Section  5 analyzes the two major trends of medical treatment in the future, that is, integrating Digital Twins and promoting precise medical treatment. Section  6 summarizes the full text.

2 Structured data algorithms

2.1 Artificial neural network

2.1.1 Theory

An ANN consists of multiple layers, and each layer has one or more artificial neurons. Each neuron receives one or more inputs. First, each input is multiplied by a network weight (network parameter), which is generally randomly initialized. The sum of all weighted inputs plus the bias value of each neuron is computed, and this value is then fed into the activation function (a nonlinear function). The activation function is the core of the NN: it introduces non-linearity into the network and makes it possible for the network to learn more complex functions. The output of the activation function is the output of the neuron, and the output of each layer of neurons is used as the input of the next layer. In the iterative training process, the whole network finds the optimal weight distribution, and the loss function is used to measure whether the network weights are optimal. Figure 1 is a schematic diagram of a three-layer ANN. The whole network has an input layer, hidden layers (generally multiple) and an output layer. In practical applications, the number of layers can reach dozens or even hundreds.

figure 1

Artificial neural network diagram
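The forward computation described above (weighted sum plus bias, then a nonlinear activation, layer by layer) can be sketched in a few lines of NumPy. This is an illustrative toy, not any of the networks in the cited studies; the layer sizes and tanh activation are arbitrary choices.

```python
import numpy as np

def forward(x, layers, activation=np.tanh):
    """Forward pass of a fully connected ANN: each layer multiplies its
    input by a weight matrix, adds the bias vector, and applies the
    nonlinear activation; the result feeds the next layer."""
    out = x
    for W, b in layers:
        out = activation(W @ out + b)
    return out

rng = np.random.default_rng(0)
# A small net (4 inputs -> 5 hidden -> 2 outputs), weights randomly initialized
layers = [(rng.normal(size=(5, 4)), rng.normal(size=5)),
          (rng.normal(size=(2, 5)), rng.normal(size=2))]
y = forward(rng.normal(size=4), layers)
```

During training, the weights in `layers` would be updated by gradient descent on a loss function; only the forward pass is shown here.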

2.1.2 Disease application

Because the structure of the ANN is relatively simple and it does not have the distinctive characteristics of CNN and RNN, there is little research in this area [ 10 , 11 ]. Khanam and Foo [ 12 ] implemented an NN model for diabetes prediction, using 1, 2, and 3 hidden layers in the NN model and varying the epochs among 200, 400, and 800. The model with 2 hidden layers trained for 400 epochs provides 88.6% accuracy, surpassing machine learning models such as Decision Tree, K-Nearest Neighbor (KNN), Random Forest, Logistic Regression, SVM, etc. In 2021, Soundarya et al. [ 13 ] compared an ANN with machine learning models for detecting Alzheimer's Disease (AD) and found that the ANN achieved the highest accuracy with sufficient data. Pasha et al. [ 14 ] used an ANN to improve the prediction accuracy of cardiovascular disease. When dealing with large datasets, traditional machine learning models do not perform well, while an ANN can play to its advantages. These all indicate that the ANN is one of the future trends, and deep learning represented by the ANN will become a mainstream algorithm for disease prediction.

2.2 FM-deep learning

2.2.1 Theory

To capture second-order interactions between features, a second-order cross term is usually added to the linear regression formula:

\(\hat{y}(x) = w_{0} + \sum _{i = 1} ^{n} w_{i} x_{i} + \sum _{i = 1} ^{n-1} \sum _{j = i+1} ^{n} w_{ij} x_{i} x_{j}\)

There are \({\text {n}}({\text {n}}-1)/2\) parameters in the second-order cross part, but learning \(w_{ij}\) requires samples in which the features \(x_{i}\) and \(x_{j}\) are both non-zero, and in sparse data (especially after one-hot encoding) such samples are rare. There are therefore few samples of the corresponding feature interactions in the training set, resulting in inaccurately learned \(w_{ij}\) and over-fitting. In order to solve this problem, FM decomposes \(w_{ij}\) into hidden vectors \(v_{i}\) and \(v_{j}\) , that is, \(w_{ij}=\langle v_{i}, v_{j}\rangle\) , where \(v_{i}\) =( \(v_{i1}\) , \(v_{i2},\ldots ,v_{ik}\) ) (k is a hyper-parameter indicating the length of the hidden vector). The matrix W composed of \(w_{ij}\) can be expressed as \(W = V V^{T}\) , where the i-th row of \(V \in \mathbb {R}^{n \times k}\) is \(v_{i}\) .

Now there are n * k binomial parameters, far fewer than the original \({\text {n}}({\text {n}}-1)/2\) parameters \(w_{ij}\) .

Why do we say that hidden vectors can solve data sparsity? Because all samples containing non-zero feature combinations of \(x_{h}\) can be used to learn \(v_{h}\) . For example, the parameters of \(x_{h} x_{i}\) and \(x_{h} x_{j}\) are \(\langle v_{h}, v_{i}\rangle\) and \(\langle v_{h}, v_{j}\rangle\) , respectively. They have a common item \(v_{h}\) , so the value of \(v_{h}\) can be estimated reasonably. This can greatly reduce the impact of data sparsity.

The implicit vector mechanism can also increase the generalization of the model. According to the principle by which FM solves sparsity, when FM learns the embedded hidden-vector weight of a single feature, it does not depend on whether a specific feature combination has occurred. For a feature combination \(x_{i} x_{j}\) that has never appeared before, as long as FM has learned the hidden vectors corresponding to \(x_{i}\) and \(x_{j}\) , the weight of this feature combination can be calculated through the inner product, so FM has strong generalization ability. The formula of FM is as follows [ 15 ]:

\(\hat{y}(x) = w_{0} + \sum _{i = 1} ^{n} w_{i} x_{i} + \sum _{i = 1} ^{n-1} \sum _{j = i+1} ^{n} \langle v_{i}, v_{j}\rangle x_{i} x_{j}\)

It can be seen that the complexity of FM is O( \(n^{2}k)\) , and its complexity can be reduced to O(n * k) by rewriting the pairwise term:

\(\sum _{i = 1} ^{n-1} \sum _{j = i+1} ^{n} \langle v_{i}, v_{j}\rangle x_{i} x_{j} = \frac{1}{2} \sum _{f = 1} ^{k} \left( \left( \sum _{i = 1} ^{n} v_{if} x_{i}\right) ^{2} - \sum _{i = 1} ^{n} v_{if} ^{2} x_{i} ^{2} \right)\)

The final FM equation is:

\(\hat{y}(x) = w_{0} + \sum _{i = 1} ^{n} w_{i} x_{i} + \frac{1}{2} \sum _{f = 1} ^{k} \left( \left( \sum _{i = 1} ^{n} v_{if} x_{i}\right) ^{2} - \sum _{i = 1} ^{n} v_{if} ^{2} x_{i} ^{2} \right)\)

In fact, the essence of FM is embedding plus interaction: by assigning each feature \(x_{i}\) (discrete features are one-hot encoded first) an implicit vector \(v_{i}=(v_{i1}, v_{i2}, v_{i3}, v_{i4})\) (assuming here k = 4), the original high-dimensional data are changed into a low-dimensional dense vector e through the embedding layer, that is, \(x_{i}\) is multiplied by the corresponding hidden vector \(v_{i}\) to obtain \(e_{i}\) , as shown in Fig. 2 .

figure 2

Embedding of feature \(x_{i}\)

The entire Embedding layer is shown in Fig. 3 :

figure 3

Embedding layer of FM

In summary, the overall structure of FM can be drawn, as shown in Fig. 4 , where \(y_{Linear} = w_{0} + \sum _{i = 1} ^{n} w_{i} x_{i}\) , \(y_{FM2} = \frac{1}{2} \sum _{f = 1} ^{k} \left( \left( \sum _{i = 1} ^{n} v_{if} x_{i}\right) ^{2} - \sum _{i = 1} ^{n} v_{if} ^{2} x_{i} ^{2} \right)\) .

figure 4

Overall structure diagram of FM
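The equivalence between the naive pairwise form and the reduced O(n·k) form of \(y_{FM2}\) can be checked numerically. The following NumPy sketch (toy data, illustrative only) computes the FM output both ways and confirms they match:

```python
import numpy as np

def fm_naive(x, w0, w, V):
    """FM with explicit pairwise terms over i < j: O(n^2 * k)."""
    n = len(x)
    pair = sum(V[i] @ V[j] * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))
    return w0 + w @ x + pair

def fm_fast(x, w0, w, V):
    """Equivalent O(n*k) form:
    0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]."""
    s = V.T @ x                 # per factor f: sum_i v_if * x_i
    s2 = (V ** 2).T @ (x ** 2)  # per factor f: sum_i v_if^2 * x_i^2
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

rng = np.random.default_rng(1)
n, k = 8, 4
x = rng.normal(size=n)
w0, w, V = 0.3, rng.normal(size=n), rng.normal(size=(n, k))
assert np.isclose(fm_naive(x, w0, w, V), fm_fast(x, w0, w, V))
```

The reduced form is why FM remains practical on high-dimensional sparse inputs: the cost grows linearly in the number of features.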

2.2.2 Development history

In 2016, Zhang et al. [ 16 ] proposed the FM-Supported NN (FNN). The model uses a DNN with embedding layers to complete CTR prediction, obtaining the dense vector of each feature by pre-training an FM model. All embedding vectors of a sample are then concatenated and input to the DNN for training. The feature of FNN is that the embedding vector of each feature is trained by the FM model in advance; therefore, when training the DNN model, the overhead is reduced and the model converges faster. However, the performance of the whole network is limited by the performance of FM. In the same year, Qu et al. [ 17 ] introduced a product layer between the embedding layer and the fully connected layer to propose the Product-based Neural Network (PNN). PNN finds the relationship between features through inner or outer products between features, but it lacks low-order feature interaction, so it may ignore the valuable information contained in the original vectors. He et al. studied the recommendation problem in the case of sparse input data and proposed Neural FM (NFM) [ 18 ]. NFM adopts a framework similar to Wide&Deep [ 19 ], and it uses a Bi-Interaction Layer (bi-linear interaction) structure to process the second-order cross information, so that the information of the cross features can be better learned by the DNN structure, reducing the difficulty of the DNN learning higher-order cross-feature information. In order to learn low-order feature interactions as well, Guo et al. [ 20 ] proposed DeepFM, which combines Deep and FM in parallel: FM handles low-order feature interactions and the DNN handles high-order feature interactions, and both parts share the same input. The final first-order features and the second-order and higher-order feature interactions are simultaneously input to the output layer, and the whole process requires no pre-training and no feature engineering. He et al. proposed Attention FM (AFM) [ 21 ] by extending NFM.
They introduced the attention mechanism into the bi-linear interaction pooling operation, which further improved the representation ability and interpretability of NFM. AFM only adds an attention mechanism on the basis of FM, and the quadratic term does not enter a deeper network, so AFM does not take advantage of the DNN. Zhang et al. [ 22 ] combined DeepFM and AFM and proposed Deep AFM (DeepAFM), which combines AFM and deep learning in a new NN structure for learning. Compared with existing deep learning models, this method can effectively learn the weighted interactions between features without feature engineering by introducing the feature-domain structure. There are also many explorations of the attention mechanism. Zhang et al. [ 23 ] proposed a new model, FAT-DeepFFM, which dynamically captures the importance of each feature before the explicit feature interaction process by introducing CENet domain attention, thus enhancing DeepFFM. Tao et al. [ 24 ] proposed Higher-order AFM (HoAFM): by explicitly considering the interaction of high-order sparse features, they designed a cross-interaction layer, updated the representation of features by aggregating the representations of other co-occurring features, and implemented a bit-wise attention mechanism to determine the different importance of co-occurring features at dimension granularity. Yu et al. [ 25 ] proposed Gated AFM (GAFM) based on the dual factors of accuracy and speed, using a gate structure to control speed and accuracy. Wen et al. [ 26 ] proposed the Neural Attention Model (NAM), which deepens the FM by adding fully connected layers. Through the attention mechanism, NAM can learn the different importance of low-order feature interactions. By adding fully connected layers on top of the attention component, NAM can model higher-order feature interactions in a non-linear fashion. In 2019, Yang and colleagues [ 27 ] proposed the Empirical Mode Decomposition and FM based NN (EMD2FNN).
Empirical mode decomposition helps to overcome the non-stationarity of data, and the FM helps to capture the nonlinear interactions between inputs. Zhang et al. [ 28 ] proposed High-order Cross-Factor FM (HCFM). They designed the Cross-Weight Network (CWN) to achieve high-order explicit interactions. The cross and compression layers of CWN are designed to effectively learn important feature combinations, and the weight-pooling layer aims to learn the weights of different interaction orders to balance the different weights between high-order and low-order feature interactions. Lu et al. [ 29 ] proposed Dual-Input FMs (DIFM), which can efficiently and adaptively learn different representations of a given feature according to different input instances, and can efficiently learn input-aware factors simultaneously at the bit-wise and vector levels (used for re-weighting the original feature representation). DIFM strategically integrates various components, including multi-head self-attention, residual networks, and DNNs, into a unified end-to-end model. Deng et al. [ 30 ] proposed a new Deep Field-weighted FM (DeepFwFM), which combines FwFM components and ordinary DNN components and shows unique advantages in structure pruning; using this combination can greatly reduce inference time. Yu et al. [ 31 ] proposed Neural Pairwise Ranking FM (NPRFM), which integrates a multilayer perceptron NN into the Pairwise Ranking Factorization Machine model. Specifically, to capture higher-order and nonlinear interactions between features, a multilayer perceptron is superimposed on a bi-interaction layer to encode the second-order interactions between features. Pande [ 32 ] proposed Field Embedding FM (FEFM) and Deep FEFM (DeepFEFM). FEFM learns a symmetric matrix embedding for each field pair and a single vector embedding for each feature. DeepFEFM combines the FEFM interaction vectors learned by FEFM components with a DNN to learn high-order feature interactions.
Qi and Li [ 33 ] proposed Deep Field-Aware Interaction Machine (DeepFIM) to solve the “short expression” problem and better capture multi-density feature interactions. They proposed a new feature interaction expression based on field identifier, namely “hierarchy expression”. On this basis, they designed a cross interaction layer to identify field and field interaction, and used attention mechanism to distinguish the importance of different features. A dynamic bi pool layer is introduced to enhance the acquisition of high-order features.

There is also a combination of FM and CNN. Zhang et al. [ 34 ] proposed Deep Generalized Field-aware FM (DGFFM), which uses a wide-deep framework to jointly train Generalized Field-aware FM (GFFM) and DenseNet. It aims to combine the advantages of traditional machine learning methods, including their faster learning speed for low-rank features and the ability to extract high-dimensional features, where GFFM can significantly reduce computation time by exploiting the corresponding positional relationship between field indices and feature indices. Chanaa and El Faddouli [ 35 ] proposed Latent Graph Predictor FM (LGPFM), which utilizes CNN to capture interaction weights for each pair of features. LGPFM combines the advantages of FM and CNN, and CNN can work efficiently in the grid topology, which will significantly improve the accuracy of the results.

Metric learning can also be combined with the FM algorithm. Guo et al. [ 36 ] proposed an FM framework based on generalized metric learning technology. The metric method based on Mahalanobis distance uses semi positive definite matrix to project features into a new space, so that the features obey certain linear constraints. The distance function based on DNN is designed to capture the nonlinear feature correlation, which can benefit from the strong representation ability of metric learning method and NN. At the same time, a learnable weight is introduced for the interaction of each attribute pair, which can greatly improve the performance of the distance function.
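As a rough illustration of how these hybrids wire an FM component and a deep component together, here is a minimal DeepFM-style forward pass in NumPy. It is a toy sketch under simplifying assumptions (a summed scalar output head, untrained random weights), not the exact architecture of any cited model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deepfm_forward(x, w0, w, V, mlp):
    """DeepFM-style prediction: the FM part models first- and
    second-order interactions, the DNN part models higher-order
    interactions, and both share the same embedding matrix V."""
    # FM component in its O(n*k) form
    s = V.T @ x
    s2 = (V ** 2).T @ (x ** 2)
    y_fm = w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)
    # Deep component: feature embeddings e_i = x_i * v_i,
    # concatenated and fed through ReLU layers
    h = (V * x[:, None]).ravel()
    for W, b in mlp:
        h = np.maximum(0.0, W @ h + b)
    return sigmoid(y_fm + h.sum())  # joint output, trained end-to-end

rng = np.random.default_rng(2)
n, k = 6, 3
x = rng.normal(size=n)
mlp = [(rng.normal(size=(8, n * k)), np.zeros(8)),
       (rng.normal(size=(4, 8)), np.zeros(4))]
p = deepfm_forward(x, 0.0, rng.normal(size=n), rng.normal(size=(n, k)), mlp)
```

Because the two components share the embeddings and are summed before the sigmoid, the whole model can be trained end-to-end without pre-training, which is the key point of the DeepFM family.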

2.2.3 Disease application

Chen and Qian [ 37 ] combined an NN and FM for the diagnosis of children's sepsis: the NN can better process the numeric test-index values of patients, and FM can better process the sparse test-index state data of patients. Ronge et al. [ 38 ] developed a deep FM model for AD diagnosis, which consists of three parts: an embedding layer that handles sparse categorical data, a Factorization Machine that efficiently learns pairwise interactions, and a DNN that implicitly models higher-order interactions. The above are simple combinations of NN and FM, without the better-performing FM-Deep Learning algorithms mentioned in Section 2.2.2. Fan et al. [ 39 ], in contrast, applied DeepFM to predict the recurrence of Cushing's disease after transsphenoidal surgery; predicting the recurrence of 354 patients with initial postoperative remission at Peking Union Medical College Hospital, they obtained the highest AUC value (0.869) and the lowest logistic loss value (0.256), exceeding other models.

3 Unstructured data algorithms

3.1 Convolutional neural network

3.1.1 Theory

CNN is particularly suitable for learning image features. Before CNN was proposed, fully connected networks were generally used to extract image features, but a fully connected network often has a particularly large number of connections, which leads to an explosive increase in the number of parameters and in training time. It can be noted that it is not necessary for each neuron to perceive the entire image: the image has a strong 2D local structure, that is, spatially adjacent variables (or pixels) are highly correlated. So the concept of CNN was put forward, which combines three ideas: local receptive fields, shared weights and down-sampling. The size of the convolution kernel is called the receptive field. The convolution kernel slides over the image and extracts the features of the area it covers, which achieves the purpose of extracting local features, such as visual features like edges and corners. Because each region of the image is scanned by a convolution kernel with the same weights, weight sharing is realized and the number of parameters is greatly reduced. Therefore, the convolutional layer of a CNN can extract local features well and reduce the number of parameters.
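The sliding multiply-and-sum just described can be sketched directly (valid padding, stride 1; the image and kernel are toy examples):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`; at each position, multiply the
    covered pixels element-wise by the kernel weights and sum them
    (valid padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector: responds where intensity changes left-to-right
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)
out = conv2d(image, kernel)  # each row is [0., 2., 0.]: the kernel fires on the edge
```

The same small set of kernel weights is reused at every position, which is exactly the weight sharing that keeps the parameter count low.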

CNN also includes batch-normalization layers, activation layers, and pooling layers. The batch-normalization layer standardizes the mini-batch data so that it conforms to the standard normal distribution, and performs scaling and shifting operations, which effectively avoids vanishing gradients, speeds up gradient descent and accelerates convergence. The activation layer non-linearly processes the input through the activation function, which enables the whole NN to fit any function. The formula is as follows:

\(y = a(wx + b)\)

Here a is the activation function, x is the input, and both w and b are weight parameters.

Figure 5 is a simple schematic diagram of CNN.

figure 5

Convolutional neural network diagram

3.1.2 Development history

In 1989, LeCun et al. [ 40 ] designed a CNN with two convolutional layers (with a convolution kernel size of \(5\times 5\) ), trained on the handwritten zip code dataset of the United States Post Office, and the generalization performance of the model reached the best level at the time. This network is actually the prototype of LeNet, but the whole network only has convolutional layers and fully connected layers. In 1998, LeCun et al. [ 41 ] formally put forward LeNet-5, which includes convolutional layers, pooling layers and fully connected layers, seven layers in total. The convolutional layers use \(5\times 5\) convolution kernels, and the activation function is the sigmoid. LeNet-5 has a total of 340,908 connections, but the number of trainable parameters is reduced to 60,000 due to weight sharing. After LeNet-5 was proposed, research on CNN in speech recognition, object detection, face recognition and other application fields was gradually carried out. After 2012, the CNN entered the stage of large-scale application and in-depth research, marked by Krizhevsky et al. [ 42 ] proposing AlexNet-8, whose ImageNet Top-5 error rate reached 15.3% in the 2012 ILSVRC competition. AlexNet-8 consists of five convolutional layers, which use all-zero padding and ReLU as the activation function. Some convolutional layers are followed by a maximum pooling layer, which can better extract feature textures. AlexNet also uses Dropout to prevent over-fitting. Simonyan and Zisserman [ 43 ] proposed VGGNet-16 and VGGNet-19, which use a small convolution kernel ( \(3\times 3\) receptive field), improving the recognition accuracy while reducing parameters. VGGNet also adds a batch-normalization layer to speed up the training process, and its number of layers exceeds that of previous networks, reaching 16–19 layers, which can better learn sample features. The entire network structure is regular and suitable for parallel acceleration.
In the 2014 ILSVRC competition, VGGNet reduced the ImageNet Top-5 error rate to 7.3%. In the same year, InceptionNet, that is, GoogleNet, was proposed [ 44 ], with a depth of 22 layers, using convolution kernels of different sizes in one layer to improve the perception of the model. InceptionNet uses \(1\times 1\) convolution kernels to change the number of channels of the output feature maps (which can reduce network parameters). Its ImageNet Top-5 error rate was reduced to 6.7%.

Although increasing depth is the development trend of CNN, the gradient will vanish as the number of layers grows beyond a certain extent. At that point, the accuracy of the deep learning model reaches saturation, and then the training error and test error increase significantly, resulting in the inability of the model to converge. So in 2015, the Kaiming He team [ 45 ] proposed the Residual NN (ResNet), whose layers are connected by residual skip connections: several identity mapping layers (input equal to output) are added after some layers. In this way, forward information can be introduced directly, which suppresses the vanishing gradient and enables the number of layers of the NN to exceed previous constraints, reaching hundreds of layers and improving accuracy. The ResNet evaluated on the ImageNet dataset is 152 layers deep, 8 times deeper than VGGNet, but still less complex. In addition, the model also uses a global pooling layer to replace the fully connected layers, which also reduces parameters.
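The identity-mapping idea can be sketched as follows. This is an illustrative toy with dense layers, not the actual ResNet block (which uses convolutions and batch normalization):

```python
import numpy as np

def residual_block(x, layers, activation=np.tanh):
    """ResNet-style skip connection: output = x + F(x).
    The identity path lets information (and gradients) flow directly
    past the weighted layers, mitigating vanishing gradients in
    very deep networks."""
    out = x
    for W, b in layers:
        out = activation(W @ out + b)
    return x + out  # skip connection adds the input back

# With all-zero weights F(x) = 0, the block reduces to the identity mapping,
# which is why stacking many such blocks cannot make the network worse.
x = np.array([1.0, -2.0, 3.0])
zero_layers = [(np.zeros((3, 3)), np.zeros(3))]
```

Calling `residual_block(x, zero_layers)` returns `x` unchanged, illustrating the "input equal to output" identity layers described above.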

3.1.3 Disease application

Acharya et al. [ 46 ] were the first to use CNN for Electroencephalogram (EEG) signal analysis. In this work, the authors implement a 13-layer CNN to detect normal, preictal and epileptic seizure categories without separate feature extraction and feature selection steps. Muhammad et al. [ 47 ] proposed a CNN-based fusion model for EEG pathology detection. Hossain et al. [ 48 ] use deep learning techniques for epilepsy seizure detection. Chanu and Thongam [ 49 ] proposed a computer-aided 2D convolutional neural network classification technique to classify MR images into two categories: normal and tumor. This method is suitable for inclusion in clinical decision support systems for the initial diagnosis of brain tumors by clinical experts. In 2022, Seven et al. [ 50 ] used deep learning on Endoscopic Ultrasonography (EUS) images to predict the malignant potential of gastrointestinal stromal tumors. First, the EUS images are resized to \(28 \times 28 \times 1\) format through Lanczos interpolation. The deep learning part uses 20 CNN kernels for the first layer and 50 for the second layer. After each kernel layer, the image resolution is halved. After these convolutional processes, the feature image information is put into an ANN model to train the AI system. The results show that deep learning AI based on EUS images can predict the malignant potential of gastric stromal tumors with high accuracy. Yin [ 51 ] constructed two 50-layer ResNets based on different building blocks to classify skin lesion images. Although these studies have no major innovations, they exploit the unique image feature extraction ability of CNN and achieve good results. Rahman et al. use CNN with relevant adversarial examples (AEs) for COVID-19 diagnosis [ 52 ].

Transfer learning refers to transferring the parameters of a trained model (the pre-trained model) to a new model to help train the new model. Because transfer learning can ensure that the model has a higher starting point (before fine-tuning, the initial performance of the model is higher), a higher slope (during training, the model improves faster) and a higher asymptote (the model converges better after training), it often plays a role in the field of disease prediction in combination with CNN. In 2019, Amin et al. [ 53 ] proposed a new method to classify tumor/non-tumor Magnetic Resonance Images (MRI), where the segmented images are fed to a pre-trained CNN model in which feature learning is performed by AlexNet and GoogleNet. Fully connected layers are used for feature mapping, and score vectors are obtained from each trained model. In addition, the score vector is provided to the softmax layer and multiple classifiers. In 2020, Wang et al. [ 54 ] proposed two CNN models, which can automatically distinguish benign and malignant masses, lipomas, benign schwannomas and vascular malformations by learning image features. The authors chose the VGGNet-16 architecture pre-trained on the ImageNet dataset to build the two CNN models, so as to improve performance by using transfer learning and a DNN architecture. Chelghoum et al. [ 55 ] used nine pre-trained deep networks, including AlexNet, GoogleNet, VGG-16, VGG-19, ResNet-18, ResNet-50, ResNet-101, ResNet-Inception-V2, and SENet, to solve the problem of brain tumor classification by using the transfer learning method. The results show that when the number of training samples is small and the number of iterations is small, the performance of the model is still good and the time consumption can be reduced. Similar to the research of Chelghoum et al., Kaur and Gandhi [ 56 ] also explored different pre-trained classical CNN models to explore the transfer learning ability in pathological brain image classification.
The authors use various pre-trained DCNNs, namely AlexNet, ResNet-50, GoogleNet, VGGNet-16, ResNet-101, VGGNet-19, Inception V3 and Inception ResNet V2, replacing the last layers of each model to adapt it to the training set. Compared with the other models, AlexNet shows the best performance in a shorter time. Rehman et al. [ 57 ] also addressed brain tumors, combining traditional machine learning models with three classical CNNs (AlexNet, GoogleNet and VGGNet) to classify brain tumors such as meningioma, glioma and pituitary tumor. The authors took these three CNNs as pre-trained models, froze different layers of each, and finally used an SVM for classification. The results show that the fine-tuned VGGNet-16 architecture achieves the highest accuracy in classification and detection, reaching 98.69%. Kumar and Nandhini [ 58 ] adopted an entropy-based image slicing method to select the most informative MRI slices during the training phase. Transfer learning was performed on the ADNI dataset, and the VGGNet-16 network was used to distinguish AD patients from normal individuals. By introducing the MRI slice selection method, the model effectively reduces preprocessing complexity, and the VGG-16 transfer learning technique mitigates the unreliability problem. Extracting the parameters of a pre-trained model for further processing is another transfer learning method. Tsai and Tao [ 59 ] trained a deep convolutional NN model and extracted the modified parameters of its network layers to identify the many different tissue types in histological images of colorectal cancer. Eweje et al. [ 60 ] utilized a deep learning approach combining conventional MRI images and clinical features to develop a model that classifies the malignancy of bone lesions. The method consists of three parts: (1) Imaging data model: an image classification model built on the EfficientNet deep learning architecture.
EfficientNet models initialized with weights pre-trained on the ImageNet database extract features from the imaging data. (2) Clinical data model: a logistic regression model using clinical variables; the inputs are patient age, gender, and lesion location. The 21 locations (clavicle, skull, proximal femur, distal femur, foot, proximal radius, distal radius, proximal ulna, distal ulna, hand, hip, proximal humerus, distal humerus, proximal tibia, distal tibia, proximal fibula, distal fibula, mandible, rib/chest wall, scapula, or spine) are one-hot encoded, so the model receives 23 input variables in total. (3) Ensemble model: (1) and (2) are combined using a stacking ensemble approach, where the voting ensemble receives as input the malignancy probabilities from the imaging and clinical feature models and creates an output based on the sum of the predicted probabilities.
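The ensemble step of Eweje et al., which outputs a decision based on the sum of the two models' predicted probabilities, can be sketched as follows. The two component models are stand-ins here (the source gives only their inputs, not their internals), and the decision threshold is an assumed choice:

```python
def ensemble_predict(p_imaging, p_clinical, threshold=1.0):
    """Combine the malignancy probabilities from the imaging model and
    the clinical model by summing them, then threshold the sum.
    The threshold value is illustrative: the paper states only that
    the output is based on the sum of the predicted probabilities."""
    score = p_imaging + p_clinical
    label = "malignant" if score >= threshold else "benign"
    return label, score

# Both models lean toward malignancy, so the summed score crosses the
# threshold even though neither probability alone exceeds it strongly.
label, score = ensemble_predict(0.8, 0.4)
```

Summing (rather than averaging or voting on hard labels) lets a confident prediction from one model compensate for an uncertain one from the other.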

As noted above, Rehman et al. combined AlexNet, GoogleNet, and VGGNet with traditional machine learning models and achieved good results; combining two different deep learning models can do even better. In 2021, Kokkalla et al. [ 61 ] proposed a deep dense inception residual network model for three-class brain tumor classification, which customizes the output layer of Inception ResNet V2 with fully connected networks and a softmax layer. In the same year, Ning et al. [ 62 ] proposed an automatic Congestive Heart Failure (CHF) detection model based on a hybrid deep learning algorithm combining a CNN and a recurrent neural network. Normal sinus rhythm signals and CHF signals were classified according to the ECG and its time spectrum. The authors extract features from the ECG signal, mainly the RR-interval sequence, calculate the time spectrum of the ECG signal, and use the CNN to automatically identify the spectrum and related features crossed with the time domain. Srinivasu et al. [ 63 ] introduced MobileNet V2 with LSTM components to accurately classify skin diseases from images captured on mobile devices. MobileNet V2 classifies the skin disease type, and the LSTM enhances model performance by maintaining state information about features encountered in previous image classifications.

The attention mechanism can assign different weights to input features, so that the model focuses on the more important features and information. Some scholars therefore combine the attention mechanism with CNNs for disease prediction. Toğaçar et al. [ 64 ] proposed BrainMRNet, a deep learning model for brain cancer detection. BrainMRNet is a feedforward end-to-end convolutional model comprising the hypercolumn technique, attention modules and residual blocks. With the hypercolumn technique, the features extracted by the convolutional layers for each pixel of the input image are combined into a hypervector, and the most effective features in the vector are selected and passed to the next layer. Through the attention modules, BrainMRNet concentrates on the important areas of the input data while ignoring unnecessary areas, which increases its validation success rate. The whole model is built from residual blocks, which improve performance by updating the weight parameters during back-propagation. Metric learning, also called similarity learning, classifies by comparing the similarity between samples, and some scholars combine it with CNNs. Jiao et al. [ 65 ] adopted deep distance metric learning for breast mass classification. The model contains convolutional layers and metric layers. First, the CNN is trained and fine-tuned; its structure provides a good deep feature extraction network and a baseline for breast mass classification. Then the large-margin metric learning method with hinge loss is used to initialize the metric layers, which are trained to make the characteristics of different breast masses more separable. The metric layers benefit from the representative features of the convolutional layers, and the data flow between them is limited to one-way transmission.
The relationship between the two layers is thus similar to a parasitic relationship in biology/ecology, so the proposed method is called a parasitic metric learning network.

Shallow CNNs can reduce spatial and temporal constraints. Tripathi and Singh [ 66 ] proposed OLConvNet, a hybrid, flexible deep learning architecture that combines the interpretability of traditional object-level features with the depth of a shallower CNN named CNN3L, which extracts deep learning features from the original input image. The two sets of features are then fused to generate the final feature set, which a multilayer perceptron takes as input to classify the histopathological nuclei into one of four categories.

Although CNNs are mainly used in the image field, some scholars also apply them to structured medical record data and speech data. In 2016, Cheng et al. [ 67 ] proposed a deep learning method for phenotyping from patients' Electronic Health Records (EHR). First, the EHR of each patient is represented as a temporal matrix, with time on one dimension and events on the other. A four-layer convolutional NN model is then built for phenotype extraction and prediction. The first layer is composed of these EHR matrices. The second layer is a one-sided convolutional layer from which phenotypes can be extracted. The third layer is a max-pooling layer that introduces sparsity into the detected phenotypes, so that only the significant phenotypes are retained. The fourth layer is a fully connected softmax prediction layer. To exploit the temporal smoothness of patients' EHRs, the authors also studied three different temporal fusion mechanisms in the model: early fusion, late fusion and slow fusion.
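The first step of Cheng et al., representing each patient's EHR as a time-by-event matrix, can be sketched as follows. The event vocabulary, records, and binary encoding are illustrative; the original work's exact encoding may differ:

```python
def ehr_matrix(records, events, n_periods):
    """Build a binary matrix with one row per event type and one column
    per time period; cell [e][t] is 1 if event e occurred in period t.
    This matrix is what the convolutional layers would consume."""
    index = {e: i for i, e in enumerate(events)}
    m = [[0] * n_periods for _ in events]
    for event, period in records:
        m[index[event]][period] = 1
    return m

# Hypothetical event codes and one patient's timeline over 4 periods.
events = ["hypertension_dx", "statin_rx", "hba1c_test"]
records = [("hypertension_dx", 0), ("statin_rx", 1),
           ("hba1c_test", 1), ("hba1c_test", 3)]
matrix = ehr_matrix(records, events, n_periods=4)
```

A convolution applied along the time axis of such a matrix detects recurring temporal patterns of events, which is what the paper calls phenotypes.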

In 2019, Gunduz [ 68 ] proposed two CNN-based frameworks to classify Parkinson's Disease (PD) using vocal (speech) feature sets. Both frameworks combine various feature sets, but they differ in how the sets are combined. The first framework concatenates the different feature sets and feeds them to a 9-layer CNN, while the second passes each feature set to a parallel convolutional layer, allowing deep features to be learned from each feature set separately. The extracted deep features not only successfully distinguish PD patients from healthy people, but also effectively enhance the discriminative ability of the classifier.

In 2020, Sajja and Kalluri [ 69 ] proposed a CNN to predict whether a patient has heart disease, another application of CNNs to structured data. The architecture consists of two convolutional layers, two dropout layers, and an output layer. The model predicts disease with 94.78% accuracy on the UCI-ML Cleveland dataset, outperforming logistic regression, KNN, Naive Bayes, SVMs, and NNs.

3.2 Recurrent neural network

3.2.1 Theory and development

RNNs [ 70 ] are used for pattern recognition of streaming or sequential data such as speech, handwriting and text. The hidden layer of an RNN contains cyclic connections, over which the network performs recurrent computation to process the input data in sequence. Each previous input is stored in a state vector in the hidden units, and these state vectors are used to compute the output. In short, an RNN computes a new output from the current input together with the previous inputs. Although RNNs perform well, during back-propagation the gradient used to adjust the weight matrix is a product of many partial derivatives, so it can become vanishingly small or explosively large, which makes it difficult for RNNs to learn long-range information. To solve this problem, the long short-term memory (LSTM) network [ 71 ] was proposed; it can store sequence information over long spans and alleviates the vanishing-gradient problem. As shown in the upper part of Fig. 6 , the LSTM uses a gating mechanism, introducing an input gate, a forget gate and an output gate. When a gate is closed, it blocks changes to the current information, so that earlier dependency information is preserved and learned; when a gate is open, it does not completely replace the previous information, but takes a weighted average of the previous and current information. Therefore, no matter how deep the network is or how long the input sequence is, as long as the gates are open the network can remember the input information. The input gate controls how much of the current word's information is integrated into the cell state; the current cell state integrates the information of the current word with the cell state of the previous moment, and represents long-term memory.
The input gate thus determines how much information about the current word is stored in the current cell state. The forget gate controls how much of the previous cell state is carried into the current cell state: when understanding a sentence, the current word may continue the meaning of the preceding text, or may begin to describe new content unrelated to it, so a corresponding forgetting operation is needed. The forget gate is responsible for selectively forgetting cell-state information, and the output gate for selectively outputting it. The Gated Recurrent Unit (GRU) [ 72 ] is a simplified version of the LSTM. As shown in the lower part of Fig. 6 , the GRU replaces the original three gates with two: an update gate and a reset gate. The reset gate controls the influence of the hidden state at the previous moment (representing past information) on the current word. The update gate merges the LSTM's forget and input gates and is responsible for weighting the importance of past and present information. The GRU structure is therefore simpler and requires fewer matrix operations, so a GRU can save training time compared with an LSTM when the training data are large.
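The gating arithmetic described above can be written out explicitly for a single scalar GRU unit. This is a didactic sketch with hand-picked weights and no biases; real implementations are vectorized and learned:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w):
    """One GRU step for scalar input x and state h_prev. w holds the six
    weights (input-to-gate and recurrent) for the update gate z, the
    reset gate r, and the candidate state; biases are omitted."""
    z = sigmoid(w["zx"] * x + w["zh"] * h_prev)          # update gate
    r = sigmoid(w["rx"] * x + w["rh"] * h_prev)          # reset gate
    h_cand = math.tanh(w["cx"] * x + w["ch"] * (r * h_prev))
    # Weighted average of old state and candidate, exactly the
    # "merged forget/input gate" behaviour described in the text.
    return (1 - z) * h_prev + z * h_cand

w = {"zx": 1.0, "zh": 0.5, "rx": 1.0, "rh": 0.5, "cx": 1.0, "ch": 1.0}
h = 0.0
for x in [1.0, -1.0, 0.5]:   # a toy input sequence
    h = gru_step(x, h, w)
```

Because the new state is always a convex combination of the old state and a tanh-bounded candidate, the state stays in (-1, 1) regardless of sequence length, which is part of why gated units resist vanishing and exploding gradients.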

figure 6

LSTM and GRU structure diagram. Upper: LSTM; lower: GRU

3.2.2 Disease application

RNNs with LSTM hidden units, pooling, and word embeddings are used in DeepCare [ 73 ], an end-to-end deep dynamic network that infers current disease states and predicts future medical outcomes; the authors also condition the LSTM cell with a decay effect to handle irregularly timed events. In 2018, Chu et al. [ 74 ] proposed a new context-aware attention mechanism for detecting Adverse Medical Events (AME) of cardiovascular diseases, which learns the local context information of words in medical texts. The attention mechanism lets keywords related to the target AME receive stronger attention signals, driving the model to locate salient parts of the medical text. The proposed neural attention network is combined with a standard Bi-LSTM model to detect AMEs in large volumes of EHR data. Combining the global, order-dependent word signals captured by the standard Bi-LSTM with the local context signals captured by the attention mechanism significantly improves AME detection performance in medical texts.

Some scholars use LSTMs for Electrocardiogram (ECG) signal processing. In 2018, Tran et al. [ 75 ] proposed a feature-extraction-based method to process ECG signals from Internet of Things (IoT) devices, employing an Auto-Encoder (AE) model combined with an LSTM to reduce data dimensionality and extract the top ECG features. Finally, fully connected layers are used to distinguish normal from abnormal ECGs.

Medical record data with temporal characteristics (i.e., serialized data) can also be analyzed with LSTMs. In 2018, Reddy and Delen [ 76 ] used an RNN-LSTM method to predict the probability of readmission of lupus patients within 30 days by extracting temporal relationships from longitudinal EHR clinical data. The RNN-LSTM method exploits the relationship between patients' disease states and time, giving the model higher performance. In 2019, Wang et al. [ 77 ] used LSTMs to predict 6-month, 1-year and 2-year mortality in dementia patients. Their deep learning model consists of two stacked LSTM layers and two attention layers: one between the input layer and the LSTM layers, and another between the LSTM layers and the output layer. The stacked LSTM layers support hierarchical abstraction of the input data, while the attention layers improve model performance and track the importance of the temporal inputs as the model makes predictions.

There are also several application cases of GRUs. In 2017, Choi et al. [ 78 ] used a GRU for heart failure diagnosis. Compared with popular methods such as logistic regression, the Multi-Layer Perceptron (MLP), SVM and KNN, the GRU performed well. The results show that a deep learning model suited to exploiting temporal relationships improves performance in detecting incident heart failure within a short observation window of 12–18 months. Choi et al. [ 79 ] also used an RNN with GRUs to develop Doctor AI, an end-to-end model that uses patient history to predict subsequent diagnoses and medications.

Some scholars have noted that RNNs are lighter than CNNs and can also be used for image processing. In 2020, Amin et al. [ 80 ] proposed an automatic LSTM-based method for classifying brain tumors in MRI. First, N4ITK of size 595 and a Gaussian filter are used to improve the quality of the multi-sequence MRI. Classification is performed with the proposed four-layer deep LSTM model, in which 200, 225, 200 and 225 are selected as the optimal numbers of hidden units for the respective layers. This lightweight four-layer LSTM model achieves better results in temporal data processing, which is conducive to learning from multi-sequence MRI.

4 Existing defects and solutions

Here we list several problems in current disease-prediction research that affect the diagnostic accuracy of prediction algorithms: poor interpretability, data imbalance, data quality issues, and too little data. Poor interpretability concerns the deep learning algorithms themselves: it lowers the reliability of deep learning disease prediction and hinders doctors in analyzing pathological causes. The remaining three problems concern the data. Data imbalance can cause a classifier to lose its ability to discriminate between classes; poor-quality datasets lower the performance ceiling of deep learning algorithms on specific problems; and too little data leads to over-fitting, seriously degrading algorithm quality. Besides enumerating these problems, this section also presents the current corresponding solutions.

4.1 Poor interpretability

Traditional statistical methods are usually based on manual feature engineering grounded in medical domain knowledge. Because these methods are closely tied to medical knowledge, they give doctors reliable interpretability, even though their performance is not outstanding. Deep learning algorithms, by contrast, are data-driven black boxes: we cannot see the feature extraction and screening process. Therefore, although deep learning improves the feature extraction and classification ability of the model, its interpretability is very poor, which easily leads to unreliable results and brings risk. Only by solving the interpretability problem can deep learning be applied more widely to actual disease prediction, better serve doctors and patients, and give them confidence in the model's diagnostic results.

A general solution is to add an attention mechanism, which suits both structured and unstructured data. The attention mechanism was first applied in natural language processing, where it helps find the relationships between words in sentences and better predict the next word. AFM and DeepAFM are applications of the attention mechanism in the FM algorithm. Woo et al. [ 81 ] proposed the Convolutional Block Attention Module (CBAM) in 2018. Given an intermediate feature map, the CBAM module infers attention maps along two independent dimensions (channel and spatial), and then multiplies the attention maps with the input feature map. CBAM is a lightweight general-purpose module that can be seamlessly integrated into any CNN architecture for end-to-end training with the base CNN, without excessive additional overhead.
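The channel half of CBAM-style attention can be sketched in a few lines: average-pool each channel to a scalar descriptor, pass it through a weighting function, and rescale the channel by the resulting sigmoid gate. The per-channel scalar weights below are an illustrative stand-in for CBAM's learned shared MLP:

```python
import math

def channel_attention(feature_maps, weights):
    """feature_maps: list of 2D channels; weights: one scalar per channel
    standing in for the learned MLP. Each channel is rescaled by a
    sigmoid gate computed from its global average, so channels with
    stronger average activation are emphasized."""
    out = []
    for ch, w in zip(feature_maps, weights):
        avg = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))  # GAP
        gate = 1.0 / (1.0 + math.exp(-w * avg))                     # sigmoid
        out.append([[v * gate for v in row] for row in ch])
    return out

fmaps = [[[1.0, 1.0], [1.0, 1.0]],    # uniform low-activation channel
         [[0.0, 4.0], [0.0, 4.0]]]    # channel with strong activation
attended = channel_attention(fmaps, weights=[1.0, 1.0])
```

CBAM proper also computes a max-pooled descriptor and a spatial attention map; this sketch keeps only the average-pooled channel path to show the reweighting idea.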

The Local Interpretable Model-agnostic Explanations (LIME) method can also be adopted to address poor interpretability. LIME builds a locally linear, separable model around a prediction through local perturbation sampling and linear approximation, and estimates the importance of each feature from the feature weights of the linear model [ 82 , 83 ].
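The core of LIME (perturb around one instance, query the black box, fit a local linear surrogate, and read off the feature weight) can be sketched for a single feature with an ordinary least-squares fit. The black-box model here is a toy stand-in, and LIME's proximity weighting and sparse feature selection are omitted:

```python
import random

def lime_1d(black_box, x0, n_samples=200, radius=0.5, seed=0):
    """Sample perturbations near x0, query the black box, and fit a
    local line y ≈ a + b*x by ordinary least squares. The slope b is
    the local importance of the feature at x0."""
    rng = random.Random(seed)
    xs = [x0 + rng.uniform(-radius, radius) for _ in range(n_samples)]
    ys = [black_box(x) for x in xs]
    mx = sum(xs) / n_samples
    my = sum(ys) / n_samples
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

model = lambda x: x * x          # toy black box; near x0=3 it is ~linear
intercept, slope = lime_1d(model, x0=3.0)   # slope ≈ local derivative 6
```

With several features, the same idea yields one weight per feature, which is exactly the importance ranking LIME reports.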

For images, interpretability methods based on activation mapping can be adopted, such as Class Activation Mapping (CAM) [ 84 ], Grad-CAM [ 85 ], Grad-CAM++ [ 86 ], and Score-CAM [ 87 ]. These methods generate a saliency map from a linearly weighted combination of activation maps to highlight important areas in image space. The saliency map highlights the input features considered relevant to the learning model's prediction, and requires neither training data nor modification of the model.
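The construction shared by the CAM family, a linear weighted combination of channel activation maps followed by a ReLU, reduces to the following sketch. The weights here are illustrative; the variants differ mainly in how the weights are obtained (final-layer weights for CAM, gradients for Grad-CAM, and so on):

```python
def class_activation_map(activations, weights):
    """Linearly combine the channel activation maps with per-channel
    weights and clip negatives to zero (ReLU), yielding a spatial
    saliency map over the image."""
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for a_map, wt in zip(activations, weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wt * a_map[i][j]
    return [[max(0.0, v) for v in row] for row in cam]

acts = [[[1.0, 0.0], [0.0, 0.0]],   # channel firing at the top-left
        [[0.0, 0.0], [0.0, 1.0]]]   # channel firing at the bottom-right
# A positive weight highlights the first channel's region; the negative
# weight suppresses the second channel's region.
cam = class_activation_map(acts, weights=[2.0, -1.0])
```

Upsampled to the input resolution and overlaid on the image, such a map is the familiar heat map clinicians see highlighting, e.g., a tumor region.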

4.2 Data imbalance

Medical data are almost always imbalanced, because fewer people are sick than healthy. When the data are severely imbalanced, the model tends to assign all samples to the majority class. For example, if a model is trained to predict whether a patient has a tumor and the number of negative samples (patients without a tumor) in the training set is far higher than the number of positive samples, the model will diagnose every new patient as tumor-free, which is obviously not what we want.
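This degenerate behaviour is easy to demonstrate with illustrative numbers: on a 95:5 split, a model that always predicts "no tumor" scores 95% accuracy while detecting no positive case at all, which is why recall (and metrics like F1) must accompany accuracy on imbalanced data:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model caught."""
    tp = sum(t == positive and p == positive
             for t, p in zip(y_true, y_pred))
    return tp / sum(t == positive for t in y_true)

y_true = [1] * 5 + [0] * 95      # 5 tumor patients out of 100
y_pred = [0] * 100               # classifier always says "no tumor"

acc = accuracy(y_true, y_pred)   # looks excellent: 0.95
rec = recall(y_true, y_pred)     # reveals total failure: 0.0
```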

For image data, Generative Adversarial Networks (GAN) [ 88 ] can be used: a GAN can generate minority-class samples close to the real samples and thus mitigate the data imbalance. For binary classification problems, the Synthetic Minority Oversampling Technique (SMOTE) [ 89 ] can also be used. SMOTE over-samples the minority class by synthesizing new samples between neighboring minority samples (and can be combined with down-sampling of the majority class), so that the proportion of positive and negative samples reaches a balanced state.

Structured data can also use the SMOTE method, but up-sampling destroys the discreteness of the data, turning discrete features into continuous ones and leaving the training and test sets with inconsistent data types, which hinders the learning of FM algorithms. If the number of minority-class samples is too small, down-sampling leads to a serious shortage of training samples. These remain questions for future study.
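The essence of SMOTE, synthesizing a new minority sample on the line segment between a real minority sample and one of its minority-class neighbors, can be sketched as follows (the nearest-neighbor search is omitted and the data are illustrative). The sketch also makes the caveat above concrete: the interpolation produces continuous values, so a discrete feature would stop being discrete:

```python
import random

def smote_sample(x, neighbor, rng):
    """New point x + gap * (neighbor - x) with gap drawn from [0, 1],
    so the synthetic sample lies on the segment between the two real
    minority-class samples."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
minority = [[1.0, 2.0], [1.5, 2.5], [0.8, 1.9]]
x = minority[0]
neighbor = minority[1]        # in full SMOTE, a k-nearest neighbor of x
synthetic = smote_sample(x, neighbor, rng)
```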

4.3 Data quality issues

Data quality remains the biggest challenge in model training. The excellent performance of deep learning models in disease prediction relies on high-quality medical data. Although medical data are readily available under existing conditions, their quality is often low. Moreover, there may be mismatches between the training samples and real samples, as well as abnormal features, all of which degrade model performance. Much medical data also requires experienced medical experts to provide sample labels.

For image, speech and other such data, quality can be improved using GANs, up-sampling, the Fourier transform and other methods. For structured data, data cleaning methods such as filling in missing values and removing duplicates and outliers are often used, and methods such as discretization, filter and wrapper feature selection, and Principal Component Analysis (PCA) are used to obtain higher-quality samples. Since we are discussing deep learning algorithms, it is also possible to build end-to-end models such as DeepFM that need no feature engineering, letting deep learning's automatic feature learning capability overcome data quality issues. This automatic learning ability can also be applied to sample label processing, which involves unsupervised learning and is beyond the scope of this article.

4.4 Too little data

Although a large amount of health data is now generated, many medical datasets involve privacy issues and are stored in separate institutions without being made public. As a result, many datasets cannot be used for practical research, so models cannot be fully trained and their true potential is hard to realize. Here we discuss only algorithmic solutions to this problem.

For images, Few-shot Learning [ 90 , 91 , 92 ] can be used: the model is trained on a large number of tasks to improve its generalization ability, so that when faced with a similar new task it achieves good results after only a small number of iterations. Few-shot Learning includes the following methods. (1) Model fine-tuning [ 93 , 94 ]: a model is pre-trained on a source dataset with many samples and then fine-tuned on a target dataset with few samples. This suits scenarios where the source and target datasets are similar; in practice they are usually dissimilar, which often leads to over-fitting. (2) Data augmentation: additional datasets or information are used to expand the target dataset or enhance the characteristics of its samples [ 95 , 96 ]. Early work expanded datasets through spatial transformations, which cannot increase the variety of samples; later, methods such as GANs were used for augmentation. (3) Meta-learning: the model learns meta-knowledge from a large number of tasks and uses it to adapt quickly to different new tasks; algorithms include Memory NN [ 97 , 98 ], Meta Network [ 99 ] and Model-Agnostic Meta-Learning (MAML) [ 100 ]. (4) Metric learning, also known as similarity learning: the distance between two samples is computed with a distance function to measure their similarity and decide whether they belong to the same category. A metric learning algorithm consists of an embedding module, which converts samples into vectors in a low-dimensional vector space, and a measurement module, which gives the similarity between samples.
Metric learning is divided into fixed-distance metric learning [ 101 ] and learnable-network metric learning [ 102 ].
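Fixed-distance metric learning in its simplest form is nearest-prototype classification: embed each class as the mean of its few support samples, then assign a query to the class whose prototype is closest under a fixed (here Euclidean) distance. The embeddings and labels below are illustrative:

```python
import math

def prototype(samples):
    """Class prototype = component-wise mean of the support samples."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def classify(query, prototypes):
    """Return the label whose prototype is nearest to the query in
    Euclidean distance (the fixed metric)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))

# Two classes, two support embeddings each: a 2-way 2-shot task.
support = {"benign": [[0.0, 0.0], [0.2, 0.1]],
           "malignant": [[2.0, 2.0], [1.8, 2.2]]}
protos = {label: prototype(s) for label, s in support.items()}
label = classify([1.9, 2.1], protos)
```

Learnable-network metric learning replaces the fixed Euclidean distance with a trained similarity network, but the classify-by-nearest-prototype structure stays the same.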

However, few-shot learning is mainly applied to images and is often ineffective on structured data. The idea of few-shot learning resembles how a child learns to distinguish animals: after seeing many animal pictures, when shown a picture of a rhino, the child can pick out rhinos among many animals. Images share certain similarities and have general large-scale datasets, so they can satisfy the requirement of a large number of similar tasks. Different diseases, however, have different features with different characteristics, so there is no general large dataset, and the requirement of many similar tasks is difficult to meet. At present, traditional machine learning algorithms (of low complexity), Boosting-based sampling algorithms, and feature selection are used to address small amounts of structured data. Of these, traditional machine learning algorithms and feature selection compensate for the over-fitting caused by small data volumes by reducing complexity (of the model or of the features). No more effective solution to this problem yet exists.

5 Future works and prospects

5.1 Incorporating digital twins

Digital Twins refers to building an identical entity in the digital world by digital means, in order to understand, analyze and optimize the physical entity. With the development of technologies such as AI, big data, virtual reality, IoT, and cloud computing [ 103 , 104 ], Digital Twins have begun to shine in industry, medicine and other fields. In medical care, the typical application is to create a model in the virtual world from real medical data, and then observe and analyze how the model responds to various stimuli, such as the feedback produced by a new drug or a new treatment regimen. These real medical data come from EHRs, daily behavior databases, medical wearable devices, and more. Through Digital Twins, medical activities such as health monitoring, telemedicine, early disease diagnosis, and disease treatment can be realized [ 105 , 106 ], providing revolutionary solutions in the field of healthcare [ 107 ]. Health monitoring is an important element of modern medicine: wearable sensors in a Digital Twin enable ubiquitous monitoring of patients' health status [ 108 ], while also reducing medical costs and the number of hospitalizations and improving patients' quality of life [ 109 , 110 ].

Digital Twins can be combined with deep learning disease-prediction algorithms to realize faster, more advanced electronic and automated medical treatment. A general realization proceeds as follows. First, data are collected: various sensors, especially convenient wearable sensors, gather health information [ 111 , 112 ] and transmit it to the cloud; electronic medical record data and daily behavior data can also be collected. Then, from these collected medical data, a digital disease-prediction model is built in the cloud with deep learning algorithms. Finally, the digital model processes and analyzes the health data to predict the patient's physical condition, whether they are ill, the probability of illness, and so on. The analysis generates new knowledge and information [ 113 ], which helps adjust and upgrade the model and helps researchers better understand the mechanisms behind the disease, so as to find better treatments.

Many scholars have proposed combining Digital Twins with deep learning. For example, Chakshu et al. [ 114 ] proposed an inverse-analysis method for a cardiovascular Digital Twin using a virtual patient database. Given pressure waveforms from three non-invasively accessible blood vessels (carotid, femoral, and brachial), the blood pressure waveforms in various vessels of the body are calculated backwards with the help of LSTM cells; the resulting inverse-analysis system is mainly used to detect abdominal aortic aneurysm and its severity. Quilodrán-Casas et al. [ 115 ] created two Digital Twin systems based on SEIRS models, applied them to simulate the spatial and temporal spread of COVID-19, and compared their predictions with real data. They compared the performance of the two digital twin models [also known as Non-Intrusive Reduced Order Models (NIROM)]. The first uses PCA for dimensionality reduction and a Bi-LSTM with data correction (through optimal interpolation) for prediction; the second again uses PCA for dimensionality reduction and a GAN for prediction. Many other related studies exist.

In the future, Digital Twins and deep learning models should together enable a more intelligent processing mode, realizing a truly automatic and intelligent medical system and greatly reducing the workload of doctors. At the same time, more Digital Twin medical platforms need to be developed to achieve intelligent medical treatment on a wider scale. Intelligent medical care is one of the key components of the smart city and is indispensable to its realization. Therefore, while ensuring the security of Digital Twin medical platforms, their scope of application should be broadened further to serve user groups more comprehensively. Intelligence is one of the core elements of future medical and urban development; to truly achieve comprehensive medical intelligence, medical Digital Twins and deep learning technology must be integrated more tightly.

5.2 Promoting precision medicine

Precision medicine is the principle and practice of integrating modern medical technology with traditional medical methods, scientifically understanding the functions of the human body and the nature of disease, systematically optimizing the prevention and control of human disease, and maximizing individual and social health benefits with efficient, safe and economical healthcare services. In clinical practice, precision medicine pursues accurate and reasonable diagnosis and treatment for each patient, in order to minimize iatrogenic damage and medical costs while maximizing patient benefit. Compared with traditional medicine, it can provide patients with more effective, cheaper and more timely medical services. Since it was proposed in 2015, it has been central to global healthcare and one of the important goals of many sustainable development plans around the world [ 116 , 117 ]. The concept of precision medicine opens up new ideas for human health and healthcare [ 118 , 119 ].

Like personalized medicine, precision medicine focuses on individual differences [ 120 ], exploring the impact of individual factors on disease [ 121 ]. Assessing personal health from genomics, living environment, and similar factors, coupled with clinical data analysis, yields higher performance. For example, Panayides et al. [ 122 ] proposed that starting from radiomics and radiogenomics methods, combined with precision medicine, some abnormal conditions can be detected more quickly. Precision medicine also performs well in preventing malignant diseases such as cancers [ 123 , 124 ] and tumors [ 125 ]. It can be said that disease prediction and treatment are moving toward the era of precision medicine [ 126 ].

At present, much precision-medicine research is conducted in Western countries, while research in the Asia–Pacific region is still at an initial stage. On the one hand, the diversity and high quality of collected genomic data must be ensured; on the other hand, genetic characteristics representative of Asia–Pacific populations must be extracted. Both are urgent open problems, and both currently hinder development.

In the coming era, precision medicine will be combined with applications across multiple fields, realizing systematic medical diagnosis and driving healthcare in a more intelligent direction. For example, Lu and Harrison [ 127 ] pointed out that CNNs can perform large-scale medical image analysis and labeling and accurately obtain pathological information for different patients. Laplante and Akhloufi [ 128 ] proposed a deep neural network classifier that identifies the anatomical location of tumors: using 27 TCGA miRNA stem-loop cohorts, tumors at 20 anatomical sites were classified with 96.9% accuracy. Deep learning can therefore be combined with precision medicine [ 129 ] to better process big data and fundamentally advance precision medicine [ 130 ]. As a part of precision medicine, accurate disease prediction embodies enormous advantages and value and can advance modern medical technology. However, precision medicine is still at an exploratory stage of development [ 131 – 133 ]: the state of research varies greatly between diseases, and the application of deep learning techniques is still maturing. In the future, AI researchers should focus more on precision medicine and, drawing on radiomics and genomics research in the medical field, build deep learning models that better meet its requirements. Promoting the progress of precision medicine in this way also drives the multi-faceted development of deep learning, bringing it more in line with social needs.
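As a concrete illustration of the kind of multi-class tumor-site classification discussed above, the sketch below trains a softmax (multinomial logistic regression) classifier on synthetic feature vectors. It is a hedged stand-in: the cited work uses a deep neural network on real miRNA stem-loop profiles, whereas the data, class count, and hyperparameters here are all hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.1, steps=500):
    """Multinomial logistic regression trained by plain gradient descent
    on the cross-entropy loss."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]                   # one-hot labels
    for _ in range(steps):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y) / n            # cross-entropy gradient
    return W

# Hypothetical data: 4 "anatomical sites", 20 expression-like features.
rng = np.random.default_rng(1)
n_classes, d = 4, 20
centers = 2.0 * rng.standard_normal((n_classes, d))
y = rng.integers(0, n_classes, size=400)
X = centers[y] + rng.standard_normal((400, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize features

W = train_softmax(X, y, n_classes)
acc = (softmax(X @ W).argmax(axis=1) == y).mean()
```

On well-separated synthetic clusters even this linear model classifies nearly perfectly; the deep architecture of the cited work matters when the real feature-to-site mapping is nonlinear.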

6 Conclusion

This paper reviews deep learning algorithms in the field of disease prediction. According to the type of data processed, the algorithms are divided into structured-data algorithms, including ANN and FM-based deep learning, and unstructured-data algorithms, including CNN, RNN, and others. The paper expounds the principle, development history, and disease-prediction applications of each algorithm, analyzing the literature according to each algorithm's characteristics. Although these algorithms are mainstream now and for the foreseeable future, current research faces problems such as poor interpretability, sample imbalance, data quality, and scarcity of samples in some settings. This paper gives some provisional solutions and hopes better ones will emerge. Finally, we elaborate and analyze two future development trends of disease prediction: medical technology should be combined with Digital Twins to realize truly intelligent healthcare, and it should pay more attention to personalized treatment, integrating with precision medicine to serve individuals more conveniently. We hope this paper helps researchers understand the current development, open problems, and future trends of disease-prediction algorithms, so that they can focus on key algorithms, combine advanced technologies and concepts, and conduct more efficient, effective, and well-grounded research aligned with the direction of medical development.

Data availability

Not applicable.

Code availability

Consent for publication

We agree to publish.

Maurya, M.R., Riyaz, N.U., Reddy, M., Yalcin, H.C., Ouakad, H.M., Bahadur, I., Al-Maadeed, S., Sadasivuni, K.K.: A review of smart sensors coupled with Internet of Things and artificial intelligence approach for heart failure monitoring. Med. Biol. Eng. Comput. 59 (11), 2185–2203 (2021)


Shamshirband, S., Fathi, M., Dehzangi, A., Chronopoulos, A.T., Alinejad-Rokny, H.: A review on deep learning approaches in healthcare systems: taxonomies, challenges, and open issues. J. Biomed. Inform. 113 , 103627 (2021)

Hossain, M.S., Muhammad, G.: Deep learning based pathology detection for smart connected healthcare. IEEE Netw. 34 (6), 120–125 (2020)


Kumar, P.M., Gandhi, U.D.: A novel three-tier Internet of Things architecture with machine learning algorithm for early detection of heart diseases. Comput. Electr. Eng. 65 , 222–235 (2018)

Bakator, M., Radosav, D.: Deep learning and medical diagnosis: a review of literature. Multimodal Technol. Interact. 2 (3), 47 (2018)

Rendle, S.: Factorization machines. In: 2010 IEEE International Conference on Data Mining, pp. 995–1000. IEEE (2010)

Lin, X., Zhang, W., Zhang, M., Zhu, W., Pei, J., Zhao, P., Huang, J.: Online compact convexified factorization machine. In: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1633–1642 (2018)

Al-Galal, S.A.Y., Alshaikhli, I.F.T., Abdulrazzaq, M.: MRI brain tumor medical images analysis using deep learning techniques: a systematic review. Health Technol. 11 , 1–16 (2021)

Leevy, J.L., Khoshgoftaar, T.M., Villanustre, F.: Survey on RNN and CRF models for de-identification of medical free text. J. Big Data 7 (1), 1–22 (2020)

Hossain, M.S., Muhammad, G., Guizani, N.: Explainable AI and mass surveillance system-based healthcare framework to combat COVID-i9 like pandemics. IEEE Netw. 34 (4), 126–132 (2020)

Shorfuzzaman, M., et al.: MetaCOVID: a Siamese neural network framework with contrastive loss for n-shot diagnosis of COVID-19 patients. Pattern Recognit. 113 , 107700 (2021)

Khanam, J.J., Foo, S.Y.: A comparison of machine learning algorithms for diabetes prediction. ICT Express 7 (4), 432–439 (2021)

Soundarya, S., Sruthi, M., Bama, S.S., Kiruthika, S., Dhiyaneswaran, J.: Early detection of Alzheimer disease using gadolinium material. Mater. Today Proc. 45 , 1094–1101 (2021)

Pasha, S.N., Ramesh, D., Mohmmad, S., Harshavardhan, A., et al.: Cardiovascular disease prediction using deep learning techniques. IOP Conf. Ser. Mater. Sci. Eng. 981 , 022006 (2020)

Chen, C., Dongxing, W., Chunyan, H., Xiaojie, Y.: Exploiting social media for stock market prediction with factorization machine. In: 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 2, pp. 142–149. IEEE (2014)

Zhang, W., Du, T., Wang, J.: Deep learning over multi-field categorical data. In: European Conference on Information Retrieval, pp. 45–57. Springer (2016)

Qu, Y., Cai, H., Ren, K., Zhang, W., Yu, Y., Wen, Y., Wang, J.: Product-based neural networks for user response prediction. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), 2016, pp. 1149–1154. IEEE (2016)

He, X., Chua, T.-S.: Neural factorization machines for sparse predictive analytics. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 355–364 (2017)

Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al.: Wide and deep learning for recommender systems. In: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 2016, pp. 7–10 (2016)

Guo, H., Tang, R., Ye, Y., Li, Z., He, X.: DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint (2017). arXiv:1703.04247

Xiao, J., Ye, H., He, X., Zhang, H., Wu, F., Chua, T.-S.: Attentional factorization machines: learning the weight of feature interactions via attention networks. arXiv preprint (2017). arXiv:1708.04617

Zhang, J., Wu, Z., Li, F., Li, W., Ren, T., Li, W., Chen, J.: Deep attentional factorization machines learning approach for driving safety risk prediction. J. Phys. Conf. Ser. 1732 , 012007 (2021)

Zhang, J., Huang, T., Zhang, Z.: FAT-DeepFFM: field attentive deep field-aware factorization machine. arXiv preprint (2019). arXiv:1905.06336

Tao, Z., Wang, X., He, X., Huang, X., Chua, T.-S.: HoAFM: a high-order attentive factorization machine for CTR prediction. Inf. Process. Manag. 57 (6), 102076 (2020)

Yu, H., Yin, J., Li, Y.: Gate attentional factorization machines: an efficient neural network considering both accuracy and speed. Appl. Sci. 11 (20), 9546 (2021)

Wen, P., Yuan, W., Qin, Q., Sang, S., Zhang, Z.: Neural attention model for recommendation based on factorization machines. Appl. Intell. 51 (4), 1829–1844 (2021)

Zhou, F., Zhou, H.-M., Yang, Z., Yang, L.: EMD2FNN: a strategy combining empirical mode decomposition and factorization machine based neural network for stock market trend prediction. Expert Syst. Appl. 115 , 136–151 (2019)

Zhang, W., Zhang, X., Wang, H.: High-order factorization machine based on cross weights network for recommendation. IEEE Access 7 , 145746–145756 (2019)

Lu, W., Yu, Y., Chang, Y., Wang, Z., Li, C., Yuan, B.: A dual input-aware factorization machine for CTR prediction. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021, pp. 3139–3145 (2021)

Deng, W., Pan, J., Zhou, T., Flores, A., Lin, G.: A sparse deep factorization machine for efficient CTR prediction. arXiv preprint (2020). arXiv:2002.06987

Yu, Y., Jiao, L., Zhou, N., Zhang, L., Yin, H.: Enhanced factorization machine via neural pairwise ranking and attention networks. Pattern Recognit. Lett. 140 , 348–357 (2020)

Pande, H.: Field-embedded factorization machines for click-through rate prediction. arXiv preprint (2020). arXiv:2009.09931

Qi, G., Li, P.: Deep field-aware interaction machine for click-through rate prediction. Mob. Inf. Syst. (2021). https://doi.org/10.1155/2021/5575249

Zhang, Q.-L., Rao, L., Yang, Y.: DGFFM: generalized field-aware factorization machine based on DenseNet. In: 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8. IEEE (2019)

Chanaa, A., El Faddouli, N.-E.: Latent graph predictor factorization machine (LGPFM) for modeling feature interactions weight. In: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications, 2020, pp. 1–5 (2020)

Guo, Y., Cheng, Z., Jing, J., Lin, Y., Nie, L., Wang, M.: Enhancing factorization machines with generalized metric learning. IEEE Trans. Knowl. Data Eng. 34 (8), 3740–3753 (2020)

Chen, X., Qian, J.: An assistant diagnosis system for sepsis in children based on neural network and factorization. Sci. Technol. Eng. (2017)

Ronge, R., Nho, K., Wachinger, C., Pölsterl, S.: Alzheimer’s disease diagnosis via deep factorization machine models. In: International Workshop on Machine Learning in Medical Imaging, 2021, pp. 624–633. Springer (2021)

Fan, Y., Li, D., Liu, Y., Feng, M., Chen, Q., Wang, R.: Toward better prediction of recurrence for Cushing’s disease: a factorization-machine based neural approach. Int. J. Mach. Learn. Cybern. 12 (3), 625–633 (2021)

LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1 (4), 541–551 (1989)

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324 (1998)

Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 , 1097–1105 (2012)


Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint (2014). arXiv:1409.1556

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9 (2015)

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778 (2016)

Acharya, U.R., Oh, S.L., Hagiwara, Y., Tan, J.H., Adeli, H.: Deep convolutional neural network for the automated detection and diagnosis of seizure using EEG signals. Comput. Biol. Med. 100 , 270–278 (2018)

Muhammad, G., et al.: EEG-based pathology detection for home health monitoring. IEEE J. Sel. Areas Commun. 39 (2), 603–610 (2021)

Hossain, M.S., Amin, S.U., Alsulaiman, M., Muhammad, G.: Applying deep learning for epilepsy seizure detection and brain mapping visualization. ACM Trans. Multimed. Comput. Commun. Appl. 15 (1s), 1–17 (2019)

Chanu, M.M., Thongam, K.: Computer-aided detection of brain tumor from magnetic resonance images using deep learning network. J. Ambient Intell. Humaniz. Comput. 12 (7), 6911–6922 (2021)

Seven, G., Silahtaroglu, G., Kochan, K., Ince, A.T., Arici, D.S., Senturk, H.: Use of artificial intelligence in the prediction of malignant potential of gastric gastrointestinal stromal tumors. Dig. Dis. Sci. 67 (1), 273–281 (2022)

Yin, X.: Pigmented skin lesions image classification based on residual network. In: 2021 6th International Conference on Machine Learning Technologies, 2021, pp. 74–81 (2021)

Rahman, A., et al.: Adversarial examples—security threats to COVID-19 deep learning systems in medical IoT devices. IEEE Internet Things J. 8 (12), 9603–9610 (2021)

Amin, J., Sharif, M., Yasmin, M., Saba, T., Anjum, M.A., Fernandes, S.L.: A new approach for brain tumor segmentation and classification based on score level fusion using transfer learning. J. Med. Syst. 43 (11), 1–16 (2019)

Wang, B., Perronne, L., Burke, C., Adler, R.S.: Artificial intelligence for classification of soft-tissue masses at us. Radiol. Artif. Intell. 3 (1), 200125 (2020)

Chelghoum, R., Ikhlef, A., Hameurlaine, A., Jacquir, S.: Transfer learning using convolutional neural network architectures for brain tumor classification from MRI images. In: IFIP International Conference on Artificial Intelligence Applications and Innovations, 2020, pp. 189–200. Springer (2020)

Kaur, T., Gandhi, T.K.: Deep convolutional neural networks with transfer learning for automated brain image classification. Mach. Vis. Appl. 31 (3), 1–16 (2020)

Rehman, A., Naz, S., Razzak, M.I., Akram, F., Imran, M.: A deep learning-based framework for automatic brain tumors classification using transfer learning. Circuits Syst. Signal Process. 39 (2), 757–775 (2020)

Kumar, S.S., Nandhini, M.: Entropy slicing extraction and transfer learning classification for early diagnosis of Alzheimer diseases with sMRI. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 17 (2), 1–22 (2021)

Tsai, M.-J., Tao, Y.-H.: Deep learning techniques for colorectal cancer tissue classification. In: 2020 14th International Conference on Signal Processing and Communication Systems (ICSPCS), 2020, pp. 1–8. IEEE (2020)

Eweje, F.R., Bao, B., Wu, J., Dalal, D., Liao, W.-H., He, Y., Luo, Y., Lu, S., Zhang, P., Peng, X., et al.: Deep learning for classification of bone lesions on routine MRI. EBioMedicine 68 , 103402 (2021)

Kokkalla, S., Kakarla, J., Venkateswarlu, I.B., Singh, M.: Three-class brain tumor classification using deep dense inception residual network. Soft Comput. 25 (13), 8721–8729 (2021)

Ning, W., Li, S., Wei, D., Guo, L.Z., Chen, H.: Automatic detection of congestive heart failure based on a hybrid deep learning algorithm in the Internet of Medical Things. IEEE Internet Things J. 8 (16), 12550–12558 (2020)

Srinivasu, P.N., SivaSai, J.G., Ijaz, M.F., Bhoi, A.K., Kim, W., Kang, J.J.: Classification of skin disease using deep learning neural networks with MobileNet V2 and LSTM. Sensors 21 (8), 2852 (2021)

Toğaçar, M., Ergen, B., Cömert, Z.: Tumor type detection in brain MR images of the deep model developed using hypercolumn technique, attention modules, and residual blocks. Med. Biol. Eng. Comput. 59 (1), 57–70 (2021)

Jiao, Z., Gao, X., Wang, Y., Li, J.: A parasitic metric learning net for breast mass classification based on mammography. Pattern Recognit. 75 , 292–301 (2018)

Tripathi, S., Singh, S.K.: Cell nuclei classification in histopathological images using hybrid OLConvNet. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 16 (1s), 1–22 (2020)

Cheng, Y., Wang, F., Zhang, P., Hu, J.: Risk prediction with electronic health records: a deep learning approach. In: Proceedings of the 2016 SIAM International Conference on Data Mining, 2016, pp. 432–440. SIAM (2016)

Gunduz, H.: Deep learning-based Parkinson’s disease classification using vocal feature sets. IEEE Access 7 , 115540–115551 (2019)

Sajja, T.K., Kalluri, H.K.: A deep learning method for prediction of cardiovascular disease using convolutional neural network. Rev. d’Intell. Artif. 34 (5), 601–606 (2020)

Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint (2014). arXiv:1409.2329

Ma, Q., Lin, Z., Yan, J., Chen, Z., Yu, L.: Mode-LSTM: a parameter-efficient recurrent network with multi-scale for sentence classification. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6705–6715 (2020)

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint (2014). arXiv:1406.1078

Pham, T., Tran, T., Phung, D., Venkatesh, S.: DeepCare: a deep dynamic memory model for predictive medicine. In: Pacific–Asia Conference on Knowledge Discovery and Data Mining, 2016, pp. 30–41. Springer (2016)

Chu, J., Dong, W., He, K., Duan, H., Huang, Z.: Using neural attention networks to detect adverse medical events from electronic health records. J. Biomed. Inform. 87 , 118–130 (2018)

Tran, D.T., Vo, H.T., Nguyen, D.D., Nguyen, Q.M., Huynh, L.T., Le, L.T., Do, H.T., Quan, T.T.: A predictive model for ECG signals collected from specialized IoT devices using deep learning. In: 2018 5th NAFOSTED Conference on Information and Computer Science (NICS), 2018, pp. 424–429. IEEE (2018)

Reddy, B.K., Delen, D.: Predicting hospital readmission for lupus patients: an RNN–LSTM-based deep-learning methodology. Comput. Biol. Med. 101 , 199–209 (2018)

Wang, L., Sha, L., Lakin, J.R., Bynum, J., Bates, D.W., Hong, P., Zhou, L.: Development and validation of a deep learning algorithm for mortality prediction in selecting patients with dementia for earlier palliative care interventions. JAMA Netw. Open 2 (7), 196972–196972 (2019)

Choi, E., Schuetz, A., Stewart, W.F., Sun, J.: Using recurrent neural network models for early detection of heart failure onset. J. Am. Med. Inform. Assoc. 24 (2), 361–370 (2017)

Choi, E., Bahadori, M.T., Schuetz, A., Stewart, W.F., Sun, J.: Doctor AI: predicting clinical events via recurrent neural networks. In: Machine Learning for Healthcare Conference, 2016, pp. 301–318. PMLR (2016)

Amin, J., Sharif, M., Raza, M., Saba, T., Sial, R., Shad, S.A.: Brain tumor detection: a long short-term memory (LSTM)-based learning model. Neural Comput. Appl. 32 (20), 15965–15973 (2020)

Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19 (2018)

Holzinger, A., Langs, G., Denk, H., Zatloukal, K., Müller, H.: Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 9 (4), 1312 (2019)

Ribeiro, M.T., Singh, S., Guestrin, C.: “why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144 (2016)

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929 (2016)

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626 (2017)

Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 839–847. IEEE (2018)

Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., Mardziel, P., Hu, X.: Score-CAM: score-weighted visual explanations for convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 24–25 (2020)

Mehrotra, A., Dukkipati, A.: Generative adversarial residual pairwise networks for one shot learning. arXiv preprint (2017). arXiv:1703.08033

Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: International Conference on Intelligent Computing, 2005, pp. 878–887. Springer (2005)

Xinye, L., Shenpeng, L., Jing, Z.: Survey of few-shot learning based on deep neural network. Appl. Res. Comput. 37 (08), 2241–2247 (2020)

Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few examples: a survey on few-shot learning. ACM Comput. Surv. (CSUR) 53 (3), 1–34 (2020)

Lee, K., Maji, S., Ravichandran, A., Soatto, S.: Meta-learning with differentiable convex optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10657–10665 (2019)

Gidaris, S., Komodakis, N.: Dynamic few-shot visual learning without forgetting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4367–4375 (2018)

Nakamura, A., Harada, T.: Revisiting fine-tuning for few-shot learning. arXiv preprint (2019). arXiv:1910.00216

Dixit, M., Kwitt, R., Niethammer, M., Vasconcelos, N.: AGA: attribute-guided augmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7455–7463 (2017)

Shen, W., Shi, Z., Sun, J.: Learning from adversarial features for few-shot classification. arXiv preprint (2019). arXiv:1903.10225

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: One-shot learning with memory-augmented neural networks. arXiv preprint (2016). arXiv:1605.06065

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: Meta-learning with memory-augmented neural networks. In: International Conference on Machine Learning, 2016, pp. 1842–1850. PMLR (2016)

Munkhdalai, T., Yu, H.: Meta networks. In: International Conference on Machine Learning, 2017, pp. 2554–2563. PMLR (2017)

Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, 2017, pp. 1126–1135. PMLR (2017)

Wu, X., Sahoo, D., Hoi, S.: Meta-RCNN: meta learning for few-shot object detection. In: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1679–1687 (2020)

Xiao, J., Xu, H., Zhao, W., Cheng, C., Gao, H.: A prior-mask-guided few-shot learning for skin lesion segmentation. Computing (2021). https://doi.org/10.1007/s00607-021-00907-z

El Saddik, A., Laamarti, F., Alja’Afreh, M.: The potential of digital twins. IEEE Instrum. Meas. Mag. 24 (3), 36–41 (2021)

Hossain, M.S., Muhammad, G.: Emotion-aware connected healthcare big data towards 5G. IEEE Internet Things J. 5 (4), 2399–2406 (2018)

Rathore, M.M., Shah, S.A., Shukla, D., Bentafat, E., Bakiras, S.: The role of AI, machine learning, and big data in digital twinning: a systematic literature review, challenges, and opportunities. IEEE Access 9 , 32030–32052 (2021)

El Saddik, A.: Digital twins: the convergence of multimedia technologies. IEEE MultiMed. 25 (2), 87–92 (2018)

Erol, T., Mendi, A., Dogan, D.: The digital twin revolution in healthcare. In: 2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), pp. 1–7 (2020). https://doi.org/10.1109/ISMSIT50672.2020.9255249

Pantelopoulos, A., Bourbakis, N.G.: A survey on wearable sensor-based systems for health monitoring and prognosis. IEEE Trans. Syst. Man Cybern. C 40 (1), 1–12 (2009)

Nguyen, H.H., Mirza, F., Naeem, M.A., Nguyen, M.: A review on IoT healthcare monitoring applications and a vision for transforming sensor data into real-time clinical feedback. In: 2017 IEEE 21st International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2017, pp. 257–262. IEEE (2017)

Hossain, M.S.: Cloud-supported cyber–physical localization framework for patients monitoring. IEEE Syst. J. 11 (1), 118–127 (2017)

Vesnic-Alujevic, L., Breitegger, M., Pereira, Â.G.: ‘Do-it-yourself’ healthcare? Quality of health and healthcare through wearable sensors. Sci. Eng. Ethics 24 (3), 887–904 (2018)

Hossain, M.S., Muhammad, G., Alamri, A.: Smart healthcare monitoring: a voice pathology detection paradigm for smart cities. Multimed. Syst. 25 (5), 565–575 (2019)

Wickramasinghe, N., Jayaraman, P.P., Zelcer, J., Forkan, A.R.M., Ulapane, N., Kaul, R., Vaughan, S.: A vision for leveraging the concept of digital twins to support the provision of personalised cancer care. IEEE Internet Comput. (2021). https://doi.org/10.1109/MIC.2021.3065381

Chakshu, N.K., Sazonov, I., Nithiarasu, P.: Towards enabling a cardiovascular digital twin for human systemic circulation using inverse analysis. Biomech. Model. Mechanobiol. 20 (2), 449–465 (2021)

Quilodrán-Casas, C., Silva, V.L., Arcucci, R., Heaney, C.E., Guo, Y., Pain, C.C.: Digital twins based on bidirectional LSTM and GAN for modelling the COVID-19 pandemic. Neurocomputing 470 , 11–28 (2022)

Afzal, M., Islam, S.R., Hussain, M., Lee, S.: Precision medicine informatics: principles, prospects, and challenges. IEEE Access 8 , 13593–13612 (2020)

Shorfuzzaman, M., et al.: Towards the sustainable development of smart cities through mass video surveillance: a response to the COVID-19 pandemic. Sustain. Cities Soc. 64 , 102582 (2021)

Llovet, J.M., Montal, R., Sia, D., Finn, R.S.: Molecular therapies and precision medicine for hepatocellular carcinoma. Nat. Rev. Clin. Oncol. 15 (10), 599–616 (2018)

Le Tourneau, C., Borcoman, E., Kamal, M.: Molecular profiling in precision medicine oncology. Nat. Med. 25 (5), 711–712 (2019)

Fujiwara, N., Friedman, S.L., Goossens, N., Hoshida, Y.: Risk factors and prevention of hepatocellular carcinoma in the era of precision medicine. J. Hepatol. 68 (3), 526–549 (2018)

Zhang, S., Bamakan, S.M.H., Qu, Q., Li, S.: Learning for personalized medicine: a comprehensive review from a deep learning perspective. IEEE Rev. Biomed. Eng. 12 , 194–208 (2018)

Panayides, A.S., Pattichis, M.S., Leandrou, S., Pitris, C., Constantinidou, A., Pattichis, C.S.: Radiogenomics for precision medicine with a big data analytics perspective. IEEE J. Biomed. Health Inform. 23 (5), 2063–2079 (2018)

Loomans-Kropp, H.A., Umar, A.: Cancer prevention and screening: the next step in the era of precision medicine. NPJ Precis. Oncol. 3 (1), 1–8 (2019)

Regel, I., Mayerle, J., Ujjwal Mukund, M.: Current strategies and future perspectives for precision medicine in pancreatic cancer. Cancers 12 (4), 1024 (2020)

Steuer, C.E., Ramalingam, S.S.: Tumor mutation burden: leading immunotherapy to the era of precision medicine? J. Clin. Oncol. Off. J. Am. Soc. Clin. Oncol. 36 (7), 631–632 (2018)

Hamamoto, R., Suvarna, K., Yamada, M., Kobayashi, K., Shinkai, N., Miyake, M., Takahashi, M., Jinnai, S., Shimoyama, R., Sakai, A., et al.: Application of artificial intelligence technology in oncology: towards the establishment of precision medicine. Cancers 12 (12), 3532 (2020)

Lu, L., Harrison, A.P.: Deep medical image computing in preventive and precision medicine. IEEE MultiMed. 25 (3), 109–113 (2018)

Laplante, J.-F., Akhloufi, M.A.: Predicting cancer types from miRNA stem-loops using deep learning. In: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2020, pp. 5312–5315. IEEE (2020)

Ahmed, Z., Mohamed, K., Zeeshan, S., Dong, X.: Artificial intelligence with multi-functional machine learning platform development for better healthcare and precision medicine. Database (2020). https://doi.org/10.1093/database/baaa010

Hulsen, T., Jamuar, S.S., Moody, A.R., Karnes, J.H., Varga, O., Hedensted, S., Spreafico, R., Hafler, D.A., McKinney, E.F.: From big data to precision medicine. Front. Med. 6 , 34 (2019)

Hey, S.P., Gerlach, C.V., Dunlap, G., Prasad, V., Kesselheim, A.S.: The evidence landscape in precision medicine. Sci. Transl. Med. (2020). https://doi.org/10.1126/scitranslmed.aaw7745

Dienstmann, R., Vermeulen, L., Guinney, J., Kopetz, S., Tejpar, S., Tabernero, J.: Consensus molecular subtypes and the evolution of precision medicine in colorectal cancer. Nat. Rev. Cancer 17 (2), 79–92 (2017)

Dayem Ullah, A.Z., Oscanoa, J., Wang, J., Nagano, A., Lemoine, N.R., Chelala, C.: SNPnexus: assessing the functional relevance of genetic variation to facilitate the promise of precision medicine. Nucleic Acids Res. 46 (W1), 109–113 (2018)


Author information

Authors and Affiliations

College of Computer Science and Technology, Qingdao University, Ningxia Road, Qingdao, 266071, China

Zengchen Yu, Zhibo Wan & Shuxuan Xie

Psychiatric Department, Qingdao Municipal Hospital, Zhuhai Road, Qingdao, 266071, China

Department of Game Design, Faculty of Arts, Uppsala University, 75105, Uppsala, Sweden


Contributions

All authors contributed equally.

Corresponding author

Correspondence to Zhibo Wan.

Ethics declarations

Conflict of interest

Ethical approval

Our research does not address ethical issues.

Informed consent

We agree to participate.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Yu, Z., Wang, K., Wan, Z. et al. Popular deep learning algorithms for disease prediction: a review. Cluster Comput 26 , 1231–1251 (2023). https://doi.org/10.1007/s10586-022-03707-y


Received: 02 February 2022

Revised: 07 July 2022

Accepted: 03 August 2022

Published: 13 September 2022

Issue Date: April 2023

DOI: https://doi.org/10.1007/s10586-022-03707-y


  • Artificial neural network
  • Factorization machine
  • Convolutional neural network
  • Recurrent neural network
  • Open access
  • Published: 20 November 2022

Chronic kidney disease prediction using machine learning techniques

  • Dibaba Adeba Debal 1 &
  • Tilahun Melak Sitote 2  

Journal of Big Data volume  9 , Article number:  109 ( 2022 ) Cite this article


Goal three of the UN's Sustainable Development Goals is good health and well-being, and it clearly identifies non-communicable diseases as an emerging challenge. One of its targets is to reduce premature mortality from non-communicable diseases by one third by 2030. Chronic kidney disease (CKD) is among the significant contributors to morbidity and mortality from non-communicable diseases, affecting an estimated 10–15% of the global population. Early and accurate detection of the stages of CKD is believed to be vital to minimizing the impact of complications such as hypertension, anemia (low blood count), mineral bone disorder, poor nutritional health, acid-base abnormalities, and neurological complications through timely intervention with appropriate medications. Various studies have applied machine learning techniques to detect CKD at a premature stage, but their focus has not mainly been on predicting the specific stage. In this study, both binary classification and multiclass stage prediction have been carried out. The prediction models used are Random Forest (RF), Support Vector Machine (SVM), and Decision Tree (DT). Analysis of variance and recursive feature elimination with cross-validation have been applied for feature selection. The models were evaluated using tenfold cross-validation. The results from the experiments indicate that RF based on recursive feature elimination with cross-validation performs better than SVM and DT.

Introduction

Chronic kidney disease (CKD) is a non-communicable disease that has significantly contributed to morbidity, mortality, and admission rates of patients worldwide [ 2 ]. It is quickly expanding and becoming one of the major causes of death all over the world. A report covering 1990 to 2013 indicated that the global yearly life loss caused by CKD increased by 90%, making it the 13th leading cause of death in the world [ 1 ]. 850 million people throughout the world are likely to have kidney disease from different factors [ 3 ]. According to the World Kidney Day report of 2019, at least 2.4 million people die every year from kidney-related disease, and it is currently the 6th fastest-growing cause of death worldwide. CKD is becoming a challenging public health problem with increasing prevalence worldwide. Its burden is even higher in low-income countries, where detection, prevention, and treatment remain low [ 2 ]. Kidney disease is a serious public health problem in Ethiopia, affecting hundreds of thousands of people irrespective of age and sex [ 4 ]. The lack of safe water, an appropriate diet, and physical activity is believed to have contributed, and communities living in rural areas have limited knowledge about CKD. According to the WHO report of 2017, the number of deaths in Ethiopia due to kidney disease was 4,875, or 0.77% of total deaths, ranking the country 138th in the world. The age-adjusted death rate was 8.46 per 100,000 of the population, and by 2018 it had increased to 12.70 per 100,000, ranking the country 109th [ 3 ].

The National Kidney Foundation classifies CKD into five stages based on abnormal kidney function and reduced Glomerular Filtration Rate (GFR), which measures the level of kidney function. The mildest stages (stages 1 and 2) present only a few symptoms, while stage 5 is considered end-stage or kidney failure. The Renal Replacement Therapy (RRT) needed for total kidney failure is very expensive, and the treatment is not available in most developing countries like Ethiopia. As a result, the management of kidney failure and its complications is very difficult in developing countries due to the shortage of facilities and physicians and the high cost of treatment [ 4 , 5 ]. Hence, early detection of CKD is essential to minimize the economic burden and maximize the effectiveness of treatment [ 6 ]. Predictive analysis using machine learning techniques can support early detection of CKD through efficient and timely interventions [ 7 ]. In this study, Random Forest (RF), Support Vector Machine (SVM) and Decision Tree (DT) have been used to detect CKD. Most previous research focused on two classes, which makes treatment recommendations difficult because the type of treatment to be given depends on the severity of CKD.
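To make the staging concrete, the standard GFR cut-offs (well-known thresholds, not specific to this paper; the function name is ours) can be expressed as a small lookup:

```python
def ckd_stage(gfr):
    """Map an estimated GFR (mL/min/1.73 m^2) to a CKD stage (1-5).

    Uses the standard NKF GFR thresholds; in clinical practice,
    stages 1-2 additionally require other markers of kidney damage.
    """
    if gfr >= 90:
        return 1  # normal or high GFR
    if gfr >= 60:
        return 2  # mildly reduced
    if gfr >= 30:
        return 3  # moderately reduced
    if gfr >= 15:
        return 4  # severely reduced
    return 5      # kidney failure (end-stage renal disease)
```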

Related works

Different machine-learning techniques have been used for effective classification of chronic kidney disease from patients’ data.

Charleonnan et al. [ 8 ] compared predictive models, namely K-nearest neighbors (KNN), support vector machine (SVM), logistic regression (LR), and decision tree (DT), on the Indian Chronic Kidney Disease (CKD) dataset in order to select the best classifier for predicting chronic kidney disease. They identified that SVM had the highest classification accuracy of 98.3% and the highest sensitivity of 0.99.

Salekin and Stankovic [ 9 ] evaluated classifiers such as K-NN, RF, and ANN on a dataset of 400 records. Wrapper feature selection was implemented, and five features were selected for model construction. The highest classification accuracy was 98%, achieved by RF with an RMSE of 0.11. Tekale et al. [ 10 ] worked on “Prediction of Chronic Kidney Disease Using Machine Learning Algorithm” with a dataset consisting of 400 instances and 14 features, using decision tree and support vector machine. The dataset was preprocessed and the number of features reduced from 25 to 14. SVM was reported as the better model, with an accuracy of 96.75%.

Xiao et al. [ 11 ] predicted chronic kidney disease progression using logistic regression, Elastic Net, lasso regression, ridge regression, support vector machine, random forest, XGBoost, neural network, and k-nearest neighbor, and compared the models based on their performance. They used the history data of 551 patients with proteinuria, with 18 features, and classified the outcome as mild, moderate, or severe. They concluded that logistic regression performed best, with an AUC of 0.873 and sensitivity and specificity of 0.83 and 0.82, respectively.

Mohammed and Beshah [ 13 ] developed a self-learning knowledge-based system for the diagnosis and treatment of the first three stages of chronic kidney disease using machine learning. A small amount of data was used, and they developed a prototype that enables the patient to query the KBS for advice. They used a decision tree to generate the rules. The overall performance of the prototype was reported as 91% accurate.

Priyanka et al. [ 12 ] carried out chronic kidney disease prediction using Naïve Bayes. They also tested other algorithms, such as KNN (K-Nearest Neighbor), SVM (Support Vector Machine), decision tree, and ANN (Artificial Neural Network), and found that Naïve Bayes gave the best accuracy of 94.6%.

Almasoud and Ward [ 13 ] aimed to test the ability of machine learning algorithms to predict chronic kidney disease using a subset of features. They used the Pearson correlation, ANOVA, and Cramér's V tests to select predictive features, and built models using the LR, SVM, RF, and GB machine learning algorithms. They concluded that Gradient Boosting had the highest accuracy, with an F-measure of 99.1.

Yashfi [ 14 ] predicted the risk of CKD using machine learning algorithms by analyzing the data of CKD patients. They extracted 20 out of 25 features and applied Random Forest and an Artificial Neural Network. RF achieved the highest accuracy, 97.12%.

Rady and Anwar [ 15 ] compared Probabilistic Neural Network (PNN), Multilayer Perceptron (MLP), Support Vector Machine (SVM), and Radial Basis Function (RBF) algorithms for predicting kidney disease stages. The research was conducted on a small dataset with few features. The results show that the Probabilistic Neural Network algorithm gives the highest overall classification accuracy, 96.7%.

Alsuhibany et al. [ 16 ] presented an ensemble of deep learning based clinical decision support systems (EDL-CDSS) for CKD diagnosis in the IoT environment. The presented technique involves the Adaptive Synthetic (ADASYN) technique for the outlier detection process and an ensemble of three models: a deep belief network (DBN), a kernel extreme learning machine (KELM), and a convolutional neural network with gated recurrent unit (CNN-GRU). A quasi-oppositional butterfly optimization algorithm (QOBOA) is also employed for hyperparameter tuning of the DBN and CNN-GRU. The researchers concluded that the EDL-CDSS method is capable of proficiently detecting the presence of CKD in the IoT environment.

Poonia et al. [ 17 ] employed various machine learning algorithms, including k-nearest neighbors (KNN), artificial neural networks (ANN), support vector machines (SVM), Naïve Bayes (NB), and logistic regression, together with Recursive Feature Elimination (RFE) and Chi-square feature-selection techniques. A publicly available dataset of healthy and kidney disease patients was used to build and analyze the prediction models. The study found that a logistic regression-based prediction model with optimal features chosen using the Chi-square technique had the highest accuracy, 98.75%.

Vinod [ 18 ] assessed seven supervised machine learning algorithms, namely K-Nearest Neighbor, Decision Tree, Support Vector Machine, Random Forest, Neural Network, Naïve Bayes, and Logistic Regression, to find the most suitable model for BCD prediction based on different performance evaluations. The results showed that k-NN was the best performer on the BCD dataset, with 97% accuracy.

The above reviews indicate that several studies have been conducted on chronic kidney disease prediction using machine-learning techniques. Various parameters play an important role in improving model performance, such as the size and quality of the dataset and when it was collected. This study focuses on chronic kidney disease prediction using machine learning models on a dataset collected from St. Paulo's Hospital in Ethiopia that is larger and more recent than the dataset available online, with five classes (notckd, mild, moderate, severe, and ESRD) as well as binary classes (ckd and notckd). Most previously conducted research focused on two classes, which makes treatment recommendations difficult because the type of treatment to be given depends on the stage. Table 1 below summarizes some related works.

Materials and method

Data source and description.

The data source for this study is St. Paulo's Hospital. It is the second-largest public hospital in Ethiopia and admits a large number of patients with chronic diseases; it has a dialysis treatment and kidney transplant center. As shown in Table 2 , the dataset for this study consists of the chronic kidney disease records of patients admitted to the renal ward from 2018 to 2019. Some records were obtained from the same patient's history data at different times and different stages. To prepare the dataset and understand the features, interviews with domain experts were conducted. The dataset contains 1718 instances with 19 features, of which 12 are numeric and 7 are nominal. As detailed in Table 3 , the features include age, gender, blood pressure, specific gravity, chloride, sodium, potassium, blood urine nitrogen, serum creatinine, hemoglobin, red blood cell count, white blood cell count, mean cell volume, platelet count, hypertension, diabetes mellitus, anemia, and heart disease. In the multiclass distribution, 441 (25.67%) instances are end-stage renal disease or stage five, 399 (23.22%) are at a severe stage or stage four, 354 (20.61%) are at a moderate stage or stage three, 248 (14.44%) are at a mild stage or stage two, and 276 (16.07%) have no chronic kidney disease (normal). The class distribution for the binary task is 1442 (83.93%) ckd (stages 1 to 5) and 276 (16.07%) notckd. The binary-class distribution is imbalanced, so an oversampling resampling technique was used to bring the minority class up to the size of the majority class. After resampling, the total size of the binary-class dataset became 2888.
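The paper does not name a specific oversampling implementation; the idea of duplicating random minority-class rows until the classes are balanced can be sketched as follows (function name and seed are ours):

```python
import random

def random_oversample(X, y, seed=0):
    """Balance classes by duplicating random minority-class rows.

    A minimal sketch of random oversampling; the paper does not
    specify which oversampling implementation was used.
    """
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        rows = rows + [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows)
        y_out.extend([label] * target)
    return X_out, y_out
```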

Preprocessing

Real-world data is often inconsistent, which can affect the performance of models. Preprocessing the data before it is fed into classifiers is a vital part of developing a machine-learning model. The dataset for this study contains missing values that need to be handled appropriately, and it must also be in a suitable format for modeling. Hence, preprocessing has been conducted as shown in Fig.  1 .

figure 1

Chronic kidney disease dataset preprocessing steps

Cleaning Noisy Data: removing outliers and smoothing noisy data is an important part of preprocessing. Outliers are values that lie away from the range of the rest of the values. In clinical data, outliers may arise from the natural variance of the data. The potential outliers are the data points that fall above Q3 + 1.5(IQR) or below Q1 − 1.5(IQR), where Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 − Q1 [ 19 ].
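The Q1/Q3 fence rule above can be sketched in a few lines (a sketch only: quartile conventions vary between libraries, and the paper does not say which convention it used):

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # exclusive-method quartiles (Python >= 3.8)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]
```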

Handling Missing Values: data is not always available; values may be missing due to equipment malfunction, may be inconsistent with other recorded data and thus deleted, may not have been entered into the database due to misunderstanding, or may not have been considered important at the time of entry.

Patient data often has missing diagnostic test results that would help to predict the likelihood of diagnoses or the effectiveness of treatment [ 20 ]. Missing values have an impact on the performance of the prediction model. There are several ways of handling them, including dropping missing values and filling them in. Missing values are sometimes ignored when they are a small percentage of the data, i.e., under 10%, but this is not considered healthy for the model because a missing value can belong to an important feature contributing to model development. Missing values can also be replaced by zero, which contributes no information to the model. In this study, missing values were imputed with the mean of the observed values of the feature, because the missing features are numeric and mean imputation is better suited to numerical missing values.
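Mean imputation as described above amounts to replacing each missing entry with the average of the observed entries of the same column (a minimal sketch; the function name is ours and missing entries are represented as None):

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]
```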

Handling Categorical Data: in this step, data has been transformed into the required format. Nominal data was converted into numerical data in the form of 0 and 1. For instance, ‘Gender’ has a nominal value that can be labeled as 0 for female and 1 for male. After preprocessing, the resulting CSV file comprises only integer and float values for the different CKD-related features.
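For a two-valued nominal feature such as Gender, this encoding is a simple 0/1 mapping (a sketch; the helper name is ours):

```python
def encode_binary(values, positive):
    """Encode a two-valued nominal feature as 0/1 (1 for `positive`)."""
    return [1 if v == positive else 0 for v in values]

# e.g. Gender: 1 for male, 0 for female, as in the paper
codes = encode_binary(["male", "female", "male"], positive="male")
```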

Normalization: it is important to scale numerical features before fitting any models, as scaling is mandatory for some techniques such as nearest neighbors, SVMs, and deep learning [ 21 ]. There are different scaling techniques, and in this study Z-score normalization (or zero-mean normalization) has been used. The values of a feature are normalized based on its mean and standard deviation, as follows:

\(z = (x - \mu)/\sigma\)

where z is the Z-score, x is the feature value, \(\mu\) is the mean value and \(\sigma\) is the standard deviation.
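The Z-score formula can be applied per feature as follows (a sketch using the population standard deviation; the paper does not say whether the population or sample deviation was used):

```python
from statistics import mean, pstdev

def zscore(values):
    """Z-score normalization: z = (x - mu) / sigma."""
    mu, sigma = mean(values), pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]
```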

Feature selection

Identifying a subset of relevant predictive features is important for quality results [ 22 ]. Feature selection is the process of selecting the most important predictive features to use as input for models. It is an important preprocessing step to deal with the problem of high dimensionality. The main aim of feature selection is to select a subset of features that are relevant and independent of each other for training the model [ 23 ]. Feature selection is likewise crucial for developing a chronic kidney disease predictive model: it reduces the dimensionality and complexity of the data and makes the model faster, more effective, and more accurate. Hence, feature selection algorithms have been used to select relevant features after the construction of the dataset.

Filter, wrapper, and embedded techniques are widely used for feature selection in clinical datasets, including chronic kidney disease. A filter method is independent of the classification algorithm and uses the general characteristics of the data to evaluate and select relevant features: it removes irrelevant features by analyzing the properties of the dataset without involving the learning algorithm [ 24 ], and it is widely used because it is less complex. With a wrapper method, relevant features are selected using the classification algorithm itself. It is better than filter feature selection in terms of accuracy but requires more processing time. In this study, the univariate feature selection method was chosen from the filter methods because it is fast, efficient, and scalable, and recursive feature elimination with cross-validation (RFECV) was used from the wrapper methods.

Univariate Feature Selection (UFS): this is a popular, simple, and fast feature selection method used on healthcare datasets. It considers each feature separately to determine the strength of its relationship with the dependent variable. It is fast, scalable, and independent of the classifier. Several univariate algorithms are available, such as Pearson correlation, information gain, chi-square, and ANOVA (Analysis of Variance). In this study, feature selection was done using ANOVA, as shown in Eq.  2 :

\(F = MST/MSE\)

where F is the ANOVA coefficient, MST is the mean sum of squares of treatment and MSE is the mean sum of squares of error.
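The F ratio compares between-group variation (MST) to within-group variation (MSE); a large F means the feature separates the classes well. A minimal one-way ANOVA sketch (function name is ours):

```python
def anova_f(groups):
    """One-way ANOVA F statistic: F = MST / MSE."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    # MST: between-group (treatment) mean square
    mst = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups) / (k - 1)
    # MSE: within-group (error) mean square
    mse = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups) / (n - k)
    return mst / mse
```

A feature whose per-class value groups are well separated yields a much larger F than one whose groups overlap.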

Recursive Feature Elimination with Cross-Validation (RFECV): an optimization algorithm that develops a trained machine-learning model with relevant selected features by repeatedly eliminating irrelevant ones. It repetitively builds the model, sets aside the worst-performing feature at each iteration, and builds the next model with the remaining features until the best subset of features has been selected [ 25 ]. It eliminates the redundant and weak features whose deletion least affects training and keeps the independent and strong features to improve the generalization performance of the model [ 26 ]. The method uses an iterative procedure for feature ranking to find the features evaluated as most important. Because this technique works by interacting with a machine learning model, it first builds the model on the entire set of features and ranks each feature according to its importance.
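This procedure is available as `RFECV` in scikit-learn; a sketch on synthetic data (the paper's hospital dataset is not public, so the data, estimator choice, and sizes here are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the CKD table: 19 features, only a few informative.
X, y = make_classification(n_samples=300, n_features=19, n_informative=5,
                           random_state=0)
# RFECV drops the weakest feature each round and picks the subset size
# that maximizes cross-validated accuracy.
selector = RFECV(DecisionTreeClassifier(random_state=0), cv=10)
selector.fit(X, y)
print(selector.n_features_)  # how many features RFECV kept
print(selector.support_)     # boolean mask over the 19 columns
```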

Machine learning models

The aim of the study was to predict chronic kidney disease using machine-learning techniques. Three machine learning algorithms, Random Forest, Support Vector Machine, and Decision Tree, have been used. The algorithms were selected based on their popularity in chronic kidney disease prediction and their classification performance in previous research works [ 12 , 27 , 28 , 29 , 30 , 31 , 32 , 33 ].

Random Forest: Random Forest is an ensemble learner consisting of a collection of decision trees. It is used for both classification and regression. The model comprises a number of decision trees and outputs the target class with the highest number of votes among the trees' outputs [ 28 ]. Random Forest uses both bagging and random feature selection to build the trees and creates an uncorrelated forest. The group's prediction is more accurate than that of any individual tree. After the forest is built, test instances are passed down through each tree, and the trees make their respective class predictions [ 33 ]. The Random Forest pseudocode is shown in Fig.  2 .

figure 2

Random forest pseudocode [ 18 ]
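The pseudocode above corresponds to scikit-learn's `RandomForestClassifier`; a sketch on synthetic binary data (data and sizes are illustrative, not the paper's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the ckd/notckd task.
X, y = make_classification(n_samples=400, n_features=19, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Each tree is grown on a bootstrap sample considering random feature
# subsets at each split; the forest aggregates the trees' votes.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(forest.score(X_te, y_te))  # held-out accuracy
```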

Support vector machine (SVM): the Support Vector Machine is one of the most prominent and convenient supervised machine-learning algorithms and can be used for classification, learning, and prediction. A set of hyperplanes is built to classify all inputs in a high-dimensional space. A discrete hyperplane is created in the feature space of the training data, and instances are classified based on which side of the hyperplane they fall [ 30 ]. Hyperplanes are decision boundaries that separate the data points; support vectors are the data points closest to the hyperplane, and they determine its position and orientation. SVMs were mainly proposed for binary classification, but many researchers now apply them to multiclass classification, since today's data often needs to be classified into more than two classes. SVM solves multiclass problems through the two most popular approaches, one-versus-rest and one-versus-one. In this study, one-versus-rest (OVR) has been used for multiclass classification: it separates each class from the rest of the classes in the dataset, and it is the appropriate method for the Linear SVC used in this study. The pseudocode of SVM is shown in Fig.  3 .

figure 3

SVM pseudocode [ 18 ]
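The one-versus-rest strategy with a linear SVC can be sketched with scikit-learn as follows (synthetic five-class data standing in for notckd/mild/moderate/severe/ESRD; all sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Five synthetic classes standing in for the five CKD stages.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
# One-versus-rest fits one linear SVM per class, each separating that
# class from all the others with its own hyperplane.
ovr = OneVsRestClassifier(LinearSVC(dual=False)).fit(X, y)
print(len(ovr.estimators_))  # five binary hyperplanes
```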

Decision Tree (DT): one of the most popular supervised machine-learning algorithms for classification. A decision tree transforms the data into a tree representation through sorted feature values. Each internal node in a decision tree denotes a feature of an instance to be classified, and each leaf node represents a class label the instances belong to. The model uses a tree structure to split the dataset based on conditions, acting as a predictive model that maps observations about an item to a decision on the instance's target value [ 34 ]. The Decision Tree pseudocode is shown in Fig.  4 , and decision making in the binary-class chronic kidney disease task is shown in Fig.  5 .

figure 4

Decision tree pseudocode [ 18 ]

figure 5

Decision making in binary class of chronic kidney disease

Figure  6 shows the flow of model building using the three machine-learning algorithms with tenfold cross-validation. The machine learning models were developed for both multiclass and binary classification, and the best-performing model among the three algorithms was selected for each classification task.

figure 6

Model building flow diagram

Prediction model evaluation

Performance evaluation is a critical step in developing an accurate machine-learning model. A prediction model must be evaluated to ensure that it fits the dataset and works well on unseen data. The aim of performance evaluation is to estimate the generalization accuracy of a model on unseen/out-of-sample data. Cross-Validation (CV) is one method for evaluating and comparing models by dividing the data into partitions. The original dataset is partitioned into k equal-sized subsamples called folds: k − 1 are used to train a model and one is used to test or validate it. This process is repeated k times and the average performance is taken. Tenfold cross-validation (nine folds for training, one for testing) has been used in this study. Different performance evaluation metrics, including accuracy, precision, recall, F1-score, sensitivity, and specificity, have been computed.
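Tenfold cross-validation is a one-liner with scikit-learn's `cross_val_score`; the synthetic data and estimator choice below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=19, random_state=0)
# cv=10: train on nine folds, test on the tenth, rotating through all folds.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(list(scores))   # one accuracy per fold
print(scores.mean())  # the averaged performance
```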

True positive (TP): both the actual value and the predicted value are positive.

True negative (TN): both the actual value and the predicted value are negative.

False positive (FP): the actual value is negative but the predicted value is positive.

False negative (FN): the actual value is positive but the predicted value is negative.

Accuracy implies the ability of the classification algorithm to predict the classes of the dataset correctly. It is a measure of how close the predicted value is to the actual value [ 35 ]. Generally, accuracy is the ratio of correct predictions over the total number of instances, as shown in Eq.  3 :

Accuracy = (TP + TN)/(TP + TN + FP + FN)

Precision measures the true positives correctly predicted out of all values predicted as positive. Precision quantifies the ability of the classifier not to label a negative example as positive, as shown in Eq.  4 :

Precision = TP/(TP + FP)

The macro average is used for multiclass classification because it gives equal weight to each class. The macro average precision over N classes is shown in Eq.  5 :

Macro-average precision = (Precision_1 + Precision_2 + … + Precision_N)/N

Recall measures the rate of positive values that are correctly classified; it answers the question of what proportion of actual positives are correctly classified, as shown in Eq.  6 :

Recall = TP/(TP + FN)

Since the macro average is used to compute the recall of the models, the macro average recall is calculated as follows (Eq.  7 ):

Macro-average recall = (Recall_1 + Recall_2 + … + Recall_N)/N

The F-measure, also called the F1-score, is the harmonic mean of recall and precision, as shown in Eq.  8 :

F1-score = 2 × (Precision × Recall)/(Precision + Recall)

The macro average of the F1-score is calculated as follows (Eq.  9 ):

Macro-average F1-score = (F1_1 + F1_2 + … + F1_N)/N

Sensitivity

Sensitivity is also called the True Positive Rate. It is the proportion of actual positives that are correctly identified [ 36 ], as shown in Eq.  10 :

Sensitivity = TP/(TP + FN)

Specificity

Specificity is also called the True Negative Rate. It measures the fraction of negative values that are correctly classified, as shown in Eq.  11 :

Specificity = TN/(TN + FP)
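All of these metrics reduce to the four TP/TN/FP/FN counts; a compact sketch computing them for a binary labelling (function name and label convention are ours):

```python
def binary_metrics(y_true, y_pred, positive="ckd"):
    """Compute accuracy, precision, recall/sensitivity, F1, and specificity."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to sensitivity (true positive rate)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),  # true negative rate
    }
```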

Results and discussions

The feature selection process using the two methods, UFS and RFECV, resulted in two different sets of features. The resulting subsets of features were used to train RF, SVM, and DT. The selected features differ between the five-class and binary-class tasks because recursive feature elimination with cross-validation automatically and iteratively eliminates less predictive features depending on the model. Table 4 shows the number of selected features for the binary-class and five-class tasks, which sets the size of the respective datasets.

Evaluation results

The experiment was carried out with two feature selection methods and three classifiers for both binary and five-class classification, giving 18 models in total. Training and testing were executed using tenfold cross-validation, in which the dataset is randomly partitioned into ten equal-sized sets; the models are trained on nine folds and tested on the remaining fold, and the process is repeated for each fold. The results obtained are presented for the binary and five-class classification models. Modeling was first carried out for both tasks using the preprocessed dataset without feature selection; then the two feature selection methods were applied, as discussed in the following sections.

Binary classification models evaluation results

These classification models were built using the two-class dataset converted from the five-class dataset, with target classes notckd and ckd. The models were trained and tested using tenfold CV along with the other performance evaluation metrics. As discussed previously, modeling was first conducted on the preprocessed dataset without feature selection; then feature selection was implemented using UFS and RFECV to select the most predictive features. The performance measures of each test for the RF, SVM, and DT models, before and after feature selection, are presented in Table 5 .

The highest accuracy, 99.8%, was achieved by the RF with RFECV model using the 8 selected features. The results of the models before applying feature selection are shown graphically in Fig.  7 .

figure 7

Binary class classification without feature selection

Further, we carried out hyperparameter optimization on SVM for the binary dataset using grid search with cross-validation, without feature selection. The performance improved significantly to 99.83%, almost the same as the highest result, RF with RFECV. Another experiment on binary classification used Extreme Gradient Boosting (XGBoost), a powerful machine learning algorithm for classification and regression problems. Its performance was 98.96%, which does not exceed that of RF with RFECV.
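Grid search with cross-validation exhaustively scores each hyperparameter combination by CV and keeps the best; a sketch with scikit-learn's `GridSearchCV` (the data is synthetic and the grid values are illustrative, since the paper does not list the values it searched):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=19, random_state=0)
# Illustrative grid only; each combination is scored by 10-fold CV.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)  # mean CV accuracy of the best combination
```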

Multiclass classification models evaluation results

The multiclass models were similarly built using the preprocessed five-class dataset. The models were trained and tested using tenfold CV and evaluated with the other performance metrics. The performance metrics of each trained model (RF, SVM, and DT) are presented without and with feature selection in Table 6 . Models were first trained and tested with all features, and then the feature selection methods were applied.

Table 6 shows the CV performance metrics for the three classifiers on the multiclass dataset before and after applying feature selection. The best accuracy is 79%, from RF with RFECV using the 9 selected features. The results of the models after applying feature selection are shown graphically in Fig.  8 . Similarly, we carried out hyperparameter optimization on SVM for the multiclass dataset using grid search with cross-validation, without feature selection; performance improved significantly to 78.78%, which still does not exceed the highest-performing model, RF with RFECV. Another experiment on multiclass classification used Extreme Gradient Boosting (XGBoost), whose performance of 82.56% is better than that of RF with RFECV.

figure 8

Multiclass classification results after applying RFECV

Discussions

Chronic kidney disease is a global health threat and is becoming a silent killer in Ethiopia [ 6 ]. Many people die or suffer severely from the disease, mainly due to lack of awareness and the inability to detect it early. Thus, early prediction of chronic kidney disease is believed to help slow the progress of the disease. Machine learning plays a vital role in early disease identification and prediction; it supports the decisions of medical experts by enabling them to diagnose the disease quickly and accurately.

In this study, chronic kidney disease prediction has been carried out using machine learning techniques. The dataset consists of 19 features with numerical and nominal values, along with the class to which each instance belongs. The dataset had missing values, which were handled at the preprocessing step. After preprocessing, we prepared two datasets for two tasks: binary classification and five-class multiclass classification. The dataset comprises 1718 instances. The binary-class dataset was created by converting the five-class dataset to two class labels, notckd and ckd: all classes except notckd were converted to ckd. After preparing the binary-classification dataset, we observed that the class labels were imbalanced and applied a data resampling technique to balance the dataset.

Three machine-learning models and two feature selection methods were used. Model evaluation was done using tenfold cross-validation together with performance metrics such as precision, recall, F1-score, sensitivity, and specificity. The confusion matrix was used to show correctly and incorrectly classified instances. Initially, we applied the machine learning classifiers without feature selection for both the binary-class and five-class tasks. The machine learning models used in this study are Random Forest, Support Vector Machine, and Decision Tree. Then, feature selection techniques were applied along with the models in order to select predictive features. RFECV and UFS were the two feature selection methods applied to select the relevant features. UFS is a filter method that works independently of the machine learning model, whereas RFECV is a wrapper method that depends on the machine-learning algorithm. RFECV is also an automatic feature selection method that can select features without specifying the number of features to be selected; the number of selected features differs from model to model, as shown in the results. Some features were frequently selected in every feature set, which indicates that they are the most predictive features and have a strong predictive relation to the class. The most frequently selected features using both feature selection methods across all models are Serum Creatinine (Scr), Blood Urea Nitrogen (Bun), Hemoglobin (Hgb), and Specific Gravity (Sg). Pltc, Rbcc, Wbcc, Mcv, Dm, and Htn are the next most frequently selected features.
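For illustration, the univariate (filter) selection step could be sketched with scikit-learn's `SelectKBest` and an ANOVA F-test; both the scoring function and k = 9 are assumptions, since unlike RFECV a filter method needs the number of features specified up front:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the 19-feature dataset.
X, y = make_classification(n_samples=200, n_features=19,
                           n_informative=6, random_state=0)

# Each feature is scored independently of any model (filter method);
# the top k features by F-score are kept.
selector = SelectKBest(score_func=f_classif, k=9).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```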

Several related studies have been conducted globally to predict chronic kidney disease using machine learning and other techniques. However, very few works exist in the context of Ethiopia. Global studies have focused mainly on binary classification, with very limited attempts at multiclass classification. Additionally, the datasets used were small, which makes the resulting models susceptible to overfitting and difficult to compare.

Salekin and Stankovic [9] proposed a method for detecting chronic kidney disease using K-NN, RF, and NN; they analysed the characteristics of 24 features and ranked their predictive power. They used a dataset of 400 instances comprising 250 ckd and 150 notckd cases, selected five features for the final model, and evaluated performance using tenfold cross-validation. Compared to the solution proposed in this study, that work used a small dataset and focused only on binary classification (ckd and notckd). Even though the two datasets are not the same, the model proposed in this study achieves higher performance.

In the study conducted by Almasoud and Ward [13], logistic regression, support vector machine, random forest, and gradient boosting algorithms, together with feature selection methods such as the ANOVA test, Pearson's correlation, and Cramer's V test, were implemented to detect chronic kidney disease on a dataset of 400 instances. Compared to the binary classification performance of the model proposed here, the performance of that model is inferior. A kidney disease stage prediction model was developed by Rady and Anwar using PNN, MLP, SVM, and RBF [15]. They evaluated and compared the models by accuracy, reporting 96.7% for PNN, 87% for RBF, 60.7% for SVM, and 51.5% for MLP on a dataset of 361 instances with 25 features including the target class. All patients in that dataset had ckd, and eGFR was then calculated to identify the disease stages. The PNN (96.7%) and RBF (87%) models of Rady and Anwar [15] outperform the best model in this study, RF with RFECV at 79% accuracy, although the model families differ. However, the SVM model built in this study performs better than the SVM model of Rady and Anwar [15].

In this study, the proposed models were evaluated using tenfold cross-validation and performance metrics such as precision, recall, F1-score, sensitivity, and specificity. Four machine-learning models (RF, SVM, DT, and Extreme Gradient Boosting (XGBoost)) were implemented before applying feature selection. SVM yielded the highest binary-class accuracy at 99.8%, and 78.78% for the five-class task. RF achieved 99.7% for the binary class and 78.3% for the five-class task. XGBoost achieved 98.96% for the binary class and 82.56% for the five-class task, the highest five-class result. DT achieved 98.5% for the binary class and 77.5% for the five-class task. The two feature selection methods were then applied to both datasets. Both SVM and RF with RFECV produced the highest binary-class accuracy, while XGBoost retained the highest five-class accuracy at 82.56%. These results are promising, and we believe the models can be deployed to support medical experts in identifying the disease quickly and accurately. Thus, SVM and RF with RFECV for the binary task and XGBoost for the five-class task are recommended, based on accuracy, F1-score, and the other evaluation metrics.
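The evaluation protocol above (tenfold cross-validation with several metrics) can be sketched in scikit-learn on synthetic binary data; the metric names map to scikit-learn's built-in scorers, which is an assumption about tooling, not the authors' actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the balanced binary CKD dataset.
X, y = make_classification(n_samples=300, n_features=19, random_state=0)

# Tenfold CV with multiple metrics collected in one pass.
scores = cross_validate(SVC(), X, y, cv=10,
                        scoring=["accuracy", "precision", "recall", "f1"])
print("mean accuracy:", scores["test_accuracy"].mean())
```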

Early prediction is crucial for both experts and patients to prevent and slow the progression of chronic kidney disease to kidney failure. In this study, the machine-learning models RF, SVM, DT, and XGBoost, and the two feature selection methods RFECV and UFS, were used to build the proposed models, evaluated with tenfold cross-validation. First, the machine learning algorithms were applied to the original datasets with all 19 features; the highest accuracies on the original data were obtained with RF, SVM, and XGBoost: 99.8% for the binary class and 82.56% for the five-class task. DT produced the lowest performance compared to RF, while RF also produced the highest F1-score values. After feature selection, SVM and RF with RFECV produced the highest binary-class accuracy of 99.8%, and XGBoost produced the highest five-class accuracy of 82.56%. Hence, we believe the multiclass classification work is very important for identifying the stages of the disease and suggesting the treatments patients need in order to save their lives.

Future works

This study used supervised machine-learning algorithms with feature selection methods to select the best feature subsets for model development. It would be worthwhile to compare these results against unsupervised or deep learning approaches. Since the proposed model supports experts in making fast decisions, a further step would be a mobile-based system that enables experts to follow patients' status and allows patients to check their own status.

Availability of data and materials

The data for this research were collected from the history records of patients who attended treatment or died during the period 2018 to 2019 at St. Paulo's Hospital. There are no restrictions on the availability of the data, and the authors are willing to provide the code as well.

Radhakrishnan J, Mohan S. KI Reports and World Kidney Day. Kidney Int Reports. 2017;2(2):125–6.


George C, Mogueo A, Okpechi I, Echouffo-Tcheugui JB, Kengne AP. Chronic kidney disease in low-income to middle-income countries: The case for increased screening. BMJ Glob Heal. 2017;2(2):1–10.


Ethiopia: kidney disease. https://www.worldlifeexpectancy.com/ethiopia-kidney-disease . Accessed 07 Feb 2020.

Stanifer JW, et al. The epidemiology of chronic kidney disease in sub-Saharan Africa: A systematic review and meta-analysis. Lancet Glob Heal. 2014;2(3):e174–81.

AbdElhafeez S, Bolignano D, D’Arrigo G, Dounousi E, Tripepi G, Zoccali C. Prevalence and burden of chronic kidney disease among the general population and high-risk groups in Africa: A systematic review. BMJ Open. 2018;8:1.

Molla MD, et al. Assessment of serum electrolytes and kidney function test for screening of chronic kidney disease among Ethiopian Public Health Institute staff members, Addis Ababa, Ethiopia. BMC Nephrol. 2020;21(1):494.

Agrawal A, Agrawal H, Mittal S, Sharma M. Disease Prediction Using Machine Learning. SSRN Electron J. 2018;5:6937–8.

Charleonnan A, Fufaung T, Niyomwong T, Chokchueypattanakit W, Suwannawach S, Ninchawee N. Predictive analytics for chronic kidney disease using machine learning techniques. Manag Innov Technol Int Conf MITiCON. 2016;80–83:2017.

Salekin A, Stankovic J. Detection of Chronic Kidney Disease and Selecting Important Predictive Attributes. In: Proc. - 2016 IEEE Int. Conf. Healthc. Informatics, ICHI 2016, pp. 262–270, 2016.

Tekale S, Shingavi P, Wandhekar S, Chatorikar A. Prediction of chronic kidney disease using machine learning algorithm. Disease. 2018;7(10):92–6.

Xiao J, et al. Comparison and development of machine learning tools in the prediction of chronic kidney disease progression. J Transl Med. 2019;17(1):1–13.

Priyanka K, Science BC. Chronic kidney disease prediction based on naive Bayes technique. 2019. p. 1653–9.

Almasoud M, Ward TE. Detection of chronic kidney disease using machine learning algorithms with least number of predictors. Int J Adv Computer. 2019;10(8):89–96.

Yashfi SY. Risk Prediction Of Chronic Kidney Disease Using Machine Learning Algorithms. 2020.

Rady EA, Anwar AS. Informatics in Medicine Unlocked Prediction of kidney disease stages using data mining algorithms. Informatics Med. 2019;15(2018):100178.

Alsuhibany SA, et al. Ensemble of deep learning based clinical decision support system for chronic kidney disease diagnosis in medical internet of things environment. Comput Intell Neurosci. 2021;3:2021.

Poonia RC, et al. Intelligent Diagnostic Prediction and Classification Models for Detection of Kidney Disease. Healthcare. 2022;10:2.

Kumar V. Evaluation of computationally intelligent techniques for breast cancer diagnosis. Neural Comput Appl. 2021;33(8):3195–208.

Jasim A, Kaky M. Intelligent systems approach for classification and management of by. 2017.

Saar-Tsechansky M, Provost F. Handling missing values when applying classification models. vol. 1, 2007.

Data Preparation for Statistical Modeling and Machine Learning. https://www.featureranking.com/tutorials/machine-learning-tutorials/data-preparation-for-machine-learning/ . Accessed 12 Oct 2020.

Oliver T. Machine Learning For Absolute Beginners. 2017.

ZarPhyu T, Oo NN. Performance comparison of feature selection methods. MATEC Web Conf. 2016;42:2–5.

Koshy S. Feature selection for improving multi-label classification using MEKA. Res J. 2017;12(24):14774–82.

Vidhya A. Introduction to Feature Selection methods with an example (or how to select the right variables?). https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/ . Accessed 24 Mar 2020.

Misra P, Yadav AS. Improving the classification accuracy using recursive feature elimination with cross-validation. Int J Emerg Technol. 2020;11(3):659–65.

Aqlan F, Markle R, Shamsan A. Data mining for chronic kidney disease prediction. 67th Annu Conf Expo Inst Ind Eng. 2017;2017:1789–94.

Subas A, Alickovic E, Kevric J. Diagnosis of chronic kidney disease by using random forest. IFMBE Proc. 2017;62(1):589–94.

Kapoor S, Verma R, Panda SN. Detecting kidney disease using Naïve bayes and decision tree in machine learning. Int J Innov Technol Explor Eng. 2019;9(1):498–501.

Vijayarani S, Dhayanand S. Data Mining Classification Algorithms for Kidney Disease Prediction. Int J Cybern Informatics. 2015;4(4):13–25.

Drall S, Drall GS, Singh S. Chronic kidney disease prediction using machine learning : a new approach bharat Bhushan Naib. Learn. 2014;8(278):278–87.

KadamVinay R, Soujanya KLS, Singh P. Disease prediction by using deep learning based on patient treatment history. Int J Recent Technol Eng. 2019;7(6):745–54.

Ramya S, Radha N. Diagnosis of Chronic Kidney Disease Using. pp. 812–820, 2016.

Osisanwo FY, Akinsola JET, Awodele O, Hinmikaiye JO, Olakanmi O, Akinjobi J. Supervised machine learning algorithms: classification and comparison. Int J Comput Trends Technol. 2017;48(3):128–38.

Acharya A. Comparative Study of Machine Learning Algorithms for Heart Disease Prediction 2017.

Amirgaliyev Y. Analysis of chronic kidney disease dataset by applying machine learning methods. In: 2018 IEEE 12th International Conference Application Information Communication Technology, pp. 1–4, 2010.


Acknowledgements

Not applicable.

Author information

Authors and affiliations.

Department of Information Science, College of Computing, Madda Walabu University, Robe, Ethiopia

Dibaba Adeba Debal

Department of Computer Science and Engineering, School of Electrical Engineering and Computing, Adama Science and Technology University, Adama, Ethiopia

Tilahun Melak Sitote


Contributions

The development of the basic research questions, identifying the problems and selecting appropriate machine learning algorithms, data collection, data analysis, interpretation, and critical review of the paper have been done by DA. The edition of the overall progress of the work was supported and guided by TM. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Dibaba Adeba Debal .

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Debal, D.A., Sitote, T.M. Chronic kidney disease prediction using machine learning techniques. J Big Data 9 , 109 (2022). https://doi.org/10.1186/s40537-022-00657-5


Received : 31 January 2022

Accepted : 24 October 2022

Published : 20 November 2022



  • Chronic Kidney Disease (CKD)
  • Machine Learning
  • Random Forest (RF)
  • Support Vector Machine (SVM)


  • Research article
  • Open access
  • Published: 21 December 2019

Comparing different supervised machine learning algorithms for disease prediction

  • Shahadat Uddin   ORCID: orcid.org/0000-0003-0091-6919 1 ,
  • Arif Khan 1 , 2 ,
  • Md Ekramul Hossain 1 &
  • Mohammad Ali Moni 3  

BMC Medical Informatics and Decision Making volume  19 , Article number:  281 ( 2019 ) Cite this article


Supervised machine learning algorithms have been a dominant method in the data mining field. Disease prediction using health data has recently shown a potential application area for these methods. This study aims to identify the key trends among different types of supervised machine learning algorithms, and their performance and usage for disease risk prediction.

In this study, extensive research efforts were made to identify those studies that applied more than one supervised machine learning algorithm to a single disease prediction task. Two databases (Scopus and PubMed) were searched using different types of search terms. In total, we selected 48 articles for the comparison among variants of supervised machine learning algorithms for disease prediction.

We found that the Support Vector Machine (SVM) algorithm is applied most frequently (in 29 studies), followed by the Naïve Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed superior accuracy comparatively: of the 17 studies where it was applied, RF showed the highest accuracy in 9 of them, i.e., 53%. This was followed by SVM, which performed best in 41% of the studies in which it was considered.

This study provides a wide overview of the relative performance of different variants of supervised machine learning algorithms for disease prediction. This important information of relative performance can be used to aid researchers in the selection of an appropriate supervised machine learning algorithm for their studies.


Machine learning algorithms employ a variety of statistical, probabilistic and optimisation methods to learn from past experience and detect useful patterns from large, unstructured and complex datasets [ 1 ]. These algorithms have a wide range of applications, including automated text categorisation [ 2 ], network intrusion detection [ 3 ], junk e-mail filtering [ 4 ], detection of credit card fraud [ 5 ], customer purchase behaviour detection [ 6 ], optimising manufacturing process [ 7 ] and disease modelling [ 8 ]. Most of these applications have been implemented using supervised variants [ 4 , 5 , 8 ] of the machine learning algorithms rather than unsupervised ones. In the supervised variant, a prediction model is developed by learning a dataset where the label is known and accordingly the outcome of unlabelled examples can be predicted [ 9 ].

The scope of this research is primarily the performance analysis of disease prediction approaches using different variants of supervised machine learning algorithms. Disease prediction and, in a broader context, medical informatics have gained significant attention from the data science research community in recent years. This is primarily due to the wide adoption of computer-based technology in the health sector in different forms (e.g., electronic health records and administrative data) and the subsequent availability of large health databases for researchers. These electronic data are being utilised in a wide range of healthcare research areas such as the analysis of healthcare utilisation [10], measuring the performance of a hospital care network [11], exploring patterns and cost of care [12], developing disease risk prediction models [13, 14], chronic disease surveillance [15], and comparing disease prevalence and drug outcomes [16]. Our research focuses on disease risk prediction models involving machine learning algorithms (e.g., support vector machine, logistic regression and artificial neural network), specifically supervised learning algorithms. Models based on these algorithms use labelled training data of patients for training [8, 17, 18]. For the test set, patients are classified into several groups such as low risk and high risk.

Given the growing applicability and effectiveness of supervised machine learning algorithms on predictive disease modelling, the breadth of research still seems progressing. Specifically, we found little research that makes a comprehensive review of published articles employing different supervised learning algorithms for disease prediction. Therefore, this research aims to identify key trends among different types of supervised machine learning algorithms, their performance accuracies and the types of diseases being studied. In addition, the advantages and limitations of different supervised machine learning algorithms are summarised. The results of this study will help the scholars to better understand current trends and hotspots of disease prediction models using supervised machine learning algorithms and formulate their research goals accordingly.

In making comparisons among different supervised machine learning algorithms, this study reviewed, by following the PRISMA guidelines [ 19 ], existing studies from the literature that used such algorithms for disease prediction. More specifically, this article considered only those studies that used more than one supervised machine learning algorithm for a single disease prediction in the same research setting. This made the principal contribution of this study (i.e., comparison among different supervised machine learning algorithms) more accurate and comprehensive since the comparison of the performance of a single algorithm across different study settings can be biased and generate erroneous results [ 20 ].

Traditionally, standard statistical methods and doctor’s intuition, knowledge and experience had been used for prognosis and disease risk prediction. This practice often leads to unwanted biases, errors and high expenses, and negatively affects the quality of service provided to patients [ 21 ]. With the increasing availability of electronic health data, more robust and advanced computational approaches such as machine learning have become more practical to apply and explore in disease prediction area. In the literature, most of the related studies utilised one or more machine learning algorithms for a particular disease prediction. For this reason, the performance comparison of different supervised machine learning algorithms for disease prediction is the primary focus of this study.

In the following sections, we discuss different variants of supervised machine learning algorithm, followed by presenting the methods of this study. In the subsequent sections, we present the results and discussion of the study.

  • Supervised machine learning algorithm

At its most basic, machine learning uses programmed algorithms that learn and optimise their operations by analysing input data, to make predictions within an acceptable range. As new data are fed in, these algorithms tend to make more accurate predictions. Although there are some variations in how machine learning algorithms are grouped, they can be divided into three broad categories according to their purposes and the way the underlying machine is being taught: supervised, unsupervised and semi-supervised.

In supervised machine learning algorithms, a labelled training dataset is used first to train the underlying algorithm. This trained algorithm is then fed the unlabelled test dataset to categorise its items into similar groups. Using an abstract dataset for three diabetic patients, Fig.  1 shows an illustration of how supervised machine learning algorithms work to categorise diabetic and non-diabetic patients. Supervised learning algorithms are well suited to two types of problems: classification problems and regression problems. In classification problems, the underlying output variable is discrete and is categorised into different groups or categories, such as ‘red’ or ‘black’, or ‘diabetic’ and ‘non-diabetic’. In regression problems, the corresponding output variable is a real value, such as the risk of developing cardiovascular disease for an individual. In the following subsections, we briefly describe the commonly used supervised machine learning algorithms for disease prediction.

figure 1

An illustration of how supervised machine learning algorithms work to categorise diabetic and non-diabetic patients based on abstract data

Logistic regression

Logistic regression (LR) is a powerful and well-established method for supervised classification [ 22 ]. It can be considered as an extension of ordinary regression and can model only a dichotomous variable which usually represents the occurrence or non-occurrence of an event. LR helps in finding the probability that a new instance belongs to a certain class. Since it is a probability, the outcome lies between 0 and 1. Therefore, to use the LR as a binary classifier, a threshold needs to be assigned to differentiate two classes. For example, a probability value higher than 0.50 for an input instance will classify it as ‘class A’; otherwise, ‘class B’. The LR model can be generalised to model a categorical variable with more than two values. This generalised version of LR is known as the multinomial logistic regression.
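The thresholding step described above can be sketched with scikit-learn on synthetic data; the 0.5 cutoff is the conventional default, and the dataset and model settings are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-labelled data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]   # P(class 1) for each sample
labels = (proba > 0.5).astype(int)     # apply the 0.5 threshold

# predict() uses the same default threshold, so the two agree.
print((labels == model.predict(X)).all())
```

Raising the threshold above 0.5 trades recall for precision on class 1, which is how the cutoff would be tuned for an application where false positives are costly.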

Support vector machine

Support vector machine (SVM) algorithm can classify both linear and non-linear data. It first maps each data item into an n-dimensional feature space where n is the number of features. It then identifies the hyperplane that separates the data items into two classes while maximising the marginal distance for both classes and minimising the classification errors [ 23 ]. The marginal distance for a class is the distance between the decision hyperplane and its nearest instance which is a member of that class. More formally, each data point is plotted first as a point in an n-dimension space (where n is the number of features) with the value of each feature being the value of a specific coordinate. To perform the classification, we then need to find the hyperplane that differentiates the two classes by the maximum margin. Figure  2 provides a simplified illustration of an SVM classifier.
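A small scikit-learn illustration of the maximum-margin idea on two synthetic clusters; the support vectors exposed below are the points nearest the hyperplane that pin down its position (data and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# A linear-kernel SVM finds the maximum-margin hyperplane; only the
# support vectors determine where it lies.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", clf.n_support_)
```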

figure 2

A simplified illustration of how the support vector machine works. The SVM has identified a hyperplane (actually a line) which maximises the separation between the ‘star’ and ‘circle’ classes

Decision tree

Decision tree (DT) is one of the earliest and most prominent machine learning algorithms. A decision tree models the decision logic, i.e., tests and corresponding outcomes, for classifying data items into a tree-like structure. The nodes of a DT normally have multiple levels, where the first or top-most node is called the root node. All internal nodes (i.e., nodes having at least one child) represent tests on input variables or attributes. Depending on the test outcome, the classification algorithm branches towards the appropriate child node, where the process of testing and branching repeats until it reaches a leaf node [ 24 ]. The leaf or terminal nodes correspond to the decision outcomes. DTs have been found easy to interpret and quick to learn, and are a common component of many medical diagnostic protocols [ 25 ]. When traversing the tree for the classification of a sample, the outcomes of all tests at each node along the path provide sufficient information to conjecture about its class. An illustration of a DT with its elements and rules is depicted in Fig.  3 .
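Because the tests live in the tree's nodes, the learned decision logic can be printed directly, which is one reason DTs are easy to interpret. A minimal scikit-learn sketch on the bundled iris data (depth limit chosen for readability):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the printed rules small and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    iris.data, iris.target)

# export_text renders the root-to-leaf tests as nested if/else rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```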

figure 3

An illustration of a Decision tree. Each variable (C1, C2, and C3) is represented by a circle and the decision outcomes (Class A and Class B) are shown by rectangles. In order to successfully classify a sample to a class, each branch is labelled with either ‘True’ or ‘False’ based on the outcome value from the test of its ancestor node

Random forest

A random forest (RF) is an ensemble classifier consisting of many DTs, similar to the way a forest is a collection of many trees [ 26 ]. DTs that are grown very deep often overfit the training data, causing high variation in the classification outcome for a small change in the input data. They are very sensitive to their training data, which makes them error-prone on the test dataset. The different DTs of an RF are trained on different parts of the training dataset. To classify a new sample, its input vector is passed down each DT of the forest. Each DT considers a different part of that input vector and gives a classification outcome. The forest then chooses the classification having the most ‘votes’ (for a discrete classification outcome) or the average over all trees in the forest (for a numeric outcome). Since the RF algorithm considers the outcomes of many different DTs, it can reduce the variance that results from considering a single DT on the same dataset. Figure  4 shows an illustration of the RF algorithm.
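The voting idea can be made concrete with scikit-learn; note that scikit-learn's `RandomForestClassifier` actually averages per-tree class probabilities (a "soft" vote), which usually coincides with the hard majority vote computed explicitly below:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each tree in the ensemble casts its own vote for the first sample...
votes = np.array([t.predict(X[:1])[0] for t in forest.estimators_])
# ...and the majority class among those votes is the "hard" vote.
majority = int(np.bincount(votes.astype(int)).argmax())
print("per-tree votes:", votes, "-> majority:", majority)
```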

figure 4

An illustration of a Random forest which consists of three different decision trees. Each of those three decision trees was trained using a random subset of the training data

Naïve Bayes

Naïve Bayes (NB) is a classification technique based on Bayes’ theorem [ 27 ]. This theorem can describe the probability of an event based on prior knowledge of conditions related to that event. The classifier assumes that a particular feature in a class is not directly related to any other feature, although features for that class could have interdependence among themselves [ 28 ]. Considering the task of classifying a new object (white circle) into either the ‘green’ class or the ‘red’ class, Fig.  5 provides an illustration of how the NB technique works. According to this figure, it is reasonable to believe that any new object is twice as likely to have ‘green’ membership as ‘red’, since there are twice as many ‘green’ objects (40) as ‘red’ (20). In the Bayesian analysis, this belief is known as the prior probability. Therefore, the prior probabilities of ‘green’ and ‘red’ are 0.67 (40 ÷ 60) and 0.33 (20 ÷ 60), respectively. Now to classify the ‘white’ object, we draw a circle around it which encompasses several points (the number chosen beforehand), irrespective of their class labels. Four points (three ‘red’ and one ‘green’) were considered in this figure. Thus, the likelihood of ‘white’ given ‘green’ is 0.025 (1 ÷ 40) and the likelihood of ‘white’ given ‘red’ is 0.15 (3 ÷ 20). Although the prior probability indicates that the new ‘white’ object is more likely to have ‘green’ membership, the likelihood shows that it is more likely to be in the ‘red’ class. In the Bayesian analysis, the final classification is produced by combining both sources of information: multiplying the prior probability by the likelihood gives the ‘posterior’ probability. The posterior probability of ‘white’ being ‘green’ is 0.017 (0.67 × 0.025) and the posterior probability of ‘white’ being ‘red’ is 0.049 (0.33 × 0.15). Thus, the new ‘white’ object should be classified as a member of the ‘red’ class according to the NB technique.
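The arithmetic of the worked example above can be checked directly (the numbers come straight from the figure: 40 green and 20 red objects, with one green and three red points inside the circle):

```python
# Priors from the class counts: 40 green and 20 red out of 60.
prior_green, prior_red = 40 / 60, 20 / 60   # ≈ 0.67 and 0.33

# Likelihoods from the points inside the circle: 1 of 40 greens,
# 3 of 20 reds.
like_green, like_red = 1 / 40, 3 / 20       # 0.025 and 0.15

# Posterior ∝ prior × likelihood.
post_green = prior_green * like_green       # ≈ 0.017
post_red = prior_red * like_red             # ≈ 0.049

# The likelihood outweighs the prior, so 'white' is classified 'red'.
print("red" if post_red > post_green else "green")
```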

figure 5

An illustration of the Naïve Bayes algorithm. The ‘white’ circle is the new sample instance which needs to be classified either to ‘red’ class or ‘green’ class

K-nearest neighbour

The K-nearest neighbour (KNN) algorithm is one of the simplest and earliest classification algorithms [ 29 ]. It can be thought of as a simpler version of an NB classifier: unlike the NB technique, the KNN algorithm does not require probability values. The ‘ K ’ in the KNN algorithm is the number of nearest neighbours considered to take a ‘vote’ from. Different values of ‘ K ’ can generate different classification results for the same sample object. Figure  6 shows an illustration of how the KNN works to classify a new object. For K = 3 , the new object (star) is classified as ‘black’; however, it is classified as ‘red’ when K = 5 .
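A tiny hand-made dataset reproduces the K-sensitivity described above: two class-0 points sit near the query and three class-1 points further away, so the prediction flips between K = 3 and K = 5 (scikit-learn sketch; the coordinates are invented):

```python
from sklearn.neighbors import KNeighborsClassifier

# Two class-0 points near the query, three class-1 points further away.
X = [[0.0, 0.1], [0.1, 0.0], [0.9, 0.9], [1.0, 0.8], [0.8, 1.0]]
y = [0, 0, 1, 1, 1]
query = [[0.05, 0.05]]

# K = 3: two of the three nearest neighbours are class 0 -> predicts 0.
pred3 = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(query)[0]
# K = 5: all five points vote, three are class 1 -> predicts 1.
pred5 = KNeighborsClassifier(n_neighbors=5).fit(X, y).predict(query)[0]
print(pred3, pred5)
```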

figure 6

A simplified illustration of the K-nearest neighbour algorithm. When K = 3, the sample object (‘star’) is classified as ‘black’ since it gets more ‘vote’ from the ‘black’ class. However, for K = 5 the same sample object is classified as ‘red’ since it now gets more ‘vote’ from the ‘red’ class

Artificial neural network

Artificial neural networks (ANNs) are a set of machine learning algorithms inspired by the functioning of the neural networks of the human brain. They were first proposed by McCulloch and Pitts [ 30 ] and later popularised by the works of Rumelhart et al. in the 1980s [ 31 ]. In the biological brain, neurons are connected to each other through multiple axon junctions, forming a graph-like architecture. These interconnections can be rewired (e.g., through neuroplasticity), which helps the brain to adapt, process and store information. Likewise, ANN algorithms can be represented as an interconnected group of nodes, where the output of one node goes as input to another node for subsequent processing according to the interconnection. Nodes are normally grouped into layers depending on the transformation they perform. Apart from the input and output layers, there can be one or more hidden layers in an ANN framework. Nodes and edges have weights that adjust the strength of the signals communicated between them; these can be amplified or weakened through repeated training. Based on training and the subsequent adaptation of the node and edge weights, ANNs can make predictions for the test data.
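
A forward pass through such a network can be sketched as follows. The layer sizes and random weights below are arbitrary placeholders for an untrained network; training (e.g., by back-propagation) would adjust the weights to fit the data:

```python
import math
import random

random.seed(0)  # reproducible placeholder weights

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum per node, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_out, n_in):
    """Untrained placeholder weights; training would tune these values."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

# Shape: 3 inputs -> two hidden layers of 4 nodes -> 1 output
# (the node counts are arbitrary choices for this sketch).
w1, b1 = random_layer(4, 3)
w2, b2 = random_layer(4, 4)
w3, b3 = random_layer(1, 4)

x = [0.5, -1.2, 3.0]     # one input sample
h1 = dense(x, w1, b1)    # first hidden layer
h2 = dense(h1, w2, b2)   # second hidden layer
y = dense(h2, w3, b3)    # output, a single value in (0, 1)
print(y[0])
```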

figure 7

An illustration of the artificial neural network structure with two hidden layers. The arrows connect the output of nodes from one layer to the input of nodes of another layer

Data source and data extraction

Extensive research efforts were made to identify articles employing more than one supervised machine learning algorithm for disease prediction. Two databases were searched (October 2018): Scopus and PubMed. Scopus is an online bibliometric database developed by Elsevier. It was chosen because of its high level of accuracy and consistency [ 32 ]. PubMed is a free publication search engine that incorporates citation information mostly for biomedical and life science literature. It comprises more than 28 million citations from MEDLINE, life science journals and online books [ 33 ]. MEDLINE is a bibliographic database covering articles from academic journals in medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care [ 33 ].

A comprehensive search strategy was followed to find all related articles. The search terms used in this strategy were:

“disease prediction” AND “machine learning”;

“disease prediction” AND “data mining”;

“disease risk prediction” AND “machine learning”; and

“disease risk prediction” AND “data mining”.

In the scientific literature, the generic term “machine learning” is often used for both “supervised” and “unsupervised” machine learning algorithms. There is also a close relationship between the terms “machine learning” and “data mining”, with the latter commonly used in place of the former [ 34 ]. For these reasons, we used both “machine learning” and “data mining” in the search terms, although the focus of this study is on supervised machine learning algorithms. The four search terms were then used to launch searches on the titles, abstracts and keywords of articles in both Scopus and PubMed. This resulted in 305 and 83 articles from Scopus and PubMed, respectively. After combining these two lists and removing the articles written in languages other than English, we found 336 unique articles.

Since the aim of this study was to compare the performance of different supervised machine learning algorithms, the next step was to select, from these 336 articles, those that used more than one supervised machine learning algorithm for disease prediction. For this purpose, we wrote a computer program in the Python programming language [ 35 ] that checked for the presence of the names of more than one supervised machine learning algorithm in the title, abstract and keyword list of each of the 336 articles. It found 55 articles that used more than one supervised machine learning algorithm for the prediction of different diseases. Of the remaining 281 articles, only 155 used one of the seven supervised machine learning algorithms considered in this study. The remaining 126 used either other machine learning algorithms (e.g., unsupervised or semi-supervised) or data mining methods other than machine learning. ANN was found most frequently (30.32%) in the 155 articles, followed by Naïve Bayes (19.35%).
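
A simplified sketch of such a screening script is shown below. The article record and the naive lowercase substring matching are hypothetical illustrations; the paper does not reproduce its actual program or its exact name list:

```python
# Names of the seven algorithms to look for (matching here is a naive
# lowercase substring test; a real script would handle spelling variants).
ALGORITHM_NAMES = [
    'logistic regression', 'support vector machine', 'decision tree',
    'random forest', 'naive bayes', 'k-nearest neighbour',
    'artificial neural network',
]

def algorithms_mentioned(text):
    """Return the set of algorithm names found in a piece of text."""
    lowered = text.lower()
    return {name for name in ALGORITHM_NAMES if name in lowered}

def uses_multiple_algorithms(article):
    """True if the title, abstract or keywords mention more than one."""
    combined = ' '.join([article['title'], article['abstract'],
                         ' '.join(article['keywords'])])
    return len(algorithms_mentioned(combined)) > 1

# A hypothetical article record.
article = {
    'title': 'Predicting disease X with random forest',
    'abstract': 'We compare random forest and support vector machine models.',
    'keywords': ['machine learning'],
}
print(uses_multiple_algorithms(article))  # True
```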

The next step was the manual inspection of all recovered articles. We noticed that four groups of authors reported their study results in two publication outlets (i.e., book chapter, conference and journal) using the same or different titles. For each of these four cases, we considered only the most recent publication. We further excluded another three articles since the reported prediction accuracies for all supervised machine learning algorithms used in those articles were the same. For each of the remaining 48 articles, the performance outcomes of the supervised machine learning algorithms used for disease prediction were gathered. Two diseases were predicted in one article [ 17 ], and two algorithms (out of the five used for prediction analysis) were found to show the best accuracy for a disease in another article [ 36 ]. The number of publications per year is depicted in Fig.  8 . The overall data collection procedure, along with the number of articles selected for different diseases, is shown in Fig.  9 .

figure 8

Number of articles published in different years

figure 9

The overall data collection procedure. It also shows the number of articles considered for each disease

Figure  10 compares the composition of the initially selected 329 articles with respect to the seven supervised machine learning algorithms considered in this study. ANN shows the highest percentage difference (16%) between the 48 articles selected for this study and the initially selected 155 articles that used only one supervised machine learning algorithm for disease prediction, followed by LR. The remaining five supervised machine learning algorithms show a percentage difference between 1% and 5%.

figure 10

Composition of initially selected 329 articles with respect to the seven supervised learning algorithms

Classifier performance index

The diagnostic ability of classifiers is usually determined by the confusion matrix and the receiver operating characteristic (ROC) curve [ 37 ]. In the machine learning research domain, the confusion matrix is also known as the error or contingency matrix. The basic framework of the confusion matrix is provided in Fig.  11 a. In this framework, true positives (TP) are the positive cases that the classifier correctly identified as positive. Similarly, true negatives (TN) are the negative cases that the classifier correctly identified as negative. False positives (FP) are the negative cases that the classifier incorrectly identified as positive, and false negatives (FN) are the positive cases that the classifier incorrectly identified as negative. Measures based on these four counts are commonly used to analyse the performance of classifiers, including those based on supervised machine learning algorithms.
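
For instance, accuracy, precision, recall and the F1-score can all be computed directly from the four confusion-matrix counts (the counts below are illustrative, not taken from any reviewed study):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Common measures derived from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)    # also called sensitivity or true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only.
acc, prec, rec, f1 = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print(acc, round(prec, 3), rec, round(f1, 3))  # 0.85 0.889 0.8 0.842
```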

figure 11

a The basic framework of the confusion matrix; and ( b ) A presentation of the ROC curve

An ROC curve is one of the fundamental tools for diagnostic test evaluation and is created by plotting the true positive rate against the false positive rate at various threshold settings [ 37 ]. The area under the ROC curve (AUC) is also commonly used to determine the predictability of a classifier: a higher AUC value indicates a better classifier, and vice versa. Figure  11 b presents three ROC curves based on an abstract dataset. The area under the blue ROC curve is half of the shaded rectangle, so the AUC value for the blue ROC curve is 0.5. Because it covers a larger area, the AUC value for the red ROC curve is higher than that of the black ROC curve. Hence, the classifier that produced the red ROC curve shows higher predictive accuracy than the other two classifiers, which generated the blue and black ROC curves.
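
The AUC can be approximated from a finite set of ROC points with the trapezoidal rule. The sketch below uses made-up ROC points, and reproduces the 0.5 value of the diagonal ‘chance’ line:

```python
def auc_trapezoid(fpr, tpr):
    """Approximate the area under an ROC curve from sorted (FPR, TPR) points."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# The diagonal 'chance' line has AUC 0.5.
print(auc_trapezoid([0.0, 1.0], [0.0, 1.0]))  # 0.5

# A curve bowing toward the top-left corner covers more area.
print(auc_trapezoid([0.0, 0.1, 0.4, 1.0], [0.0, 0.6, 0.9, 1.0]))  # ~0.825
```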

A few other measures are also used to assess the performance of different classifiers. One such measure is the root mean square error (RMSE). For a set of actual and predicted value pairs, the RMSE is the square root of the mean of the squared errors, where an error is the difference between an actual value and its corresponding predicted value. Another such measure is the mean absolute error (MAE), which is the mean of the absolute values of these errors.
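
Both measures can be computed as follows (the actual and predicted values are illustrative):

```python
import math

def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute errors."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

actual = [3.0, -0.5, 2.0, 7.0]            # illustrative values only
predicted = [2.5, 0.0, 2.0, 8.0]
print(round(rmse(actual, predicted), 3))  # 0.612
print(mae(actual, predicted))             # 0.5
```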

The final dataset contained 48 articles, each of which implemented more than one supervised machine learning algorithm for the prediction of a single disease. All implemented algorithms, as well as the more frequently used performance measures, were discussed in the methods section. Based on these, we reviewed the finally selected 48 articles in terms of the methods used, the performance measures reported and the diseases they targeted.

Table  1 lists the names and references of the diseases and the corresponding supervised machine learning algorithms used to predict them. For each disease model, the better-performing algorithm is also indicated. This study considered 48 articles, which in total made predictions for 49 diseases or conditions (one article predicted two diseases [ 17 ]). For these 49 diseases, 50 algorithms were found to show superior accuracy, since one disease had two algorithms (out of the five used for prediction analysis) that showed the same highest accuracy [ 36 ]. The advantages and limitations of the different supervised machine learning algorithms are shown in Table  2 .

A comparison of the usage frequency and accuracy of the different supervised learning algorithms is shown in Table  3 . SVM was used most frequently (for 29 of the 49 diseases that were predicted), followed by NB, which was used in 23 articles. Although RF was used the second-fewest times, it showed the highest percentage (53%) of cases with superior accuracy, followed by SVM (41%).

Table  4 compares the performance of different supervised machine learning algorithms for the most frequently modelled diseases. SVM most often showed the superior accuracy for three diseases (heart disease, diabetes and Parkinson’s disease). For breast cancer, ANN most often showed the superior accuracy.

A close investigation of Table 1 reveals an interesting result regarding the performance of different supervised learning algorithms, which is also reported in Table 4 . Considering only the articles that used clinical and demographic data (15 articles), DT showed the superior result most often (6 times). Interestingly, SVM showed the superior result only once among these articles, although it most often showed the superior accuracy for heart disease, diabetes and Parkinson’s disease (Table 4 ). In the other 33 articles, which used research data other than the ‘clinical and demographic’ type, SVM and RF showed the superior accuracy most often (12 times) and second most often (7 times), respectively. In articles where 10-fold and 5-fold validation methods were used, SVM showed the superior accuracy most often (5 and 3 times, respectively). On the other hand, in articles where no validation method was used, ANN most often showed the superior accuracy. Figure  12 further illustrates the superior performance of SVM. Performance statistics from Table 4 have been used in a normalised way to draw these two graphs. Figure  12 a presents the ROC graph for the four diseases (heart disease, diabetes, breast cancer and Parkinson’s disease) under the ‘ disease names that were modelled ’ criterion, and Fig.  12 b presents the ROC graph based on the ‘ validation method followed ’ criterion.

figure 12

Illustration of the superior performance of the Support vector machine using ROC graphs (based on the data from Table 4 ) – ( a ) for disease names that were modelled; and ( b ) for validation methods that were followed

To avoid the risk of selection bias, we extracted from the literature only those articles that used more than one supervised machine learning algorithm. The same supervised learning algorithm can generate different results across different study settings, so a performance comparison between two supervised learning algorithms could produce imprecise results if they were employed in separate studies. On the other hand, the results of this study could suffer from a variable selection bias arising from the individual articles considered, which used different variables or measures for disease prediction. We noticed that the authors of these articles did not consider all available variables from the corresponding research datasets. The inclusion of a new variable could improve the accuracy of an underperforming algorithm considered in the underlying study, and vice versa. This is one limitation of this study. Another limitation is that we considered a broader-level classification of supervised machine learning algorithms to compare them for disease prediction; we did not consider any sub-classifications or variants of the algorithms. For example, we did not make any performance comparison between least-square and sparse SVMs; instead, we considered both under the SVM algorithm. A third limitation is that we did not consider the hyperparameters chosen in the different articles when comparing multiple supervised machine learning algorithms. It has been argued that the same machine learning algorithm can generate different accuracy results for the same dataset with the selection of different values for the underlying hyperparameters [ 81 , 82 ]. For instance, the selection of different kernels for support vector machines can result in a variation in accuracy outcomes for the same dataset. Similarly, a random forest could generate different results with changes in the number of decision trees within the underlying forest or in how nodes are split.

This research studied the comparative performance of different supervised machine learning algorithms in disease prediction. Since clinical data and research scope vary widely between disease prediction studies, a comparison is only possible when a common benchmark on the dataset and scope is established. Therefore, for comparison we chose only studies that implemented multiple machine learning methods on the same data for the same disease prediction task. Regardless of the variations in frequency and performance, the results show the potential of these families of algorithms in disease prediction.

Availability of data and materials

The data used in this study can be extracted from online databases. The detail of this extraction has been described within the manuscript.

Abbreviations

AUC: Area under the ROC curve

DT: Decision Tree

FN: False negative

FP: False positive

MAE: Mean absolute error

RMSE: Root mean square error

ROC: Receiver operating characteristic

TN: True negative

TP: True positive

Mitchell TM. Machine learning. Boston, MA: WCB/McGraw-Hill; 1997.

Sebastiani F. Machine learning in automated text categorization. ACM Comput Surveys (CSUR). 2002;34(1):1–47.

Sinclair C, Pierce L, Matzner S. An application of machine learning to network intrusion detection. In: Computer Security Applications Conference, 1999. (ACSAC’99) Proceedings. 15th Annual; 1999. p. 371–7. IEEE.

Sahami M, Dumais S, Heckerman D, Horvitz E. A Bayesian approach to filtering junk e-mail. In: Learning for Text Categorization: Papers from the 1998 workshop, vol. 62; 1998. p. 98–105. Madison, Wisconsin.

Aleskerov E, Freisleben B, Rao B. Cardwatch: A neural network based database mining system for credit card fraud detection. In: Computational Intelligence for Financial Engineering (CIFEr), 1997., Proceedings of the IEEE/IAFE 1997; 1997. p. 220–6. IEEE.

Kim E, Kim W, Lee Y. Combination of multiple classifiers for the customer's purchase behavior prediction. Decis Support Syst. 2003;34(2):167–75.

Mahadevan S, Theocharous G. “Optimizing Production Manufacturing Using Reinforcement Learning,” in FLAIRS Conference; 1998. p. 372–7.

Yao D, Yang J, Zhan X. A novel method for disease prediction: hybrid of random forest and multivariate adaptive regression splines. J Comput. 2013;8(1):170–7.

Michalski RS, Carbonell JG, Mitchell TM. Machine learning: an artificial intelligence approach. Springer Science & Business Media; 2013.

Culler SD, Parchman ML, Przybylski M. Factors related to potentially preventable hospitalizations among the elderly. Med Care. 1998;1:804–17.

Uddin MS, Hossain L. Social networks enabled coordination model for cost Management of Patient Hospital Admissions. J Healthc Qual. 2011;33(5):37–48.

Lee PP, et al. Cost of patients with primary open-angle glaucoma: a retrospective study of commercial insurance claims data. Ophthalmology. 2007;114(7):1241–7.

Davis DA, Chawla NV, Christakis NA, Barabási A-L. Time to CARE: a collaborative engine for practical disease prediction. Data Min Knowl Disc. 2010;20(3):388–415.

McCormick T, Rudin C, Madigan D. A hierarchical model for association rule mining of sequential events: an approach to automated medical symptom prediction; 2011.

Yiannakoulias N, Schopflocher D, Svenson L. Using administrative data to understand the geography of case ascertainment. Chron Dis Can. 2009;30(1):20–8.

Fisher ES, Malenka DJ, Wennberg JE, Roos NP. Technology assessment using insurance claims: example of prostatectomy. Int J Technol Assess Health Care. 1990;6(02):194–202.

Farran B, Channanath AM, Behbehani K, Thanaraj TA. Predictive models to assess risk of type 2 diabetes, hypertension and comorbidity: machine-learning algorithms and validation using national health data from Kuwait-a cohort study. BMJ Open. 2013;3(5):e002457.

Ahmad LG, Eshlaghy A, Poorebrahimi A, Ebrahimi M, Razavi A. Using three machine learning techniques for predicting breast cancer recurrence. J Health Med Inform. 2013;4(124):3.

Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Ann Intern Med. 2009;151(4):264–9.

Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30.

Palaniappan S, Awang R. Intelligent heart disease prediction system using data mining techniques. In: Computer Systems and Applications, 2008. AICCSA 2008. IEEE/ACS International Conference on; 2008. p. 108–15. IEEE.

Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. Wiley; 2013.

Joachims T. Making large-scale SVM learning practical. SFB 475: Komplexitätsreduktion Multivariaten Datenstrukturen, Univ. Dortmund, Dortmund, Tech. Rep. 1998. p. 28.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106.

Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Informat. 2006;2:59–77.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Lindley DV. Fiducial distributions and Bayes’ theorem. J Royal Stat Soc. Series B (Methodological). 1958;1:102–7.

Rish I. An empirical study of the naive Bayes classifier. In: IJCAI 2001 workshop on empirical methods in artificial intelligence, vol. 3, no. 22; 2001. p. 41–6. IBM New York.

Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21–7.

McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. 1943;5(4):115–33.

Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533.

Falagas ME, Pitsouni EI, Malietzis GA, Pappas G. Comparison of PubMed, Scopus, web of science, and Google scholar: strengths and weaknesses. FASEB J. 2008;22(2):338–42.

PubMed. (2018). https://www.ncbi.nlm.nih.gov/pubmed/ .

Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol J. 2017;15:104–16.

Pedregosa F, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.

Borah MS, Bhuyan BP, Pathak MS, Bhattacharya P. Machine learning in predicting hemoglobin variants. Int J Mach Learn Comput. 2018;8(2):140–3.

Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74.

Aneja S, Lal S. Effective asthma disease prediction using naive Bayes—Neural network fusion technique. In: International Conference on Parallel, Distributed and Grid Computing (PDGC); 2014. p. 137–40. IEEE.

Ayer T, Chhatwal J, Alagoz O, Kahn CE Jr, Woods RW, Burnside ES. Comparison of logistic regression and artificial neural network models in breast cancer risk estimation. Radiographics. 2010;30(1):13–22.

Lundin M, Lundin J, Burke H, Toikkanen S, Pylkkänen L, Joensuu H. Artificial neural networks applied to survival prediction in breast cancer. Oncology. 1999;57(4):281–6.

Delen D, Walker G, Kadam A. Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med. 2005;34(2):113–27.

Chen M, Hao Y, Hwang K, Wang L, Wang L. Disease prediction by machine learning over big data from healthcare communities. IEEE Access. 2017;5:8869–79.

Cai L, Wu H, Li D, Zhou K, Zou F. Type 2 diabetes biomarkers of human gut microbiota selected via iterative sure independent screening method. PLoS One. 2015;10(10):e0140827.

Malik S, Khadgawat R, Anand S, Gupta S. Non-invasive detection of fasting blood glucose level via electrochemical measurement of saliva. SpringerPlus. 2016;5(1):701.

Mani S, Chen Y, Elasy T, Clayton W, Denny J. Type 2 diabetes risk forecasting from EMR data using machine learning. In: AMIA annual symposium proceedings, vol. 2012; 2012. p. 606. American Medical Informatics Association.

Tapak L, Mahjub H, Hamidi O, Poorolajal J. Real-data comparison of data mining methods in prediction of diabetes in Iran. Healthc Inform Res. 2013;19(3):177–85.

Sisodia D, Sisodia DS. Prediction of diabetes using classification algorithms. Procedia Comput Sci. 2018;132:1578–85.

Yang J, Yao D, Zhan X, Zhan X. Predicting disease risks using feature selection based on random forest and support vector machine. In: International Symposium on Bioinformatics Research and Applications; 2014. p. 1–11. Springer.

Juhola M, Joutsijoki H, Penttinen K, Aalto-Setälä K. Detection of genetic cardiac diseases by Ca 2+ transient profiles using machine learning methods. Sci Rep. 2018;8(1):9355.

Long NC, Meesad P, Unger H. A highly accurate firefly based algorithm for heart disease prediction. Expert Syst Appl. 2015;42(21):8221–31.

Jin B, Che C, Liu Z, Zhang S, Yin X, Wei X. Predicting the risk of heart failure with ehr sequential data modeling. IEEE Access. 2018;6:9256–61.

Puyalnithi T, Viswanatham VM. Preliminary cardiac disease risk prediction based on medical and behavioural data set using supervised machine learning techniques. Indian J Sci Technol. 2016;9(31):1–5.

Forssen H, et al. Evaluation of Machine Learning Methods to Predict Coronary Artery Disease Using Metabolomic Data. Stud Health Technol Inform. 2017;235: IOS Press:111–5.

Tang Z-H, Liu J, Zeng F, Li Z, Yu X, Zhou L. Comparison of prediction model for cardiovascular autonomic dysfunction using artificial neural network and logistic regression analysis. PLoS One. 2013;8(8):e70571.

Toshniwal D, Goel B, Sharma H. Multistage Classification for Cardiovascular Disease Risk Prediction. In: International Conference on Big Data Analytics; 2015. p. 258–66. Springer.

Alonso DH, Wernick MN, Yang Y, Germano G, Berman DS, Slomka P. Prediction of cardiac death after adenosine myocardial perfusion SPECT based on machine learning. J Nucl Cardiol. 2018;1:1–9.

Mustaqeem A, Anwar SM, Majid M, Khan AR. Wrapper method for feature selection to classify cardiac arrhythmia. In: Engineering in Medicine and Biology Society (EMBC), 39th Annual International Conference of the IEEE; 2017. p. 3656–9. IEEE.

Mansoor H, Elgendy IY, Segal R, Bavry AA, Bian J. Risk prediction model for in-hospital mortality in women with ST-elevation myocardial infarction: a machine learning approach. Heart Lung. 2017;46(6):405–11.

Kim J, Lee J, Lee Y. Data-mining-based coronary heart disease risk prediction model using fuzzy logic and decision tree. Healthc Inform Res. 2015;21(3):167–74.

Taslimitehrani V, Dong G, Pereira NL, Panahiazar M, Pathak J. Developing EHR-driven heart failure risk prediction models using CPXR (log) with the probabilistic loss function. J Biomed Inform. 2016;60:260–9.

Anbarasi M, Anupriya E, Iyengar N. Enhanced prediction of heart disease with feature subset selection using genetic algorithm. Int J Eng Sci Technol. 2010;2(10):5370–6.

Bhatla N, Jyoti K. An analysis of heart disease prediction using different data mining techniques. Int J Eng. 2012;1(8):1–4.

Thenmozhi K, Deepika P. Heart disease prediction using classification with different decision tree techniques. Int J Eng Res Gen Sci. 2014;2(6):6–11.

Tamilarasi R, Porkodi DR. A study and analysis of disease prediction techniques in data mining for healthcare. Int J Emerg Res Manag Technoly ISSN. 2015;1:2278–9359.

Marikani T, Shyamala K. Prediction of heart disease using supervised learning algorithms. Int J Comput Appl. 2017;165(5):41–4.

Lu P, et al. Research on improved depth belief network-based prediction of cardiovascular diseases. J Healthc Eng. 2018;2018:1–9.

Khateeb N, Usman M. Efficient Heart Disease Prediction System using K-Nearest Neighbor Classification Technique. In: Proceedings of the International Conference on Big Data and Internet of Thing; 2017. p. 21–6. ACM.

Patel SB, Yadav PK, Shukla DD. Predict the diagnosis of heart disease patients using classification mining techniques. IOSR J Agri Vet Sci (IOSR-JAVS). 2013;4(2):61–4.

Venkatalakshmi B, Shivsankar M. Heart disease diagnosis using predictive data mining. Int J Innovative Res Sci Eng Technol. 2014;3(3):1873–7.

Ani R, Sasi G, Sankar UR, Deepa O. Decision support system for diagnosis and prediction of chronic renal failure using random subspace classification. In: Advances in Computing, Communications and Informatics (ICACCI), 2016 International Conference on; 2016. p. 1287–92. IEEE.

Islam MM, Wu CC, Poly TN, Yang HC, Li YC. Applications of Machine Learning in Fatty Live Disease Prediction. In: 40th Medical Informatics in Europe Conference, MIE 2018; 2018. p. 166–70. IOS Press.

Lynch CM, et al. Prediction of lung cancer patient survival via supervised machine learning classification techniques. Int J Med Inform. 2017;108:1–8.

Chen C-Y, Su C-H, Chung I-F, Pal NR. Prediction of mammalian microRNA binding sites using random forests. In: System Science and Engineering (ICSSE), 2012 International Conference on; 2012. p. 91–5. IEEE.

Eskidere Ö, Ertaş F, Hanilçi C. A comparison of regression methods for remote tracking of Parkinson’s disease progression. Expert Syst Appl. 2012;39(5):5523–8.

Chen H-L, et al. An efficient diagnosis system for detection of Parkinson’s disease using fuzzy k-nearest neighbor approach. Expert Syst Appl. 2013;40(1):263–71.

Behroozi M, Sami A. A multiple-classifier framework for Parkinson’s disease detection based on various vocal tests. Int J Telemed Appl. 2016;2016:1–9.

Hussain L, et al. Prostate cancer detection using machine learning techniques by employing combination of features extracting strategies. Cancer Biomarkers. 2018;21(2):393–413.

Zupan B, DemšAr J, Kattan MW, Beck JR, Bratko I. Machine learning for survival analysis: a case study on recurrence of prostate cancer. Artif Intell Med. 2000;20(1):59–75.

Hung C-Y, Chen W-C, Lai P-T, Lin C-H, Lee C-C. Comparing deep neural network and other machine learning algorithms for stroke prediction in a large-scale population-based electronic medical claims database. In: Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE, vol. 1; 2017. p. 3110–3. IEEE.

Atlas L, et al. A performance comparison of trained multilayer perceptrons and trained classification trees. Proc IEEE. 1990;78(10):1614–9.

Lucic M, Kurach K, Michalski M, Bousquet O, Gelly S. Are GANs created equal? a large-scale study. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems; 2018. p. 698–707. Curran Associates Inc.

Levy O, Goldberg Y, Dagan I. Improving distributional similarity with lessons learned from word embeddings. Trans Assoc Comput Linguistics. 2015;3:211–25.

Acknowledgements

Not applicable.

Funding

This study did not receive any funding.

Author information

Authors and Affiliations

Complex Systems Research Group, Faculty of Engineering, The University of Sydney, Room 524, SIT Building (J12), Darlington, NSW, 2008, Australia

Shahadat Uddin, Arif Khan & Md Ekramul Hossain

Health Market Quality Research Stream, Capital Markets CRC, Level 3, 55 Harrington Street, Sydney, NSW, Australia

Faculty of Medicine and Health, School of Medical Sciences, The University of Sydney, Camperdown, NSW, 2006, Australia

Mohammad Ali Moni

Contributions

SU: Originator of the idea, data analysis and writing. AK: Data analysis and writing. MEH: Data analysis and writing. MAM: Data analysis and critical review of the manuscript. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Shahadat Uddin .

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

The authors declare that they do not have any competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article.

Uddin, S., Khan, A., Hossain, M. et al. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak 19 , 281 (2019). https://doi.org/10.1186/s12911-019-1004-8

Received : 28 January 2019

Accepted : 11 December 2019

Published : 21 December 2019

DOI : https://doi.org/10.1186/s12911-019-1004-8


  • Machine learning
  • Medical data
  • Disease prediction

BMC Medical Informatics and Decision Making

ISSN: 1472-6947

  • Open access
  • Published: 24 February 2021

Infectious disease outbreak prediction using media articles with machine learning models

  • Juhyeon Kim 1 , 2 &
  • Insung Ahn 1 , 2  

Scientific Reports volume  11 , Article number:  4413 ( 2021 ) Cite this article

  • Computer science
  • Epidemiology
  • Infectious diseases
  • Information technology

When a newly emerging infectious disease breaks out in a country, it inflicts critical damage on both public health and the national economy. For this reason, anticipating which diseases will newly emerge, and preparing countermeasures for them, is essential. Many different types of infectious diseases are emerging and threatening global human health, so detecting the patterns of emerging infectious diseases is critical. However, because epidemics of infectious disease occur sporadically and spread rapidly, it is not easy to predict whether an infectious disease will emerge, and accumulating data on a specific infectious disease is difficult. For these reasons, useful data must be found and a prediction model built on them. The Internet press releases numerous articles every day that rapidly reflect currently pending issues. Thus, in this research, we accumulated Internet articles from Medisys that were related to infectious disease, to see whether news data could be used to predict infectious disease outbreaks. Articles related to infectious disease from January to December 2019 were collected, and we evaluated whether newly emerging infectious diseases could be detected using the news article data. Support Vector Machine (SVM), Semi-supervised Learning (SSL), and Deep Neural Network (DNN) models were used for prediction, to examine the information embedded in the web articles and to detect the patterns of emerging infectious diseases.


Introduction

The spread of Middle East respiratory syndrome (MERS) in 2015 caused 185 confirmed cases and 36 deaths 1 . The first outbreak of MERS in the Republic of Korea (Korea) occurred in May 2015, after a 68-year-old man returned from a business trip to several Middle East countries. As Korea could not predict that MERS might cross its border, MERS not only threatened public health, but also caused huge economic losses in many sectors, including the tourist industry and social activity. Such a situation indicates that judging in advance whether an infectious disease will flow in from other countries is important to minimize the ensuing damage. MERS was first reported in September 2012 from Saudi Arabia, and was reported from several European countries before it occurred in Korea during 2015 1 . As MERS was not a commonly known disease in Korea, there was indifference to it before it occurred. However, if it had been possible to predict that MERS might flow into Korea while it was spreading around the world, Korea could have prepared for the outbreak of MERS and minimized the damage it caused. On the other hand, while MERS was spreading through several continents, Ebola spread through 5 different countries in Western Africa, infecting more than 6,500 people and killing more than 3,000 2 . Even though Ebola outbreaks had occurred a few times on the African continent, the 2014 pandemic was the biggest one 3 . The 2014 Ebola pandemic in Western Africa showed a fatality rate of over 50%. However, unlike MERS, Ebola did not spread to other continents.

Many different infectious diseases threaten lives worldwide. Some diseases, like MERS, cause pandemics, spreading from country to country across continents, while others, like Ebola, do not spread across continents, but circulate in only a few countries. As infectious disease issues arise worldwide, much research has been conducted to estimate and predict the occurrence of infectious diseases. The authors of 4 , 5 developed infectious disease spread simulation models using mathematical models. These research efforts utilized susceptible infected recovered (SIR) models to build infectious disease spread simulations, and suggested strategies to control infectious disease and maximize the effect of vaccination based on the simulation results. Commonly, these SIR simulation models account for the population of the area the model is based on and the characteristics of the disease, such as the infection rate, incubation rate, and recovery rate. Some research considers the passengers of flights crossing borders to explain how infectious disease spreads abroad 6 . Moreover, the authors of 7 claimed that infectious disease epidemics can be related to climate and climatic events, such as El Niño. According to the existing research above, the occurrence of infectious diseases varies depending on many different factors, such as climate, national lifestyles, diplomatic relations between countries, and population. Thus, it is important to collect and use the latest data for future infectious disease outbreak prediction. However, the degree of each of these features varies by country over time. For example, El Niño changes climatic attributes throughout the world, digitalization changes human lifestyles, and the number of travelers or the amount of trade between countries may change dramatically for political reasons. Consequently, constructing an infectious disease outbreak prediction model that considers all these features is challenging.

However, as infectious disease spreads based on all these features, it may be possible to assert that the rate of occurrence of a particular infectious disease in a particular country encodes the information mentioned above. That is, we can assume that an infectious disease occurs in a particular country because the conditions of certain features, such as climate, population, lifestyle, and the number of incoming travelers, exceed that disease's thresholds in that country. With this assumption, it is possible to forecast whether an infectious disease that has not occurred recently in a particular country will break out there, by analyzing the occurrence patterns of many different types of infectious diseases across countries.

Normally, when an infectious disease breaks out, the press publishes articles concerning the disease. When the seriousness of the disease rises for some reason, such as an increase in the number of infected people, the number of published articles also increases. In other words, the number of articles related to a particular disease in a particular country reflects how severe the disease is there. Furthermore, media articles and reports are updated worldwide in real time through the Internet, which offers the advantage of immediately accumulating the latest data, whereas collecting actual surveillance data for numerous disease types from countries worldwide is a difficult task 8 . Therefore, various attempts have been made to utilize media article data to predict epidemic outbreaks. Most of these studies try to identify the epidemics occurring in a specific country. In study 9 , media articles related to specific infectious diseases that occurred in the United States, China, and India were collected, and the temporal topic trend was compared with the actual disease case count. The outbreaks of whooping cough, rabies, salmonellosis, and E. coli infection in the United States; H7N9, hand, foot, and mouth disease, and dengue in China; and acute diarrheal disease, dengue, and malaria in India were estimated by the proposed method. This allowed the authors to successfully capture the dynamics of disease outbreaks through the temporal topic trends obtained from media articles. In other words, the degree of the temporal topic trend for a specific infectious disease in a specific country can indicate the severity of that disease in that country.
Furthermore, a study proposing a method to monitor infectious diseases using online news media data applied the proposed model to the outbreak of dengue fever in India and the outbreak of Zika virus in Brazil 10 . Using the collected international and local newspaper data, the number of news reports related to each disease was calculated and compared with the number of actual disease cases. The authors argue that it may be possible to build a surveillance system from news data even in developing countries that do not yet have one. The authors of 8 , 11 suggested predicting the occurrence of infectious diseases by extracting keywords highly relevant to specific infectious diseases, rather than simply counting media articles related to them. All of the aforementioned studies suggest methods to estimate the number of patients with a specific infectious disease in a specific country using online media article data, and they show that such data can contribute greatly to the prediction of infectious diseases. However, since previous studies focused on establishing outbreak surveillance systems, such as measuring the prevalence of existing infectious diseases in a specific region, they are limited to a small number of infectious diseases in a small number of countries. In other words, because the country and the type of infectious disease are fixed in advance, they cannot handle the various kinds of infectious diseases occurring across many countries.
In addition, previous infectious disease prediction studies have successfully established surveillance systems for diseases that are seasonal or already present in certain countries, but they cannot predict the occurrence of diseases that have not yet occurred. Thus, this study proposes a methodology for predicting, by analyzing media article data, the occurrence of various infectious diseases that had not occurred for 6 months in various countries around the world. The remainder of this paper is organized as follows. “ Methods ” section explains which data are used for infectious disease outbreak prediction, and introduces the three machine learning models: semi-supervised learning (SSL), support vector machine (SVM), and deep neural network (DNN). “ Experiments ” section details the performance measures, the experimental settings, and the results. Finally, “ Results ” and “ Conclusion ” sections present our discussion and conclusions, respectively.

Methods

Nowadays, as Internet service is supplied worldwide, people obtain information easily and rapidly. Even news articles are published through the Internet, unlike in the past, when they were printed on paper and delivered. Accordingly, articles and reports related to infectious diseases are also published and updated through Internet media in real time. In other words, unlike in the past, Internet media makes it easy to obtain information about the seriousness of infectious disease issues around the world. Thus, in this research, we collected articles and reports related to 115 different infectious diseases from Medisys, to predict whether a particular infectious disease that had not occurred for several months in a particular country would break out there. Medisys serves news articles and reports on infectious disease published worldwide every day in real time 12 . Articles and reports provided by Medisys are classified by disease, and include the date and time they were published, and the latitude and longitude of where the outbreak of disease occurred. Every article is also published in rich site summary (RSS) form. RSS is a method of displaying content primarily used on news or blog sites; if website administrators publish website content in RSS format, recipients of this information may use it in different formats. Figure  1 shows examples of data provided by Medisys in RSS form and their components. The information of each article is displayed between < item > and < /item > tags, including the article title, description, publication date, original URL, language code, a category indicating the name of the disease, and latitude and longitude. Even though Medisys does not state where an article was published, the place of publication can be tracked by analyzing the latitude and longitude information.
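As a rough illustration of this collection step, the sketch below counts per-country, per-disease articles from RSS `<item>` entries. The tag names (`category`, `latitude`, `longitude`) follow the item layout described above but are simplified; the real Medisys feed uses its own element names and namespaces, and the coordinate-to-country lookup is a hypothetical placeholder supplied by the caller.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_articles(rss_xml, country_of):
    """rss_xml: an RSS document string; country_of: callable (lat, lon) -> country.
    Returns a Counter mapping (country, disease) -> number of articles."""
    root = ET.fromstring(rss_xml)
    counts = Counter()
    for item in root.iter("item"):
        disease = item.findtext("category")
        lat = float(item.findtext("latitude"))
        lon = float(item.findtext("longitude"))
        counts[(country_of(lat, lon), disease)] += 1
    return counts
```

Aggregating these counters by day then yields the numerical article-count dataset described below.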
We accumulated data from Medisys for January to December 2019, consisting of 115,279 articles published in 237 different countries. As described in Fig.  2 , the number of articles per nation and infectious disease was extracted from the data and utilized in this study. However, some poor and developing countries, especially those involved in wars, have less opportunity to publish digital data. Furthermore, population sizes vary by country, which may also affect the number of published articles. For these reasons, the data were normalized between 0 and 1 for each country, to adjust values measured on different scales. Figure  3 shows the reorganized data.
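The per-country scaling mentioned above can be sketched as a row-wise min-max normalization. This is a plausible reading of "normalized between 0 and 1 by each country"; the paper does not state the exact scaler, and the function name is mine.

```python
import numpy as np

def normalize_per_country(counts):
    """counts: 2-D array, one row per country, one column per disease.
    Scale each row to [0, 1]; rows with no variation are left at zero."""
    counts = np.asarray(counts, dtype=float)
    lo = counts.min(axis=1, keepdims=True)
    hi = counts.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (counts - lo) / span
```

Normalizing within each country, rather than globally, keeps countries with very different publication volumes comparable.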

figure 1

Examples of data provided by Medisys in the form of RSS and components of RSS provided by Medisys.

figure 2

The number of articles published in each country, collected from Medisys from January to December 2019: the closer a country's color to yellow, the more diseases occurred, and the larger the circle, the more articles were published. The figure was created in Python3 using the Basemap Toolkit.

figure 3

The number of articles related to each disease by country, collected from Medisys from January to December 2019 (data are normalized; the brighter the color, the more articles, and the darker, the fewer).

To apply the constructed data to machine learning models to predict whether a disease that had not occurred for several months in a particular country would occur, the data set was preprocessed as follows. As shown in Fig.  4 , Table A contains the number of articles related to 115 different diseases across 237 countries during the 6 month period February to July 2019. From Table A, a list of diseases with a count of ‘0′, i.e., diseases that never occurred in each country, was extracted and listed by country in Table B. Each disease listed in Table B is considered to have the potential for an outbreak, because it has not yet occurred in that country. Table C covers the data from August to October 2019, the 3 months after July 2019. If the count for a particular disease in a particular country is 0 in both Tables A and C, the label for that disease and country becomes ‘ − 1′; if the count in Table A is 0 but that in Table C is > 0, the label becomes ‘ + 1′. These labels are arranged as in Table D.
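The Table A to Table D labeling rule above can be written compactly as follows; `train_counts` and `future_counts` stand for Tables A and C (the names are mine, not the paper's), and unlabeled entries, i.e., diseases that already occurred in the training window, are marked with NaN.

```python
import numpy as np

def make_labels(train_counts, future_counts):
    """train_counts, future_counts: (countries x diseases) article counts
    for the training window (Table A) and the following 3 months (Table C).
    Returns a label array (Table D): +1 if a disease absent in training
    broke out later, -1 if it stayed absent, NaN if it already occurred."""
    train = np.asarray(train_counts)
    future = np.asarray(future_counts)
    labels = np.full(train.shape, np.nan)
    absent = train == 0                    # Table B: candidate diseases
    labels[absent & (future > 0)] = 1.0    # outbreak occurred
    labels[absent & (future == 0)] = -1.0  # still no outbreak
    return labels
```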

figure 4

Example of data preprocessing to predict infectious disease outbreak for 3 months after July 2019, using report count data from February to July 2019: Table A indicates the number of reports concerning each disease in each country from February to July 2019. Table B shows the lists of diseases that reported none during the 6 month period February to July 2019 in each country. Table C shows the number of counted reports related to listed diseases in each country from August to October in 2019. Finally, Table D shows the labels of each disease for each country. ‘ + 1′ indicates that the disease occurred in the country between August and October; in contrast, ‘ − 1′ indicates that the disease did not occur, while ‘ − ’ means that the disease had already occurred during the period February to July, thus the disease for the country does not display a label.

Once the data have been preprocessed as shown in Fig.  4 , a list of the infectious diseases to predict in each country can be selected, as shown in Table B, and from this list a label set for each country and each disease can be created, as shown in Table D. With the preprocessed data, the data set for the prediction models of each disease by country can be organized as shown in Fig.  5 . Every node in Fig.  5 is composed of data from Table A of Fig.  4 . In Fig.  5 , if more than a single report related to the infectious disease occurred in a country, the node is labeled ‘ + 1′; in contrast, if no report appears in Table A of Fig.  4 , it is labeled ‘ − 1′, while the unlabeled nodes ‘?’ are those listed in Table B of Fig.  4 . In other words, each square shown in Fig.  5 is a set of labels for predicting disease outbreaks in each country, and the overall data structure of each square can be represented as shown in Fig.  6 . In Fig.  6 , the number in each column is the number of media articles related to each infectious disease in each country during the specified period. For all unlabeled data, a data set as shown in Fig.  6 is formed based on the labels of Fig.  5 , and each data set is applied to the machine learning models to predict the occurrence of a specific infectious disease in a specific country.

figure 5

Data set composition for the prediction model for each disease by country: For example, in the first row, the data set of diseases for Afghanistan is listed in the first row of Table B of Fig.  4 . Nodes with ‘ + 1′ indicate that the reports related to the disease occurred more than once in the country, while nodes with ‘ − 1′ indicate that the reports related to the diseases never occurred in Table A of Fig.  4 .

figure 6

The overall data structure of each square shown in Fig.  5 .

In this research, we adopted three different machine learning models to investigate whether early disease outbreak detection is possible using media articles and reports related to infectious disease, and compared their performance. Three representative models were used to predict disease occurrence: support vector machine (SVM), which performs consistently well across various fields; semi-supervised learning (SSL), which performs well on label-imbalanced data sets; and deep neural network (DNN), a trending method with outstanding performance. The model parameters of SVM, SSL, and DNN were searched over the following ranges. For SVM, the best prediction performances were identified from the combinations of { γ, C}  ∈  {0.0001, 0.001, 0.01, 0.1, 1, 10} × {0.2, 0.4, 0.6, 0.8, 1} 13 . For SSL, k, the parameter that decides the number of neighbors, was searched over k = {3, 7, 15, 20, 30}, and μ, a trade-off parameter, over μ = {0.0001, 0.01, 1, 100, 1000}. Finally, the DNN model was organized with 3 layers and a batch size of 20 for each step. Dropout was set to 0.3 for each layer, Adam gradient descent optimization was applied, and the number of epochs was set to 500. After disease outbreak prediction is made with each model, the model performance is calculated using Table D of Fig.  4 , by comparing the prediction results with the corresponding infectious disease outcomes in the 3 months after the last date used as training data. The progression from data preprocessing to prediction is summarized in Fig.  7 .
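As an illustrative sketch (not the authors' code), the SVM search over the stated {γ, C} grid could be run with scikit-learn's `GridSearchCV`; the synthetic features below stand in for the per-country article-count vectors, which are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in for the (countries x diseases) feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The gamma/C grid stated in the text.
grid = {
    "gamma": [0.0001, 0.001, 0.01, 0.1, 1, 10],
    "C": [0.2, 0.4, 0.6, 0.8, 1],
}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3, scoring="accuracy")
search.fit(X, y)
best = search.best_params_  # best {gamma, C} combination found
```

The SSL neighbor/trade-off parameters k and μ would be searched analogously over their stated ranges.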

figure 7

The order of progress from data preprocessing to prediction.

Ethics approval and consent to participate

This study did not involve human participants, data, or tissue. Institutional review board approval was not required.

Experiments

Media articles and reports published from January to December 2019, crawled from Medisys, are used in this research. The crawled data include the title of each article, its description, the published date and time, the related disease, and the latitude and longitude information. By parsing the data, daily counts of articles related to each disease by country are extracted and organized as a numerical dataset. The extracted dataset covers 115 different diseases and 237 different countries, and the average number of daily articles is about 1,300. Each data point is normalized between 0 and 1. As shown in Fig.  8 , experiments are conducted with two different strategies, setting the length of the training data to 6 and 3 months, respectively, with 3 months of validation data. We examine whether each model can predict, by country, which diseases will break out during the 3 months after the training data, for the validation periods July to September, August to October, September to November, and October to December.

figure 8

Using data crawled for a year, the experiments are set up as follows: first, each model is trained using 6 months' data and predicts whether each disease will break out; second, each model is trained using 3 months' data and predicts whether each disease will break out.

To measure the performance of each prediction model, the AUC, accuracy, and F1 score are used 14 , 15 . The AUC assesses the overall value of a classifier; it is a threshold-independent measure of model performance based on the receiver operating characteristic curve, which plots the trade-off between sensitivity and 1 − specificity for all possible threshold values. Accuracy is the proportion of correct predictions when the classification threshold is set to 0. Lastly, the F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.
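For concreteness, the three measures can be computed with scikit-learn as in the sketch below; the true labels, decision scores, and the 0.5 threshold are made-up toy values, not results from the paper.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 1, -1, -1, 1, -1]               # toy outbreak labels
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.55]     # toy model decision scores
y_pred = [1 if s > 0.5 else -1 for s in scores]

auc = roc_auc_score(y_true, scores)  # threshold-independent, uses raw scores
acc = accuracy_score(y_true, y_pred) # fraction of correct predictions
f1 = f1_score(y_true, y_pred)        # harmonic mean of precision and recall
```

Note that the AUC is computed from the raw scores, while accuracy and F1 depend on the chosen classification threshold.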

The results of the experiment concern the expected accuracy of predicting whether diseases that had not been reported for 6 or 3 months would break out by country. Tables 1 and 2 compare the results of SVM, SSL, and DNN in terms of accuracy, ROC, and F1 score. For each of the three models, the best performance was selected by searching over the respective model-parameter space. For each dataset, the best performance among the three models is marked in bold face. In terms of accuracy, SSL shows the best performance, with average accuracies of 0.838 and 0.834 for the two strategies. In terms of the ROC, SSL delivers outstanding performance, with average values of 0.791 and 0.805. Lastly, in the F1 score as well, SSL produces averages of 0.832 and 0.802, the best of the three models. Figure  9 summarizes the performance of the three models in bar graphs. Even though SSL outperforms the other two models, SVM and DNN also show reasonable performance, with average accuracy over 0.7 and F1 score over 0.75.

figure 9

Accuracy, ROC, and F1 score of each validation data set period by each model, respectively.

In Fig.  10 , the prediction accuracy of SSL for the 8 different experiments is shown on the world map. Some countries are not colored on the map because every kind of disease was mentioned in media articles in those countries; in other words, these countries had no diseases left to predict. While the prediction accuracy for most countries is over 0.8, some countries show very low prediction accuracy. This is because those countries contain only a small number of diseases to be predicted, so a wrong prediction for any one of them significantly reduces the overall accuracy.

figure 10

Prediction accuracy of SSL by country: the circles on the map indicate the number of predicted diseases for each country; the closer the color to yellow, the higher the accuracy, and the closer to blue, the lower the accuracy. The figure was created in Python3 using the Basemap Toolkit.

Results

In this research, the potential of utilizing media data to predict whether an infectious disease will break out in a particular country was examined with three of the most widely used machine learning models, which showed reasonable prediction performance. The occurrence of infectious diseases varies depending on many different factors, such as climate, national lifestyles, diplomatic relations between countries, and population. Therefore, similar types of infectious diseases are likely to occur in countries with similar overall environments. In other words, countries with similar severities of various types of infectious diseases can be regarded as countries with similar environments. Thus, countries with similar infectious disease outbreak patterns can be identified by analyzing the severity patterns of various types of infectious diseases across countries. Moreover, various existing studies have shown that the number of media articles related to a specific infectious disease in a specific country can indicate the severity of the disease in that country. Thus, this study attempted to predict the occurrence of specific infectious diseases in a specific country by analyzing the outbreak patterns of media articles related to various infectious diseases across countries. As the suggested method uses only media articles, even developing countries that have not yet constructed any disease surveillance systems are able to forecast whether particular infections will occur, because there are no critical limitations to accumulating such media articles.

Despite these advantages, further studies should be carried out in the near future to resolve several obstacles. First of all, the periods of the training and validation data were set by dividing the year into quarters or halves, but a more systematic strategy for setting the data periods is needed, for example one that considers seasonal infectious diseases. Moreover, as Medisys does not provide older posts, only about a year's worth of data had accumulated since we started collecting Medisys data at the end of 2018. Therefore, as more data are collected, predictions should be made with the additional data, setting the duration of the training data to at least 1 year.

Second, even though all three models showed reasonable performance, methods to improve the performance of the prediction models should be explored. In this study, we also examined whether prediction performance would improve when the models were trained only with data from countries showing similar infectious disease occurrence patterns. To this end, the prediction models for each country were trained using only data from countries with a correlation coefficient of 0.6 or higher, and Fig.  11 shows the performance of each prediction model. It was expected that training only on data from countries with disease outbreak pattern correlation coefficients over 0.6 would improve the predictions; however, they generally performed worse. From this result, it can be inferred that the prediction model extracts useful information even from countries whose infectious disease patterns are dissimilar. Thus, in further work, instead of feature selection, useful data such as global air passenger data, which can represent the degree of relation between countries, should be accumulated and utilized as weights for the prediction models.
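The country-filtering step described above amounts to selecting rows of the article-count matrix by Pearson correlation. A minimal sketch, assuming per-country count vectors as rows (the function name and interface are mine):

```python
import numpy as np

def similar_countries(counts, target, threshold=0.6):
    """counts: (countries x diseases) article-count matrix.
    Return the row indices whose Pearson correlation with the
    `target` country's outbreak pattern is at least `threshold`."""
    corr = np.corrcoef(counts)  # country-by-country correlation matrix
    return [i for i, r in enumerate(corr[target])
            if i != target and r >= threshold]
```

A model for the target country would then be trained only on the rows returned here.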

figure 11

Prediction performance comparison between models using all countries and models using countries having a disease occurrence pattern correlation coefficient of over 0.6.

Conclusion

The biggest reason why it is not easy to predict the exact incidence of infectious disease is that a variety of characteristics must all be taken into account: the nature of the infectious disease, the geographical characteristics of where it occurs, the characteristics and lifestyles of the people living in the country, the vectors that spread infectious diseases, and the degree of exchange between countries. Furthermore, as time goes by, the weather changes due to global warming, digitalization changes people's lifestyles, and, for many reasons, the set of countries that trade frequently with each other changes. For these reasons, creating a predictive model that takes all of these characteristics into account is challenging. However, as the pattern of infectious diseases varies from country to country for these various reasons, infectious disease incidence data by country can be considered to contain this information. Therefore, in this research, we tried to predict whether a disease will occur in particular countries by analyzing media data accumulated from Medisys with several machine learning models. Our suggested method showed reasonable prediction performance with the three different machine learning models SVM, SSL, and DNN. The proposed model could be used to prepare for future outbreaks of infectious diseases in various countries, including developing countries that lack proper disease surveillance systems.

Data availability

The datasets used during the current study are available from the corresponding author on reasonable request.

Abbreviations

SVM: Support vector machine

SSL: Semi-supervised learning

DNN: Deep neural network

MERS: Middle East respiratory syndrome

SIR: Susceptible infected recovered

ROC: Receiver operating characteristic

1. European Centre for Disease Prevention and Control. Middle East Respiratory Syndrome Coronavirus (MERS-CoV). 21st Update (ECDC, Stockholm, 2015).

2. Centers for Disease Control and Prevention. 2014 Ebola outbreak in West Africa: case counts, 2015. http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/case-counts.html . Accessed 6 April 2015.

3. Dixon, M. G. & Schafer, I. J. Ebola viral disease outbreak—West Africa, 2014. Morb. Mortal. Wkly Rep. 63, 548–551 (2014).

4. Meyers, L. A. Contact network epidemiology: bond percolation applied to infectious disease prediction and control. Bull. Am. Math. Soc. 44, 63–86 (2007).

5. Dimitrov, N. B. & Meyers, L. A. Mathematical approaches to infectious disease prediction and control. INFORMS Tutor. Oper. Res. 7, 1–25 (2010).

6. Hufnagel, L., Brockmann, D. & Geisel, T. Forecast and control of epidemics in a globalized world. Proc. Natl. Acad. Sci. U.S.A. 101, 15124–15129 (2004).

7. Colwell, R. Global climate and infectious disease: the cholera paradigm. Science 274, 2025–2035 (1996).

8. Kim, J. & Ahn, I. Weekly ILI patient ratio change prediction using news articles with support vector machine. BMC Bioinform. 20, 1–16 (2019).

9. Ghosh, S. et al. Temporal topic modeling to assess associations between news trends and infectious disease outbreaks. Sci. Rep. 7, 40841 (2017).

10. Zhang, Y., Ibaraki, M. & Schwartz, F. W. Disease surveillance using online news: Dengue and zika in tropical countries. J. Biomed. Inform. 102, 103374 (2020).

11. Chakraborty, S. & Subramanian, L. Extracting signals from news streams for disease outbreak prediction. In Proceedings of the IEEE Global Conference on Signal and Information Processing 1300–1304 (2016).

12. Steinberger, R., Fuart, F., Best, C. et al. Text mining from the web for medical intelligence. Min. Massive Data Sets Secur. 19, 295–310 (2008).

13. Shin, H. & Cho, S. Neighborhood property-based pattern selection for support vector machines. Neural Comput. 19, 816–855 (2007).

14. Subramanya, A. & Bilmes, J. Soft-supervised learning for text classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii 1090–1099 (2008).

15. Allouche, O. et al. Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic. J. Appl. Ecol. 43, 1223–1232 (2006).

Download references

Acknowledgements

This work was supported by a National Research Council of Science & Technology (NST) grant, funded by the Korea government (MSIP) (No. CRC-16-01-KRICT). This work was supported by the National Research Foundation of Korea (NRF) grant, funded by the Korea government (MEST) (No. 2016M3A9B6915714).

Author information

Authors and Affiliations

Department of Data-Centric Problem Solving Research, Korea Institute of Science and Technology Information, Yuseong-gu, Daejeon, Korea

Juhyeon Kim & Insung Ahn

Center for Convergent Research of Emerging Virus Infection, Korea Research Institute of Chemical Technology, Yuseong-gu, Daejeon, Korea


Contributions

J.K. and I.A. conceptualized the study, and visualized the data and results. J.K. curated the data, performed formal analysis, validated the results, and authored the primary manuscript. I.A. administered and supervised the project, and also reviewed and edited the writing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Insung Ahn .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Kim, J., Ahn, I. Infectious disease outbreak prediction using media articles with machine learning models. Sci Rep 11 , 4413 (2021). https://doi.org/10.1038/s41598-021-83926-2


Received : 30 March 2020

Accepted : 10 February 2021

Published : 24 February 2021

DOI : https://doi.org/10.1038/s41598-021-83926-2


This article is cited by

Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records.

  • Chih-Wei Chung
  • Seng-Cho Chou
  • Yi-Ming Chen

BioData Mining (2024)

Deep learning techniques for detection and prediction of pandemic diseases: a systematic literature review

  • Sunday Adeola Ajagbe
  • Matthew O. Adigun

Multimedia Tools and Applications (2024)

Emerging infectious disease surveillance using a hierarchical diagnosis model and the Knox algorithm

  • Mengying Wang
  • Bingqing Yang

Scientific Reports (2023)


Multiple Disease Prediction System using Machine Learning: This project provides a Streamlit web application for predicting multiple diseases, including diabetes, Parkinson's disease, and heart disease, using machine learning algorithms. The prediction models are deployed using Streamlit, a Python library for building interactive web applications.

Amit380/Multiple-Disease-Prediction-System-using-Machine-Learning

Multiple-Disease-Prediction-System-using-Machine-Learning

![Home Page]


Table of Contents

  • Introduction
  • Contributing

The Multiple Disease Prediction project aims to create a user-friendly web application that allows users to input relevant medical information and receive predictions for different diseases. The machine learning models trained on disease-specific datasets enable accurate predictions for diabetes, Parkinson's disease, and heart disease.

The Multiple Disease Prediction web application offers the following features:

  • User Input : Users can input their medical information, including age, gender, blood pressure, cholesterol levels, and other relevant factors.
  • Disease Prediction : The application utilizes machine learning models to predict the likelihood of having diabetes, Parkinson's disease, and heart disease based on the inputted medical data.
  • Prediction Results : The predicted disease outcomes are displayed to the user, providing an indication of the probability of each disease.
  • Visualization : Visualizations are generated to highlight important features and provide insights into the prediction process.
  • User-Friendly Interface : The web application offers an intuitive and user-friendly interface, making it easy for individuals without technical knowledge to use the prediction tool.
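As a minimal sketch of how such a likelihood might be computed for one disease, the snippet below scores a patient with a logistic model. The feature names, weights, and bias are hypothetical, invented for illustration; they are not the repository's trained models, which are learned offline from disease-specific datasets.

```python
import math

# Hypothetical logistic-regression coefficients; a real deployment would
# load weights learned offline from a disease-specific dataset.
WEIGHTS = {"age": 0.04, "glucose": 0.03, "bmi": 0.05}
BIAS = -6.0

def diabetes_risk(patient):
    """Return a predicted probability of diabetes for one patient dict."""
    z = BIAS + sum(WEIGHTS[k] * patient[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic sigmoid

high = diabetes_risk({"age": 60, "glucose": 160, "bmi": 35})
low = diabetes_risk({"age": 25, "glucose": 85, "bmi": 21})
print(round(high, 2), round(low, 2))
```

The same scoring pattern extends to the other diseases by swapping in their own feature sets and coefficients.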

To use this project locally, follow these steps:

  • Clone the repository:
  • Install the required dependencies by running:

Download the pre-trained machine learning models for diabetes, Parkinson's disease, and heart disease. Make sure to place them in the appropriate directories within the project structure.
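The loading step can be sketched as follows. The `ThresholdModel` class here is a toy stand-in for whatever estimator the pickle files actually contain; storing models as pickle files is an assumption based on common Streamlit practice, not confirmed by the repository.

```python
import pickle

class ThresholdModel:
    """Toy stand-in for a pre-trained estimator."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        # Flag any row whose feature sum exceeds the learned threshold.
        return [1 if sum(r) > self.threshold else 0 for r in rows]

# Offline: the trained model is serialized once...
blob = pickle.dumps(ThresholdModel(threshold=100))
# ...and the web app deserializes it to score user input.
model = pickle.loads(blob)
print(model.predict([[60, 80], [10, 20]]))  # → [1, 0]
```

In the real app the bytes would come from a file (e.g. `pickle.load(open(path, "rb"))`), with one model file per disease.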

Update the necessary configurations and file paths in the project files.

To run the Multiple Disease Prediction web application, follow these steps:

Open a terminal or command prompt and navigate to the project directory.

Run the following command to start the Streamlit application:

Access the web application by opening the provided URL in your web browser.

Input the relevant medical information as requested by the application.

Click the "Predict" button to generate predictions for diabetes, Parkinson's disease, and heart disease based on the provided data.

View the prediction results and any accompanying visualizations or insights.

Feel free to customize the web application's appearance, add more disease prediction models, or integrate additional features based on your specific requirements.

Contributions to this project are welcome. If you find any issues or have suggestions for improvement, please open an issue or submit a pull request on the project's GitHub repository.

This project is licensed under the MIT License . You are free to modify and use the code for both personal and commercial purposes.

Languages:
  • Jupyter Notebook 93.4%
  • Python 6.6%

