Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature

Julien Knafou

1 University of Applied Sciences and Arts of Western Switzerland (HES-SO), Rue de la Tambourine 17, 1227 Geneva, Switzerland

Quentin Haas

2 Risklick AG, Bern, Switzerland

Nikolay Borissov

3 CTU Bern, University of Bern, Bern, Switzerland

Michel Counotte

4 Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland

5 Wageningen Bioveterinary Research, Wageningen University & Research, Wageningen, The Netherlands

Nicola Low

Hira Imeri

Aziz Mert Ipekci

Diana Buitrago-Garcia, Leonie Heron, Poorya Amini, Douglas Teodoro

6 Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland

Associated Data

The datasets used and analyzed during the current study are available in the COAP living evidence database: https://zika.ispm.unibe.ch/assets/data/pub/ncov/ . The training, testing, and ensemble source codes are available under https://github.com/ds4dh/CovidReview .

Abstract

Background: The COVID-19 pandemic has led to an unprecedented amount of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to assist professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the evidence in electronic databases. We aimed to investigate deep learning-based machine learning algorithms to classify COVID-19-related publications to help scale up the epidemiological curation process.

Methods: In this retrospective study, five different pre-trained deep learning-based language models were fine-tuned on a dataset of 6365 publications manually classified into two classes, three subclasses, and 22 sub-subclasses relevant for epidemiological triage purposes. In a k-fold cross-validation setting, each standalone model was assessed on a classification task and compared against an ensemble, which takes the standalone model predictions as input and uses different strategies to infer the optimal article class. A ranking task was also considered, in which the model outputs a ranked list of sub-subclasses associated with the article.

Results: The ensemble model significantly outperformed the standalone classifiers, achieving an F1-score of 89.2 at the class level of the classification task. The difference between the standalone and ensemble models increases at the sub-subclass level, where the ensemble reaches a micro F1-score of 70% against 67% for the best-performing standalone model. For the ranking task, the ensemble obtained the highest recall@3, with a performance of 89%. Using a unanimity voting rule, the ensemble can provide predictions with higher confidence on a subset of the data, achieving detection of original papers with an F1-score of up to 97% on a subset of 80% of the collection, instead of 93% on the whole dataset.

Conclusion: This study shows the potential of using deep learning language models to perform triage of COVID-19 references efficiently and support epidemiological curation and review. The ensemble consistently and significantly outperforms any standalone model. Fine-tuning the voting strategy thresholds is an interesting alternative to annotate a subset with higher predictive confidence.

Background

The coronavirus disease 2019 (COVID-19) pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has led to a historic wave of scientific publications in the biomedical literature [1, 2]. From the beginning of the pandemic, scientific publications related to SARS-CoV-2 and COVID-19 came from the most diverse domains and became available in a myriad of digital repositories (preprint servers, technical reports, peer-reviewed scientific journals, etc.) [3]. This outbreak of publications grew at an unprecedented rate. In this context, it became challenging for medical experts and epidemiologists to follow the latest scientific developments and for curators to manually review and annotate all the available COVID-19 literature to consolidate the fast-moving existing body of knowledge [1].

Several methods for producing living systematic reviews have been proposed to provide up-to-date support for professionals dealing with the pace, amount, and complexity of the COVID-19-related literature [4-7]. A living systematic review is a review methodology that allows information to be updated as soon as new evidence becomes available, in contrast to the methods applied in classic, time-restricted systematic reviews [8, 9]. Moreover, living evidence can narrow the gap between knowledge and practice, as fresh publication findings are swiftly integrated into scientifically informed guidelines [5, 6, 9]. However, the maintenance of living evidence systems still requires continuous manual curation by highly qualified human resources [10, 11]. One of the most time-consuming tasks is to screen the titles and/or abstracts resulting from a literature search and to exclude articles that are clearly ineligible, which may comprise a third or more of all records [2].

To address this challenge, (semi-)automatic curation systems based on text mining and natural language processing (NLP) technologies have been developed to support the review and annotation of large literature corpora [12-22]. These systems support the identification and ranking of relevant articles and the categorization of the selected documents into classes and subclasses for reviewing procedures, and enable information extraction from text passages (e.g., identification of disease mentions). For example, Textpresso Central [16] provides a platform that allows users to create a customized annotated corpus by uploading and processing documents of their choosing. Once documents are loaded, personalized curation searches and pipelines can be applied. PubTator Central [19] is a service for viewing and retrieving bioconcept annotations in full-text biomedical articles. It comprises state-of-the-art text mining models for annotation of several biomedical entities, such as genes and proteins, diseases, chemicals, and species. SIBiLS [20] provides an optimized search engine over the biological literature by augmenting its contents with keywords and standardized entities. Variomes [22] is a system that performs publication triage to support evidence-based decisions. Finally, PubTerm [13] enables the organization of abstracts by terms, using the co-occurrence of terms or specific phrases, among others, to facilitate the biomedical curation process.

Automatic text classification appears as an essential methodology to ensure the high quality of living evidence updates. Text classification consists of assigning categorical labels to a given text passage (e.g., an abstract) based on its similarity to existing labeled examples [23-25]. Classical text classifiers use statistical document representations, in which the relevance of a word to a document is proportional to its frequency in the document and inversely proportional to its frequency in the collection (the so-called term frequency-inverse document frequency (tf-idf) framework), to create vectorial representations of the documents [26]. These representations are then used in machine learning models, such as logistic regression and k-nearest neighbors, to learn a mapping function between the input text and the output classes [27, 28]. The trained models can then predict the predefined labels for new input representations. These models are, however, limited, as they essentially fail to capture the sequential nature of text and the context in which words are embedded.
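
To make this classical pipeline concrete, the following minimal sketch combines a tf-idf representation with a logistic regression classifier using scikit-learn. The documents and labels are hypothetical toy examples; this is not the pipeline evaluated in this study.

```python
# Minimal sketch of the classical tf-idf + logistic regression baseline
# described above (not the deep learning pipeline evaluated in this study).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy documents (title + abstract text) and their subclass labels.
texts = [
    "A cross-sectional study of SARS-CoV-2 seroprevalence in healthcare workers.",
    "Editorial: lessons learned from the first wave of the pandemic.",
    "Phylogenetic analysis of SARS-CoV-2 genomes sampled in early 2020.",
]
labels = ["EPI", "OTHER", "BASIC"]

# tf-idf turns each document into a sparse vector whose weights grow with
# in-document frequency and shrink with collection frequency; logistic
# regression then learns a linear mapping from those vectors to labels.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

print(clf.predict(["Modelling the reproduction number with a compartmental model."]))
```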

To overcome the limitations of the tf-idf framework, state-of-the-art text classifiers use deep learning-based language models to create contextual word and document representations with improved syntactic and semantic features [29]. Language models are a particular type of probabilistic model that, given a sequence of words, compute the probability distribution of the next word. Recent deep learning-based language models, such as the Bidirectional Encoder Representations from Transformers (BERT) [30], learn word representations considering both the forward- and backward-direction contexts of a word using a masked word approach, in which random words are masked from a context and the algorithm tries to predict the most likely hidden word. The models are trained on large corpora, resulting in better word and document representations. These representations are further used as input to other NLP tasks, including text classification and question answering, in a process called transfer learning, which has led to significant improvements in state-of-the-art performance in recent years [31].

In this article, we investigated the use of automatic text classifiers supported by deep learning-based language models to enhance literature triage and annotation in COVID-19 living systematic review systems. Our analysis assessed the effectiveness of different individual deep learning-based language classifiers against two ensemble strategies, in which individual models are combined using either the probability sum of the predictions or a voting strategy where each classifier has a voting right and the classification decision is given to the class obtaining a majority of votes [32-34].

Methodology

Study design

An overview of the study design is presented in Fig. 1. In this retrospective machine learning-based study, we evaluated the performance of different deep learning text classifiers to categorize COVID-19 literature according to their publication type in the COVID-19 Open Access Project (COAP) living evidence database aggregator, which includes publications about SARS-CoV-2 and COVID-19 from PubMed, Embase, medRxiv, and bioRxiv [4]. Five individual classifiers were trained with the publication title, abstract, and source associated with annotation categories of a living systematic review knowledge base. Publication title, abstract, and source were imputed to the original dataset whenever missing. Remaining publications without title or abstract were excluded from the training and evaluation sets. Then, at inference time, the classifiers were applied to individual records to predict the publication category as output. Two ensemble strategies were created using these predictions [32, 34]. The first strategy uses a voting system that takes each classifier output as a vote for a class, while the second considers the sum of the class probabilities attributed by the individual classifiers. For the voting strategy, different cutoffs for the minimal number of votes were applied to compute the final class associated with the publication.

Fig. 1 Overview of the study design. All articles were manually annotated, and then the title, abstract, and source were retrieved. In a k-fold cross-validation setting (k is set to 5 in our experiments), 5 models were fine-tuned, and each standalone model was compared against the others as well as against two types of ensembles

Model training and evaluation were performed on a dataset of articles which were annotated manually by a crowdsourced team of people with training in epidemiology and systematic reviews [2]. Each article was manually classified into one of 22 sub-subclasses describing the type of COVID-19 publication according to its study design or article type (case report, ecological study, modelling study, editorial, etc.). The sub-subclasses are nested into three subclasses, namely epidemiologic study designs (EPI), basic biological or other laboratory-based research studies (BASIC), and other types of articles (OTHER). The subclasses are nested into two classes: original research (ORIGINAL) and articles that were commentaries, editorials, or narrative literature reviews (NON-ORIGINAL). The source dataset is publicly available at https://zika.ispm.unibe.ch/assets/data/pub/search_beta/. To improve the robustness of the results, we trained and evaluated our models using a k-fold cross-validation methodology (k is set to 5 in our experiments). For each fold, 70% of the articles (~ 4.6 k publications) were used to train the model parameters, 10% of unseen documents (dev set) were used to optimize the model hyperparameters, and the remaining 20% of unseen documents (test set) were used to evaluate the performance of the classifier. The final performance was obtained by averaging the results obtained on the k unseen test sets. We used standard classification metrics (precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC)) to assess the performance of the individual models in comparison to the ensemble, and the performance of the latter at different vote majority levels (i.e., simple and absolute). The experiments were performed using the Python package Hugging Face on a Linux machine with a TPU (V3-8).
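
The authors' exact splitting code is available in the linked repository; the sketch below only illustrates how a 5-fold protocol with roughly 70/10/20 train/dev/test proportions could be produced with scikit-learn, using hypothetical labels.

```python
# Illustrative 5-fold train/dev/test split in the spirit of the 70/10/20
# protocol described above (an approximation, not the authors' code).
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.RandomState(42)
n_docs = 6365
X = np.arange(n_docs)                      # document indices
y = rng.randint(0, 22, size=n_docs)        # hypothetical sub-subclass labels

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_dev_idx, test_idx) in enumerate(outer.split(X, y)):
    # 20% of the data is held out as the test set; the remaining 80% is split
    # again so that ~70% of the full collection trains and ~10% tunes (dev).
    train_idx, dev_idx = train_test_split(
        train_dev_idx, test_size=0.125, stratify=y[train_dev_idx],
        random_state=42)
    print(f"fold {fold}: train={len(train_idx)} dev={len(dev_idx)} test={len(test_idx)}")
```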

Dataset description and preprocessing

The COAP data snapshot version used in our experiments contains 6365 publications annotated between 7th January and 10th December 2020. Table 1 shows the distribution of publications across classes, subclasses, and sub-subclasses in the COAP snapshot dataset. The categories are imbalanced for the three categorization levels, as is typically the case for real-world data. Illustratively, the BASIC: Within-host modelling sub-subclass composes only 0.5% of the collection (31 documents), while the OTHER: Comment, editorial, …, non-original sub-subclass is responsible for 27.6% (1758 documents). There are 799 documents for the BASIC subclass and 3665 documents for the EPI subclass, which accounts for 57.6% of the dataset. At the class level, the ORIGINAL class is responsible for 70.1% of the dataset, with the remaining documents (29.9%) being categorized according to the NON-ORIGINAL class.

Table 1 Dataset document count and proportion by class, subclass, and sub-subclass

In the pre-processing phase, the title, abstract, and source fields were concatenated before being fed to a classifier, and each classification model used its own tokenizer to separate the free-text passages into tokens (words or sub-words) [39-42]. The specifics of each model's tokenizer are given in the respective papers (see Table 2).
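
As an illustration of this pre-processing step, the sketch below concatenates the three fields and tokenizes them with a Hugging Face tokenizer. The checkpoint name and record fields are assumptions for the example, not necessarily those used by the authors.

```python
# Sketch of the preprocessing step: concatenate title, abstract, and source,
# then tokenize with the model's own tokenizer (checkpoint name assumed here).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

record = {  # hypothetical publication record
    "title": "Modelling study of SARS-CoV-2 transmission in households",
    "abstract": "We simulate household transmission using a stochastic model...",
    "source": "medrxiv",
}

text = " ".join([record["title"], record["abstract"], record["source"]])
encoded = tokenizer(text, truncation=True, max_length=512)

print(len(encoded["input_ids"]))                                 # sequence length
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:10])  # first sub-words
```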

Table 2 Pre-trained models used in the experiments, the corpus type used in their training, and the number of parameters per model

Classification models

In our experiments, we used the pre-trained models shown in Table 2, which were originally pre-trained using the masked language model task. In a masked language model task, large corpora, such as Medline or Wikipedia, are used to create low-dimensional word (or sub-word) representations in context. In each training step, a sentence taken from the corpus is provided to the model with some (sub-)words masked. The model is then trained to predict the masked (sub-)words from that context. The resulting model encodes contextualized (sub-)words in a low-dimensional space, and the resulting tensorial representations can then be used in downstream tasks, such as text classification, in a process called transfer learning. Two out of the five models (RoBERTa-base and RoBERTa-large) were pre-trained on a general corpus, created using BookCorpus and Wikipedia, while the three other models (COVID-Twitter-BERT, BioBERT, and PubMedBERT) were pre-trained on biomedical corpora. Among the models trained on biomedical corpora, one was pre-trained on a COVID-19-related corpus, and one can be considered large, with 340 M parameters. Further details on the models can be found in the related literature (see Table 2).

Individual deep learning-based classifier for biomedical literature classification

Transformer models [43] with a fully connected perceptron layer on top of the output attention layer were used to discriminate the sub-subclasses of given documents. Using the pre-trained language model classifiers, knowledge acquired by the model in the pre-training phase can be transferred to the specific task during the so-called fine-tuning phase, in which task-specific examples are given to the original model so that its parameters can be updated for the task at hand [30]. In our case, the specific classification task consists of fine-tuning the models on a subset (training set) of the manually annotated dataset, followed by the classification of documents from another unseen subset (test set) among the 22 sub-subclasses of the knowledge base. At the inference phase, the model extracts features from the document metadata (i.e., title, abstract, and source) and outputs a probability for each of the 22 sub-subclasses. As sub-subclasses are mutually exclusive, for a given document, the sum of all the probabilities across sub-subclasses is equal to 1. Additionally, predictions at the subclass and class levels were computed. To do so, the probabilities of the sub-subclasses belonging to a subclass (or class) are summed. In other words, the probability of a document being classified in a given class is the sum of the probabilities of that document being classified in all the sub-subclasses mapped to that class, with the mapping as per Table 1. The predicted category, i.e., class, subclass, or sub-subclass, is then defined as the one with the highest probability across all the predicted probabilities.
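
The roll-up of sub-subclass probabilities into subclass and class probabilities can be sketched as follows. The mapping shown is abbreviated and purely illustrative of the 22-category scheme; the full mapping is given in Table 1.

```python
# Sketch of summing mutually exclusive sub-subclass probabilities to obtain
# subclass- and class-level probabilities, as described above.
SUBCLASS_OF = {            # sub-subclass -> subclass (abbreviated mapping)
    "EPI: Case report": "EPI",
    "EPI: Modelling study": "EPI",
    "BASIC: Within-host modelling": "BASIC",
    "OTHER: Comment, editorial, non-original": "OTHER",
}
CLASS_OF = {"EPI": "ORIGINAL", "BASIC": "ORIGINAL", "OTHER": "NON-ORIGINAL"}

def roll_up(sub_sub_probs):
    """Sum sub-subclass probabilities into subclass and class probabilities."""
    subclass_probs, class_probs = {}, {}
    for label, p in sub_sub_probs.items():
        sub = SUBCLASS_OF[label]
        subclass_probs[sub] = subclass_probs.get(sub, 0.0) + p
        cls = CLASS_OF[sub]
        class_probs[cls] = class_probs.get(cls, 0.0) + p
    return subclass_probs, class_probs

probs = {"EPI: Case report": 0.15, "EPI: Modelling study": 0.55,
         "BASIC: Within-host modelling": 0.20,
         "OTHER: Comment, editorial, non-original": 0.10}
sub, cls = roll_up(probs)
print(max(sub, key=sub.get), max(cls, key=cls.get))  # EPI, ORIGINAL
```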

Figure 2 shows the publication classification workflow. The model starts with a publication containing a title, an abstract, and a source. The text contained in those three fields is concatenated, and a tokenizer splits it into tokens (e.g., words or sub-words). Each token is then linked to a token ID, which allows the language model to look up a vector representation of that token. In our example, the word "Study" is split into the "Stu" and "#dy" sub-words. "Stu" has token ID 51 and finds its vector representation in the 51st row of the model's embedding matrix. Once retrieved, the language model receives this vector representation v51 as an input along with all the other token representations. The language model then passes the publication representation to a classifier, which outputs a probability for each sub-subclass.

Fig. 2 Publication classifier workflow. The model starts with the title, abstract, and source fields and concatenates their text contents before tokenizing them. Each model computes its predictions, and an ensemble strategy, voting or probability sum, combines them to get a final prediction

Ensemble: voting and probability sum strategies

Models can be assembled by making the individual models vote for a category. In the default version, the final category is defined by the highest number of votes. A minimum number of votes required to trigger a voting ensemble prediction can also be set. In this setting, an unknown prediction, that is, when the ensemble is unsure about the category, is possible when there is a tie or when the number of votes is below the threshold (i.e., there is no unanimity). With this ensemble strategy, only the class level (binary) is guaranteed to always receive a prediction with a threshold equal to 3 in our setting (5 models). Alternatively, a probability sum strategy can be used to create the ensemble. The idea is to sum the probabilities of the classifiers for all the categories and then take the most probable category as the ensemble classification. Unless stated otherwise, the probability sum strategy is the default ensemble, as this method always gives a unique prediction in every situation. In Fig. 2, as an example, 3 out of 5 models predicted the EPI subclass, so the voting ensemble ended up predicting the EPI subclass. For the probability sum strategy, the sum of all subclass predictions among all 5 models gives a score of 3.1 for the EPI subclass, which makes it the highest score among all the other subclasses. Even though in this case the predictions are the same for both strategies, it is worth noting that this is not systematically the case.
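
A minimal sketch of the two ensemble strategies, with hypothetical model outputs, is shown below; thresholds and tie handling are simplified compared to the full setup described here.

```python
# Sketch of the two ensemble strategies described above: majority voting
# (with an optional minimum-vote threshold) and probability summing.
import numpy as np

labels = ["EPI", "BASIC", "OTHER"]
# One row per standalone model, one column per (sub)class probability
# (hypothetical values for illustration).
model_probs = np.array([
    [0.70, 0.20, 0.10],
    [0.55, 0.35, 0.10],
    [0.40, 0.45, 0.15],
    [0.80, 0.15, 0.05],
    [0.30, 0.60, 0.10],
])

def vote_ensemble(probs, threshold=3):
    """Each model votes for its top label; return None below the vote threshold."""
    votes = np.bincount(probs.argmax(axis=1), minlength=probs.shape[1])
    winner = votes.argmax()
    return labels[winner] if votes[winner] >= threshold else None

def prob_sum_ensemble(probs):
    """Sum per-label probabilities across models and take the argmax."""
    return labels[probs.sum(axis=0).argmax()]

print(vote_ensemble(model_probs, threshold=3))   # 'EPI' (3 of 5 votes)
print(prob_sum_ensemble(model_probs))            # 'EPI' (summed score 2.75)
```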

Model interpretation

To gain insight into the impact of individual words on the model, integrated gradients [44] were computed with the Captum [45] implementation for the PubMedBERT model at the subclass level. According to this method, the higher a token scores, the more important it is to the prediction, and the score polarity indicates a positive or negative impact on the classification. This experiment is twofold. First, about 600 never-seen documents were classified, and the 20 highest positive-impact words for each subclass prediction were reported. To deal with tokenized sub-words, a word score was computed as the mean over its sub-word scores. Then, to reflect a more general impact of a given word for a subclass, each word was lemmatized, and the word score was computed as the mean of the scores of words sharing that lemma. This way, a word and its plural are merged; for example, "simulation" and "simulations" pool their scores, which are attributed to the lemma "simulation." To avoid non-generalizable high-impact words, only words with at least 5 occurrences were considered. In the second part of this experiment, a few publication scores were analyzed. To do so, the documents analyzed were sampled based on the top-20 positive word statistics.
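
The following is a hedged sketch of how token-level integrated gradients attributions might be computed with Captum for a BERT-style classifier. The checkpoint name, number of labels, target index, and baseline choice are assumptions for illustration; the authors' actual implementation may differ, and the classification head here is untrained unless fine-tuned first.

```python
# Sketch of token attribution with integrated gradients via Captum,
# in the spirit of the interpretability analysis described above.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(name)
# num_labels=3 is an assumption; the head is randomly initialized here.
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)
model.eval()

def forward(input_ids, attention_mask):
    # Return class probabilities so attributions relate to the target label.
    return model(input_ids=input_ids, attention_mask=attention_mask).logits.softmax(-1)

enc = tokenizer("A cross-sectional seroprevalence study.", return_tensors="pt")
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

lig = LayerIntegratedGradients(forward, model.bert.embeddings)
attributions = lig.attribute(inputs=enc["input_ids"],
                             baselines=baseline,
                             additional_forward_args=(enc["attention_mask"],),
                             target=0)          # attribution for label index 0

scores = attributions.sum(dim=-1).squeeze(0)    # one score per input token
for token, score in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()), scores):
    print(f"{token:15s} {score:+.3f}")
```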

Statistical analysis

To evaluate our models, standard multiclass classification metrics were used, such as precision, recall, F1-score, and AUC-ROC [26]. Precision describes the proportion of correctly classified documents among all the documents assigned by the model to a given class:

$$\mathrm{Precision} = \frac{tp}{tp + fp}$$

where tp is the number of true positives and fp is the number of false positives. Recall describes the proportion of correctly classified documents among all the positive documents for a given class:

$$\mathrm{Recall} = \frac{tp}{tp + fn}$$

where fn is the number of false negatives. Finally, the F1-score can be formulated as the harmonic mean of the model precision and recall:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

For these three metrics, the closer the result is to 1, the better the model performance. Lastly, AUC-ROC computes the area under the ROC curve, where the ROC plots, for varying classification thresholds, the tp rate (or recall or sensitivity) against the fp rate (or 1 − specificity):

$$\mathrm{TPR} = \frac{tp}{tp + fn}, \qquad \mathrm{FPR} = \frac{fp}{fp + tn}$$

where tn is the number of true negatives.
To obtain a confidence interval (CI) for the AUC-ROC, a bootstrap with n = 2000 resamples was computed. The 2.5% and 97.5% percentiles of the resulting distribution were reported to obtain a 95% CI. The McNemar test was used for statistical significance testing [46].
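
A minimal sketch of such a bootstrap CI computation, on hypothetical predictions, could look as follows.

```python
# Sketch of a bootstrap 95% CI for the AUC-ROC, matching the n = 2000
# resampling procedure described above (labels and scores are hypothetical).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)            # hypothetical binary labels
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.3, size=500), 0, 1)

aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))   # sample with replacement
    if len(np.unique(y_true[idx])) < 2:
        continue                                 # AUC undefined with a single class
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lower, upper = np.percentile(aucs, [2.5, 97.5])
print(f"AUC-ROC = {roc_auc_score(y_true, y_score):.3f} (95% CI {lower:.3f}-{upper:.3f})")
```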

In the ranking experiments, the model predicts a ranked list of sub-subclasses according to their probabilities for a given input document. Thus, we use standard information retrieval metrics to report our results. The precision at rank k (@k) is the precision across the first k sub-subclasses returned by our classifiers. As it is a multi-class problem, each document belongs to only one true class; thus, the theoretical maximum precision@k is equal to 1/k. By analogy, recall@k is computed across the first k sub-subclasses. Conversely to precision, as k increases, recall@k approaches 1. As there are 22 sub-subclasses, by definition, recall@22 is equal to 1. Finally, the mean average precision (MAP) @k is the mean of all the average precisions (AP) @k, which is defined as follows:

$$AP@k = \frac{1}{N_{Relevant}} \sum_{i=1}^{k} P(i) \cdot rel(i)$$

where P(i) is the precision at position i, rel(i) is a function equal to 1 if the i-th returned document is relevant and 0 otherwise, and N_Relevant is the number of relevant documents for a given query. As our classification problem is mutually exclusive, N_Relevant is equal to 1 and P@1 = R@1 = MAP@1. Compared to traditional classification metrics, which only consider the top model prediction, the ranking metrics help us to understand how good the top-k classification predictions are.
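
For illustration, the sketch below computes precision@k, recall@k, and MAP@k for the single-relevant-label case on hypothetical ranked predictions.

```python
# Sketch of the ranking metrics used above (precision@k, recall@k, MAP@k)
# for the single-relevant-label case.
import numpy as np

def metrics_at_k(true_labels, ranked_predictions, k=3):
    """With exactly one relevant label per item, AP@k = 1/rank if the true
    label appears in the top k, and 0 otherwise."""
    p_at_k, r_at_k, ap_at_k = [], [], []
    for truth, ranking in zip(true_labels, ranked_predictions):
        top_k = list(ranking[:k])
        hit = truth in top_k
        p_at_k.append(int(hit) / k)            # at most 1/k for single-label data
        r_at_k.append(float(hit))
        ap_at_k.append(1.0 / (top_k.index(truth) + 1) if hit else 0.0)
    return np.mean(p_at_k), np.mean(r_at_k), np.mean(ap_at_k)

truth = ["EPI: Modelling study", "BASIC: Animal experiment"]
ranked = [["EPI: Modelling study", "EPI: Ecological study", "EPI: Review"],
          ["BASIC: Sequencing and phylogenetics", "BASIC: Animal experiment", "OTHER: Other"]]
print(metrics_at_k(truth, ranked, k=3))   # (0.333..., 1.0, 0.75)
```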

Results

Classification performance

Tables 3, 4, and 5 show the performance of the different models using the F1-score metric at the class, subclass, and sub-subclass levels, respectively. The ensemble significantly outperformed the best standalone model, with a micro F1-score of 89% (Table 3). PubMedBERT obtained the best F1-score across the standalone models for all the classes. When comparing the standalone models to each other, there is no significant improvement. Although the improvement of the ensemble with respect to the PubMedBERT model is statistically significant, it accounts for less than a point for both the micro and macro F1-scores. At the subclass level (Table 4), similarly to the class level, the ensemble significantly outperformed all single models, in this case by more than a percentage point for both micro and macro F1-scores (86% vs. 85% micro F1-score and 84% vs. 83% macro F1-score), and it is also consistently the best-performing model across all the subclasses. PubMedBERT was again the overall best standalone model at the subclass level, with micro and macro F1-scores of 85% and 83%, respectively. At the sub-subclass level (Table 5), the ensemble achieved significantly the best micro and macro average F1-scores (70% and 55%), having the highest F1-score for 10 sub-subclasses, for which 3 of the improvements were statistically significant. Among the standalone models, PubMedBERT had the best micro F1-score (67%), while RoBERTa-large presented the best macro F1-score (53%). The relevant gap between the aggregated scores (micro and macro F1-scores) in Tables 4 and 5 suggests that there were more intra-level than inter-level misclassifications. In other words, misclassified sub-subclasses were often confused with sub-subclasses belonging to the same subclass. Finally, Table 6 shows the AUC-ROC performance and the respective 95% CIs for each level. Here, the ensemble systematically reports a higher performance than any standalone model. When compared to BioBERT, the best standalone model on this metric, there is no CI overlap at any level, confirming the statistically significant improvement of the ensemble model.

Table 3 F1-score performance for both the models and ensemble across all the classes

a Statistically significant improvement

Table 4 F1-score performance for both the models and ensemble across all the subclasses

Table 5 F1-score performance for both the models and ensemble across all the sub-subclasses

Table 6 AUC-ROC performance and a 95% CI for the different classification levels for the best standalone and the ensemble models

The worst-performing sub-subclasses (F1-score < 30.00), namely EPI: Other, BASIC: Basic research review, BASIC: Within-host modelling, and OTHER: Other, are all underrepresented in the dataset, accounting for only 2.0%, 2.1%, 0.5%, and 2.2%, respectively. The poor performance for these classes had a negative impact on the macro average F1-score, which is below the micro average for all the models. In contrast, the best-performing sub-subclasses (F1-score > 70.00), namely EPI: Case report, EPI: Modelling study, EPI: Review, BASIC: Animal experiment, BASIC: Sequencing and phylogenetics, and OTHER: Comment, editorial, …, non-original, accounted for 3.8%, 12.7%, 11.4%, 0.7%, 3.8%, and 27.6% of the dataset, respectively. These 6 sub-subclasses (30% of the sub-subclasses) account for about 60% of the collection, yet with a high variance in their distribution. These results suggest that the number of training examples alone is not enough to explain the model performance, and that textual features in the title + abstract + source fields and/or the category definition make some classes easier to learn.

Analyses of the ensemble model

In Fig. 3, we analyze major aspects of the ensemble outcomes. In Fig. 3A, the ensemble precision/recall curve is plotted against the curves for the RoBERTa base and large models for the ORIGINAL class. As we can notice, the ensemble curve is consistently above both RoBERTa models, which shows the robustness of using a probability sum strategy for assembling models. The precision/recall curves obtained by the ensemble model for the 22 sub-subclasses are presented in Fig. 3B. The same under-performing sub-subclasses as previously observed in the strict classification results can be distinguished, in particular EPI: Other, BASIC: Basic research review, BASIC: Within-host modelling, and OTHER: Other (as in Table 3). This demonstrates that the low performance obtained for these categories is not a result of the classification threshold tuning. Despite their poor performance, they are well above a random classifier baseline, which would have a theoretical constant precision of about 0.05 (1/22 sub-subclasses).

Fig. 3 A Precision/recall curves of the ORIGINAL class for the RoBERTa base/large models and the ensemble. B Precision/recall curves obtained by the ensemble model for the sub-subclasses. Well-represented sub-subclasses usually perform better than underrepresented ones

Figure 4 shows the confusion matrices for the different classification levels obtained by the ensemble model. As we can see from Fig. 4A and B, the ensemble tends to predict the EPI subclass when misclassifying a document. When switching from Fig. 4A to B, the EPI confusion is split from the BASIC class into both BASIC and OTHER. At the sub-subclass level (Fig. 4C), the EPI: Review class [13] was consistently confused with the BASIC: Basic research review [20]. This confusion is expected considering that both sub-subclasses refer to review documents. Moreover, the ensemble tends to get confused for some of the EPI: … study sub-subclasses, often predicting Cohort [4] instead of Case–control [3], Cross-sectional [5] instead of Qualitative [12], Modelling [9] instead of Ecological [7], and others. There is also a clear confusion cluster when the ensemble predicts Biochemical/protein structure studies [17] and Sequencing and phylogenetics [18], as these documents are often confused with some of the BASIC sub-subclasses (in particular from 15 to 19). These observations reinforce our previous hypothesis that sub-subclasses were often misclassified within the same subclass. This becomes more evident if we focus on the sub-subclass confusion matrix by square segments as highlighted in Fig. 4C (horizontal and vertical gray lines): from index 1 to 14 → EPI, from index 15 to 20 → BASIC, and for indexes 21 and 22 → OTHER. All shaded squares inside this perimeter (the majority) are intra-subclass misclassifications, while the ones outside are inter-subclass misclassifications. Lastly, a vertical line of confusion can also be observed for the OTHER: Comment, editorial, …, non-original sub-subclass predictions, which the ensemble tends to predict for a wide variety of documents (more precisely 8, 10–13, 20–21). The broad definition of this category is likely the reason for its confusion with so many other sub-subclasses.

Fig. 4 Confusion matrices for the class (A), subclass (B), and sub-subclass (C) levels. The ensemble has a higher probability of confusing sub-subclasses inside their nested subclasses and classes, which is why performance tends to be higher at those higher levels

Ranking analysis

Table 7 shows the ranking performance for the standalone models and the ensemble. BioBERT performed better than all the other standalone models on the ranking metrics, whereas PubMedBERT tended to be the best from the strict classification perspective. However, in both perspectives, the ensemble achieves the highest performance across all models. In fact, the ensemble returns the right sub-subclass in the top-1 position in 71% of cases, with a precision@3 of 30% (theoretical maximum of 33%) and a recall@3 of 89%. This means that in almost 9 out of 10 document classifications, the ensemble returned the correct sub-subclass in the top 3. Moreover, the ensemble obtained a MAP@3 of 79%, an improvement of more than 2.5 points with respect to the best standalone model (BioBERT).

Table 7 Metrics per label using the top-k retrieved categories

P: precision; R: recall; MAP: mean average precision. As this is a single-label task, the maximum value for P@3 is 1/3 (33%)

k-vote analysis

In Fig. 5, we show the strict classification performance for the ORIGINAL class using the ensemble for different voting thresholds. The threshold on the number of votes (t) corresponds to the minimal number of votes for a category required for the ensemble to trigger a classification decision. In contrast, the probability threshold per vote (tv) refers to the probability threshold a single model needs to reach to vote for a given category. When such a probability threshold is not met, the model is not allowed to vote. Such voting strategies make unknown predictions possible, reducing the size of the classification set. In addition to static voting thresholds [3–5], a dynamic threshold, for majority and unanimity, is introduced, where the total number of votes can change depending on unknown predictions for a given classifier. This means that if 2 classifiers (out of 5) were to predict unknown for a publication, the dynamic majority and unanimity thresholds would be set at 2 and 3, respectively.
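
A simplified sketch of this thresholded voting scheme, with hypothetical model probabilities, is given below; only the dynamic majority case is shown, and the exact rules used in the experiments may differ in detail.

```python
# Sketch of thresholded voting: a model only casts a vote when its top
# probability exceeds t_v, and a prediction is emitted only when the winning
# category reaches the vote threshold t (static or dynamic majority).
import numpy as np

labels = ["ORIGINAL", "NON-ORIGINAL"]

def thresholded_vote(model_probs, t=5, t_v=0.5, dynamic=False):
    """Return a label or None (unknown) given per-model probability vectors."""
    votes = []
    for probs in model_probs:
        top = int(np.argmax(probs))
        if probs[top] >= t_v:                  # abstain when the model is unsure
            votes.append(top)
    if not votes:
        return None
    counts = np.bincount(votes, minlength=len(labels))
    required = t if not dynamic else (len(votes) // 2 + 1)   # dynamic majority
    winner = int(np.argmax(counts))
    return labels[winner] if counts[winner] >= required else None

probs = [np.array([0.97, 0.03]), np.array([0.91, 0.09]), np.array([0.88, 0.12]),
         np.array([0.55, 0.45]), np.array([0.99, 0.01])]
print(thresholded_vote(probs, t=5, t_v=0.9))          # None: only 3 of 5 votes cast
print(thresholded_vote(probs, t_v=0.9, dynamic=True)) # 'ORIGINAL': 3 of 3 cast votes
```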

Fig. 5 F1-score (A), precision (B), and recall (C) for the ORIGINAL class with respect to the probability threshold per vote when using the voting strategy across the predictions at the class level. Using different thresholds considerably improves performance while reducing the number of predicted publications

The behavior of the ORIGINAL class prediction in terms of F1-score is presented in Fig. 5A. As it is a binary problem, setting a dynamic majority and a static one (t = 3) while tv = 0.5 produced the same results, shown as a full-size dot placed around 92%. This phenomenon occurs because there will always be a predicted class with a probability above tv = 0.5; hence, all the models end up voting. Overall, there is an average of about 93% F1-score on most of the dataset across all tv values when using the majority voting rules, and 97% F1-score on a subset of about 80% of the dataset when using the static unanimity voting rule. In other words, for the ORIGINAL class, more confident results can be obtained (about 4 points of F1-score growth) on a subset of the collection (representing about 80% of it) when switching from a majority to a static unanimity voting rule. The respective performance in terms of precision and recall is shown in Fig. 5B and C. We can notice that recall is consistently higher than precision, which means that this ensemble strategy is better at retrieving ORIGINAL articles than at refining the selection. The observed trend is similar to the F1-score performance, where we trade classification of 100% of the dataset with a precision of about 91.5% for a precision of about 96% on about 80% of the dataset with a fixed tv = 0.5 when switching from a majority to a static unanimity voting rule. A recall of about 99% and an F1-score of about 98.5% are achieved on 50% of the collection when setting tv = 0.99 and t = 5, enabling the classification of half of the publications with almost no mistakes.

Figures 6A to C show the top 20 positive-impact words for the EPI, BASIC, and OTHER subclasses. When taking a closer look at some lexical fields, in the EPI subclass for instance, documents containing "modeling," "mathematical," "modelling," "simulation," "simulated," and "equation" are all related to the EPI: Modelling study sub-subclass. Indeed, of the 38-document subset containing at least one of those words, 37 were classified by the model as EPI: Modelling study. In BASIC, the same applies to the "seq" and "sequence" lexicons, where 27 publications out of 28 were classified by the model as either BASIC: Sequencing and phylogenetics or BASIC: Biochemical/protein structure studies. In other words, the model clearly seems to retain high-importance words at the sub-subclass level, which makes sense as it is the level the model was fine-tuned on. As for OTHER, the classifier seems to attribute a lot of weight to the word "viewpoint" for OTHER: Comment, editorial, …, non-original publications, with 7 out of 7 publications containing the word classified as such.

Fig. 6 A, B, and C Top 20 positive impact words for either EPI (A), BASIC (B), or OTHER (C) subclasses when taking the integrated gradient on a never-seen set of about 600 documents. D, E, and F Classification examples with a focus on passages with impact word scores

Figures 6D to F depict three publications highlighted using their integrated gradient scores. The publication in Fig. 6D¹ was chosen because it illustrates the usage of the top BASIC impact words, whereas the publications in Fig. 6E² and F³ were selected because they emphasize the highest EPI impact words while giving an example of a negative-impact word. In Fig. 6D, the model predicts the BASIC label with 98% probability, and the impact words seem to focus on the "sequence analysis" part, with "sequence" being the top impact word on average for that subclass. A look at the sub-subclass prediction level gives a probability of about 95% for the BASIC: Sequencing and phylogenetics sub-subclass. In Fig. 6E, there is an example of a "sectional" occurrence, the reported most important word for the EPI subclass. In our set, the word appears in 7 documents, each time along with the words "cross" and "study." This publication is classified in the EPI: Cross-sectional study sub-subclass with a probability of 96%. Interestingly, all 7 documents were classified as EPI: Cross-sectional study except for the publication in Fig. 6F, which was classified as EPI: Cohort study with 74% probability, and for which the classifier seems to give more importance to the word "retrospective" in the methods section than to "sectional" in the design section. As both sub-subclasses are nested into the same subclass, the publication is still classified in the EPI subclass with a high probability of 98%.

Discussion

In this article, we introduce an efficient methodology to assist epidemiologists and biomedical curators in screening articles for inclusion in living systematic reviews by providing a COVID-19 literature triage solution based on deep learning methods. Supported by an existing manually classified collection, we proposed a classification method that automatically assigns categories from a living evidence knowledge base to scientific documents using BERT-like language models, based on which we proposed two methods to combine individual model predictions (probability sum and voting). The results demonstrate that the ensemble performs consistently better than any standalone model, statistically improving upon the best standalone baseline on both the strict classification and ranking tasks.

Error analyses of the living evidence dataset used in our experiments showed that classification confusion often happens at the intra-category level. This helps to explain the difference in performance observed when zooming from the sub-subclass to the class level, for which the micro F1-score goes from almost 70% to almost 90%, respectively. We believe that in this case, there are important patterns within categories that the machine learning models can identify and exploit to provide the correct predictions at the class and subclass levels. On the other hand, at the sub-subclass level, we expect that documents could often be related to more than one category, that is, they are mostly within one category but may also contain information associated with another category, which could lead to confusion of the classifier when assigning the sub-subclass, a phenomenon which also occurs during human annotation. Hence, we believe that a multi-label assignment strategy at the sub-subclass level could be an interesting alternative in the original annotation protocol.

Given the strong performance of the proposed classifier, it could be used to support annotation of scientific articles and help to speed up, augment, and scale up epidemiological reviews and biomedical curation. When looking at the problem from a ranking perspective, in which the system suggests a list of sub-subclasses for a given article, the ensemble returned the right category in its top 3 suggestions in almost 90% of the cases. Such robust performance could help augment the annotation process, for example, by enabling human annotators to double the number of screened articles, replacing one annotator with a machine annotation in the standard double-annotation process. In this setting, if the category proposed by the human annotator matched one of the top 3 categories proposed by the automatic classifier, this category would be deemed validated. Otherwise, it would be sent to a senior annotator for a final decision on the remaining 10% of the cases. Considering that a typical inter-annotator agreement in the health and life sciences field is around 80% [47], this setup could reduce the number of human resources required by at least 50% while maintaining the high quality of the annotations. Alternatively, when using a voting strategy with a confidence threshold, we showed that our method was capable of robust and superior performance on a subset of the collection at the class level (about 98.5% F1-score on 50% of the dataset). This approach could be used, for example, in the triage process, when a large batch of articles needs to be classified, thus scaling up the classification process.

The interpretability analysis showed that the model is not a complete black box, as is often the case in deep learning applications. Using the integrated gradient method helped to understand why the model classified a publication into one sub-subclass instead of another. These results could additionally be used by annotation experts as a tool to highlight documents during the curation process. It would also be interesting to investigate the results of this analysis at the subclass level, which we believe could lead to a lexicon defining each subclass. Such approaches could then be combined to get multiple views by category level, which could be further assembled to obtain better publication insights and perhaps better screening results. We leave this investigation for future work.

A main limitation of the study is that it uses a dataset from only one living evidence knowledge base to train and evaluate the models. Thus, it is unclear how the proposed methodology would generalize to corpora and categories used in other reviews and living evidence knowledge bases. That said, given the strong performance obtained on other corpus types by a similar methodology [34], we believe that it should generalize well. Second, in our experiments, we did not explore the full contents of the articles. This is due to the unavailability of the full text for a large portion of the collection, owing either to paywalls or to restrictions by publishers on processing full text with NLP pipelines. Additionally, as the time complexity of the models used is quadratic in the number of words, the computation time becomes prohibitive as we move from abstract to full-text content. Nevertheless, we believe that valuable information supporting the classification can sometimes only be found in the full text of the manuscripts. An extended version of the approach could investigate such corpora.

Conclusions

In this work, we described an effective methodology to perform automatic classification of COVID-19-related literature to support creation of systematic living reviews and living evidence knowledge bases. The proposed ensemble model provided strong (semi-)automatic classification performance, significantly outperforming standalone methods, and enabled the categorization of a subset of the collection with improved accuracy. Hence, this approach could serve as an alternative assistant to professionals dealing with the COVID-19 pandemic literature outbreak. Ultimately, our method provides a performant and generic procedure, enabling efficient annotation of important volumes of scientific literature, which could be leveraged to assist experts in different literature classification tasks and extended to different types of review methodologies.

Acknowledgements

Lucia Araujo-Chaveron, Ingrid Arevalo-Rodriguez, Muge Cevik, Agustín Ciapponi, Muhammad Irfanul Alam, Kaspar Meili, Eric A. Meyerowitz, Nirmala Prajapati, Xueting Qiu, Aaron Richterman, William Gildardo Robles-Rodríguez, Shabnam Thapa, and Ivan Zhelyazkov annotated records in the COVID-19 Open Access Project living evidence database.

Abbreviations

Authors’ contributions.

JK designed and implemented the models and ran the experiments and analyses. JK, DT, and QH wrote the manuscript draft. NB created the benchmark dataset. DT, PA, and NL conceived the experiments. MC, HI, and LH programmed and maintained the COVID-19 Open Access Project living evidence database. DBG and AMI organized the annotation of study design in the study records. All authors reviewed and approved the manuscript.

Funding

Open access funding provided by University of Geneva. This project has been supported by CINECA (EU H2020 Grant No. 825775 and Canadian Institute of Health Research (CIHR) Grant No. 404896), Innosuisse project funding number 41013.1 IP-ICT, the Swiss National Science Foundation (project number 176233), and the European Union Horizon 2020 research and innovation programme, project EpiPose (grant agreement number 101003688).

Availability of data and materials

Declarations.

Not applicable.

The authors declare that they have no competing interests.

1 https://www.biorxiv.org/content/10.1101/2020.04.07.029488v1.full

2 https://pubmed.ncbi.nlm.nih.gov/32237161/

3 https://pubmed.ncbi.nlm.nih.gov/32237161/

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Julien Knafou, Email: [email protected] .

Quentin Haas, Email: [email protected] .

Nikolay Borissov, Email: [email protected] .

Michel Counotte, Email: [email protected] .

Nicola Low, Email: [email protected] .

Hira Imeri, Email: [email protected] .

Aziz Mert Ipekci, Email: [email protected] .

Diana Buitrago-Garcia, Email: [email protected] .

Leonie Heron, Email: [email protected] .

Poorya Amini, Email: [email protected] .

Douglas Teodoro, Email: [email protected] .



Article information

Systematic Reviews, volume 12, Article number 94 (2023). Published: 05 June 2023. Open access.

  • PMID: 37277872
  • PMCID: PMC10240481
  • DOI: 10.1186/s13643-023-02247-9

Keywords: COVID-19; Deep learning; Language model; Literature screening; Living systematic review; Text classification; Transfer learning.

© 2023. The Author(s).

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Deep Learning*
  • Retrospective Studies

Grants and funding

  • 404896/CIHR/Canada

Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature

  • Julien Knafou   ORCID: orcid.org/0000-0002-9086-4982 1 ,
  • Quentin Haas 2 ,
  • Nikolay Borissov 1 , 3 ,
  • Michel Counotte 4 , 5 ,
  • Nicola Low 4 ,
  • Hira Imeri 4 ,
  • Aziz Mert Ipekci 4 ,
  • Diana Buitrago-Garcia 4 ,
  • Leonie Heron 4 ,
  • Poorya Amini 2 , 3 &
  • Douglas Teodoro 1 , 6  

Systematic Reviews volume  12 , Article number:  94 ( 2023 ) Cite this article

1545 Accesses

8 Altmetric

Metrics details

The COVID-19 pandemic has led to an unprecedented amount of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to assist professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the evidence in electronic databases. We aimed to investigate deep learning-based machine learning algorithms to classify COVID-19-related publications to help scale up the epidemiological curation process.

In this retrospective study, five different pre-trained deep learning-based language models were fine-tuned on a dataset of 6365 publications manually classified into two classes, three subclasses, and 22 sub-subclasses relevant for epidemiological triage purposes. In a k -fold cross-validation setting, each standalone model was assessed on a classification task and compared against an ensemble, which takes the standalone model predictions as input and uses different strategies to infer the optimal article class. A ranking task was also considered, in which the model outputs a ranked list of sub-subclasses associated with the article.

The ensemble model significantly outperformed the standalone classifiers, achieving a F1-score of 89.2 at the class level of the classification task. The difference between the standalone and ensemble models increases at the sub-subclass level, where the ensemble reaches a micro F1-score of 70% against 67% for the best-performing standalone model. For the ranking task, the ensemble obtained the highest recall@3, with a performance of 89%. Using an unanimity voting rule, the ensemble can provide predictions with higher confidence on a subset of the data, achieving detection of original papers with a F1-score up to 97% on a subset of 80% of the collection instead of 93% on the whole dataset.

This study shows the potential of using deep learning language models to perform triage of COVID-19 references efficiently and support epidemiological curation and review. The ensemble consistently and significantly outperforms any standalone model. Fine-tuning the voting strategy thresholds is an interesting alternative to annotate a subset with higher predictive confidence.


The pandemic coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has led to a historic wave of scientific publications in the biomedical literature [ 1 , 2 ]. As of the beginning of the pandemic, scientific publications related to SARS-CoV-2 and COVID-19 came from the most diverse domains and became available in a myriad of digital repositories (preprint servers, technical reports, peer-reviewed scientific journals, etc.) [ 3 ]. This outbreak of publications grew at an unprecedented rate. In this context, it became challenging for medical experts and epidemiologists to follow the latest scientific developments and for curators to manually review and annotate all the available COVID-19 literature to consolidate the fast-moving existing body of knowledge [ 1 ].

Several methods for producing living systematic reviews have been proposed to provide up-to-date support for professionals dealing with the pace, amount, and complexity of the COVID-19-related literature [ 4 , 5 , 6 , 7 ]. A living systematic review describes a review methodology that allows updating information as soon as new evidence becomes available, rather than the methods applied to classic, time-restricted systematic reviews [ 8 , 9 ]. Moreover, living evidence can narrow the gap between knowledge and practice, as fresh publication findings are swiftly integrated in scientifically informed guidelines [ 5 , 6 , 9 ]. However, the maintenance of living evidence systems still requires continuous manual curation from highly qualified human resources [ 10 , 11 ]. One of the most time-consuming tasks is to screen the titles and/or abstracts resulting from a literature search and to exclude articles that are clearly ineligible, which may comprise a third or more of all records [ 2 ].

To address this challenge, (semi-)automatic curation systems based on text mining and natural language processing (NLP) technologies have been developed to support the review and annotation of large literature corpora [ 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 ]. These systems support the identification and ranking of relevant articles and the categorization of the selected documents into classes and subclasses for reviewing procedures, and enable information extraction from text passages (e.g., identification of disease passages). For example, Textpresso Central [ 16 ] provides a platform that allows users to create a customized annotated corpus by uploading and processing documents of their choosing. Once documents are loaded, personalized curation searches and pipelines can be applied. PubTator Central [ 19 ] is a service for viewing and retrieving bioconcept annotations in full-text biomedical articles. It comprises state-of-the-art text mining models for the annotation of several biomedical entities, such as genes and proteins, diseases, chemicals, and species. SIBiLS [ 20 ] provides an optimized search engine for the biological literature by augmenting its contents with keywords and standardized entities. Variomes [ 22 ] is a system that can perform triage of publications to support evidence-based decisions. Finally, PubTerm [ 13 ] enables the organization of abstracts by terms, co-occurrence of terms, or specific phrases, among others, to facilitate the biomedical curation process.

Automatic text classification appears as an essential methodology to ensure the high quality of living evidence updates. Text classification consists of assigning categorical labels to a given text passage (e.g., an abstract) based on its similarity to the existing labeled examples [ 23 , 24 , 25 ]. Classical text classifiers use statistical document representations, in which the relevance of a word to a document is proportional to its frequency in the document and inversely proportional to its frequency in the collection (the so-called term frequency-inverse document frequency (tf-idf) framework), to create vectorial representations of the documents [ 26 ]. These representations are then used in machine learning models, such as logistic regression and k-nearest neighbors, to learn a mapping function between the input text and the output classes [ 27 , 28 ]. The trained models can then predict the predefined labels for new input representations. These models are, however, limited as they essentially fail to capture the sequential nature of text and the context in which words are embedded.

To overcome the limitations of the tf-idf framework, state-of-the-art text classifiers use deep learning-based language models to create word and document contextual representations, with improved syntactic and semantic features [ 29 ]. Language models are a particular type of probabilistic model that, given a sequence of words, compute the probability distribution of the next word. Recent deep learning-based language models, such as the Bidirectional Encoder Representations from Transformers (BERT) [ 30 ], learn word representations considering both the forward- and backward-direction contexts of a word using a masked word approach, in which random words are masked from a context and the algorithm tries to predict the most likely hidden word. The models are then trained on large corpora, resulting in better word and document representations. These representations are further used as input to other NLP tasks, including text classification and question answering, in a process called transfer learning, which has resulted in significant improvements in state-of-the-art performance in recent years [ 31 ].

In this article, we investigated the use of automatic text classifiers supported by deep learning-based language models to enhance literature triage and annotation in COVID-19 living systematic review systems. Our analysis assessed the effectiveness of different individual deep learning-based language classifiers against two ensemble strategies, in which individual models are combined using either the probability sum of the predictions or a voting strategy where each classifier has a voting right and the classification decision is given to the class obtaining a majority of votes [ 32 , 33 , 34 ].

Methodology

Study design.

An overview of the study design is presented in Fig.  1 . In this retrospective machine learning-based study, we evaluated the performance of different deep learning text classifiers to categorize COVID-19 literature according to their publication type in the COVID-19 Open Access Project (COAP) living evidence database aggregator, which includes publications about SARS-CoV-2 and COVID-19 from PubMed, Embase, medRxiv, and bioRxiv [ 4 ]. Five individual classifiers were trained with the publication title, abstract, and source associated with annotation categories of a living systematic review knowledge base. Publication title, abstract, and source were imputed to the original dataset whenever missing. Remaining publications without title or abstract were excluded from the training and evaluation sets. Then, at inference time, the classifiers were applied to individual records to predict the publication category as output. Two ensemble strategies were created using these predictions [ 32 , 34 ]. The first strategy uses a voting system that takes each classifier output as a vote for a class, while the second considers the sum of the class probabilities attributed by the individual classifiers. For the voting strategy, different cutoffs for the minimal number of votes were applied to compute the final class associated with the publication.

figure 1

Overview of the study design. All articles were manually annotated and then the title, abstract, and source retrieved. In a k-fold cross-validation setting (k is set to 5 in our experiments), 5 models were fine-tuned, and each standalone model was compared against each other as well as against two types of ensemble

Model training and evaluation were performed on a dataset of articles, which were annotated manually by a crowdsourced team of people with training in epidemiology and systematic reviews [ 2 ]. Each article was manually classified across 22 sub-subclasses describing the type of COVID-19 publications according to their study design or article type (case report, ecological study, modelling study, editorial, etc.). The sub-subclasses are nested into three subclasses, namely epidemiologic study designs (EPI), basic biological or other laboratory-based research studies (BASIC) and other types of articles (OTHER). The subclasses are nested into two classes of original research (ORIGINAL) and articles that were commentaries, editorials, or narrative literature reviews (NON-ORIGINAL). The source dataset is publicly available at https://zika.ispm.unibe.ch/assets/data/pub/search_beta/ . To improve the robustness of the results, we trained and evaluated our models using a k-fold cross-validation methodology (k is set to 5 in our experiments). For each fold, 70% of the articles (~ 4.6 k publications) were used to train the model parameters, 10% unseen documents (dev set) were used to optimize the model hyperparameters, and the remaining 20% unseen documents (test set) were used to evaluate the performance of the classifier. The final performance was obtained by averaging the results obtained on the k unseen test sets. We used standard classification metrics — precision, recall, F1-score, and area under the receiver operating characteristics curve (AUC-ROC)— to assess performance of the individual models in comparison to the ensemble and the performance of the latter at different vote majority levels (i.e., simple and absolute). The experiments were performed using the Python package Hugging Face on a Linux machine with a TPU (V3–8).
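As a rough illustration of this evaluation protocol, the sketch below shows how a dataset could be partitioned into the 70/10/20 train/dev/test splits for each of the k = 5 folds and how the fold scores would then be averaged; the function names fine_tune and evaluate are hypothetical placeholders and this is not the authors' implementation.

```python
# Minimal sketch of the k-fold evaluation protocol (k = 5): for each fold,
# ~70% of the articles train the model, ~10% tune hyperparameters (dev set),
# and ~20% are held out for testing; the k test scores are then averaged.
import random

def kfold_splits(records, k=5, seed=42):
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    test_size = n // k
    for fold in range(k):
        test = shuffled[fold * test_size:(fold + 1) * test_size]
        rest = shuffled[:fold * test_size] + shuffled[(fold + 1) * test_size:]
        dev_size = n // 10
        dev, train = rest[:dev_size], rest[dev_size:]
        yield train, dev, test

# scores = [evaluate(fine_tune(train, dev), test)
#           for train, dev, test in kfold_splits(dataset)]
# final_score = sum(scores) / len(scores)   # average over the k unseen test sets
```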

Dataset description and preprocessing

The COAP data snapshot version used in our experiments contains 6365 publications annotated between 7th January and 10th December 2020. Table 1 shows the distribution of publications across classes, subclasses, and sub-subclasses in the COAP snapshot dataset. The categories are imbalanced for the three categorization levels, as is typically the case for real-world data. Illustratively, the BASIC: Within-host modelling sub-subclass comprises only 0.5% of the collection (31 documents), while the OTHER: Comment, editorial, …, non-original sub-subclass accounts for 27.6% (1758 documents). There are 799 documents for the BASIC subclass and 3665 documents for the EPI subclass, the latter accounting for 57.6% of the dataset. At the class level, the ORIGINAL class accounts for 70.1% of the dataset, with the remaining documents (29.9%) belonging to the NON-ORIGINAL class.

In the pre-processing phase, the title, abstract, and source fields were concatenated before being fed to a classifier, and each classification model used its own tokenizer in order to separate the free-text passages into tokens (words or sub-words) [ 39 , 40 , 41 , 42 ]. All model tokenizer specificities are given in their respective papers (see Table 2 ).
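The snippet below sketches this preprocessing step with the Hugging Face tokenizer API; the checkpoint name, example record, and field separator are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative preprocessing: title, abstract, and source are concatenated and
# split into (sub-)word tokens by the model's own tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # example checkpoint
)

record = {
    "title": "Modelling SARS-CoV-2 transmission in households",
    "abstract": "We simulated secondary attack rates under different assumptions ...",
    "source": "medRxiv",
}
text = " ".join([record["title"], record["abstract"], record["source"]])
token_ids = tokenizer(text, truncation=True, max_length=512)["input_ids"]
print(tokenizer.convert_ids_to_tokens(token_ids)[:12])  # first few (sub-)word tokens
```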

Classification models

In our experiments, we used the pre-trained models shown in Table 2 , which were originally pre-trained using the masked language model task. In a masked language model task, large corpora, such as Medline or Wikipedia, are used to create low-dimensional word (or sub-word) representations in context. In each training step, a sentence taken from the corpus is provided to the model with (sub-)words masked. The model is then trained to predict the masked (sub-)words for that context. The resulting model encodes contextualized (sub-)words in a low-dimensional space, and the optimal tensorial representations can then be used in downstream tasks, such as text classification, a process called transfer learning. Two out of the five models (RoBERTa-base and RoBERTa-large) were pre-trained on a general corpus, created using BookCorpus and Wikipedia, while the three other models (COVID-Twitter-BERT, BioBERT, and PubMedBERT) were pre-trained on biomedical corpora. Among the models trained on biomedical corpora, one was pre-trained on a COVID-19-related corpus, and one can be considered large, with 340 M parameters. All specificities of the models can be found in the related literature (see Table 2 ).
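As an illustration of this transfer learning setup, the sketch below loads publicly available checkpoints of the model families listed in Table 2 and adds a 22-way classification head to each; the exact checkpoint names and versions used by the authors may differ.

```python
# Each pre-trained language model is given a fresh classification head with
# 22 output labels (one per sub-subclass) before fine-tuning.
from transformers import AutoModelForSequenceClassification

checkpoints = [
    "roberta-base",
    "roberta-large",
    "digitalepidemiologylab/covid-twitter-bert-v2",
    "dmis-lab/biobert-base-cased-v1.1",
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
]

models = {
    name: AutoModelForSequenceClassification.from_pretrained(name, num_labels=22)
    for name in checkpoints
}
```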

Individual deep learning-based classifier for biomedical literature classification

Transformer models [ 43 ] with a fully connected perceptron layer on top of the output attention layer were used to discriminate the sub-subclasses of given documents. Using the pre-trained language model classifiers, knowledge acquired by the model in the pre-training phase can be transferred to the specific task during the so-called fine-tuning phase, in which task-specific examples are given to the pre-trained model so that its parameters can be updated for the task at hand [ 30 ]. In our case, the specific classification task consists of fine-tuning the models on a subset (training set) of the manually annotated dataset, followed by the classification of documents from another unseen subset (test set) among the 22 sub-subclasses of the knowledge base. At the inference phase, the model extracts features from the document metadata (i.e., title, abstract, and source) and outputs a probability for each of the 22 sub-subclasses. As sub-subclasses are mutually exclusive, for a given document, the sum of all the probabilities across sub-subclasses is equal to 1. Additionally, predictions with respect to the subclass and class levels were computed. To do so, the probabilities for the sub-subclasses belonging to a subclass (or class) are summed. In other words, the probability of a document being classified in a given class is the sum of the probabilities of that document being classified in all the sub-subclasses mapped to that class, following the mapping in Table 1 . The predicted category, i.e., class, subclass, or sub-subclass, is then the one with the highest predicted probability.
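The following sketch illustrates this hierarchical roll-up of probabilities; the mapping shown is a truncated, hypothetical excerpt of Table 1, not the full category tree.

```python
# Sub-subclass probabilities are summed into their parent subclass and class.
SUBCLASS_AND_CLASS = {  # sub-subclass -> (subclass, class); partial illustration only
    "EPI: Case report": ("EPI", "ORIGINAL"),
    "EPI: Modelling study": ("EPI", "ORIGINAL"),
    "BASIC: Sequencing and phylogenetics": ("BASIC", "ORIGINAL"),
    "OTHER: Comment, editorial, non-original": ("OTHER", "NON-ORIGINAL"),
}

def roll_up(sub_sub_probs):
    """sub_sub_probs: dict mapping each sub-subclass to its predicted probability."""
    subclass_probs, class_probs = {}, {}
    for label, p in sub_sub_probs.items():
        sub, cls = SUBCLASS_AND_CLASS[label]
        subclass_probs[sub] = subclass_probs.get(sub, 0.0) + p
        class_probs[cls] = class_probs.get(cls, 0.0) + p
    return subclass_probs, class_probs

# The predicted subclass/class is simply the key with the highest summed probability.
```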

Figure 2 shows the publication classification workflow. The model starts with a publication containing a title, an abstract, and a source. The text contained in those three fields is concatenated, and a tokenizer splits it into tokens (e.g., words or sub-words). Each token is then mapped to a token ID, which allows the language model to look up a vectorial representation of that token. In our example, the word “Study” is split into the “Stu” and “#dy” sub-words. “Stu” has token ID 51 and finds its vectorial representation in the 51st row of the model's embedding matrix. Once retrieved, the language model receives its vector representation \({v}_{51}\) as input along with all the other token representations. The language model then passes the publication representation to a classifier, which outputs a probability for each sub-subclass.

figure 2

Publication classifier workflow. The model starts with the title, abstract, and source fields and concatenates their text contents before tokenizing it. Each model computes their predictions, and an ensemble strategy, voting or probability sum, combines them to get a final prediction

Ensemble: voting and probability sum strategies

Assembling models can be performed by making the individual models vote for a category. In the default version, the final category is defined by the highest number of votes. A minimal number of votes required to trigger a voting ensemble prediction (threshold) can also be used. In this setting, an unknown prediction, that is, the model is unsure about the category, is possible when there is a tie or when the number of votes is below the threshold (i.e., there is no unanimity). With this ensemble strategy, only the class level (binary) is guaranteed to always receive a prediction when the threshold is equal to 3 in our setting (5 models). Alternatively, a probability sum strategy can be used to create the ensemble. The idea is to sum the probabilities of the classifiers for all the categories and then take the most probable category as the ensemble classification. Unless stated otherwise, the probability sum strategy is the default ensemble, as this method always gives a unique prediction in every situation. In Fig. 2 , as an example, 3 out of 5 models predicted the EPI subclass, so the voting ensemble ended up predicting the EPI subclass. For the probability sum strategy, the sum of all subclass predictions across the 5 models gives a score of 3.1 for the EPI subclass, which makes it the highest score among all the subclasses. Even if in this case the predictions are the same for both strategies, it is worth noting that this is not systematically the case.
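Below is a minimal sketch of the two strategies, with per-model predictions represented as dictionaries of label probabilities; it mirrors the description above rather than the authors' exact code.

```python
from collections import Counter

def probability_sum(per_model_probs):
    """per_model_probs: list of {label: probability} dicts, one per standalone model."""
    totals = Counter()
    for probs in per_model_probs:
        totals.update(probs)                    # adds probabilities label-wise
    return totals.most_common(1)[0][0]          # label with the highest summed score

def vote(per_model_probs, min_votes=3):
    """Each model votes for its most probable label; ties or too few votes -> unknown."""
    votes = Counter(max(probs, key=probs.get) for probs in per_model_probs)
    label, count = votes.most_common(1)[0]
    tie = sum(1 for c in votes.values() if c == count) > 1
    if count < min_votes or tie:
        return "unknown"
    return label
```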

Model interpretation

To gain insight into the impact of individual words on the model, the integrated gradients method [ 44 ] was applied, using the Captum [ 45 ] implementation, to the PubMedBERT model at the subclass level. According to this method, the higher a token scores, the more important it is to the prediction, and the score polarity indicates a positive or negative impact on the classification. This experiment is twofold. First, about 600 never-seen documents were classified, and the 20 words with the highest positive impact for each subclass prediction were reported. To deal with tokenized sub-words, a word score was computed as the mean over all its sub-word components. Then, to reflect a more general impact of a given word for a subclass, each word was lemmatized, and the word score was computed as the mean of the respective lemmatized word scores. This way, a word and its plural are merged; for example, “simulation” and “simulations” gather their scores and attribute them to the lemmatized word “simulation.” To avoid non-generalizable high-impact words, only words with at least 5 occurrences were considered. In the second part of this experiment, a few publication scores were analyzed. To do so, the sample of analyzed documents was driven by the top-20 positive-word statistics.
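A hedged sketch of this attribution step is shown below, using Captum's LayerIntegratedGradients on the embedding layer of a fine-tuned Hugging Face classifier; variable names and the target label index are illustrative, and the default zero baseline is an assumption rather than the authors' exact configuration.

```python
from captum.attr import LayerIntegratedGradients

def token_attributions(model, tokenizer, text, target_label):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    def forward(input_ids, attention_mask):
        return model(input_ids=input_ids, attention_mask=attention_mask).logits

    lig = LayerIntegratedGradients(forward, model.get_input_embeddings())
    attributions = lig.attribute(
        inputs=enc["input_ids"],
        additional_forward_args=(enc["attention_mask"],),
        target=target_label,                      # index of the category of interest
    )
    scores = attributions.sum(dim=-1).squeeze(0)  # one attribution score per token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return list(zip(tokens, scores.tolist()))
```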

Statistical analysis

To evaluate our models, standard multiclass classification metrics were used, namely precision, recall, F1-score, and AUC-ROC [ 26 ]. Precision describes the proportion of correctly classified documents over all the documents assigned by the model to the same class:

\(Precision = \frac{tp}{tp + fp}\)

where tp is the number of true positives and fp is the number of false positives. Recall describes the proportion of correctly classified documents among all the positive documents for a given class:

\(Recall = \frac{tp}{tp + fn}\)

where fn is the number of false negatives. Finally, the F1-score can be formulated as the harmonic mean of the model precision and recall:

\(F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}\)

For these three metrics, the closer the result is to 1, the better the model performance. Lastly, AUC-ROC computes the area under the ROC curve, where the ROC plots, for varying classification thresholds, the tp rate (recall or sensitivity) against the fp rate (1 - specificity):

\(fp\;rate = \frac{fp}{fp + tn}\)

where tn is the number of true negatives.

To get a confidence interval (CI) of the AUC-ROC, bootstrapping with n  = 2000 resamples was performed. The 2.5% and 97.5% percentiles of the resulting distribution were reported to obtain a 95% CI. The McNemar test was used for statistical significance testing [ 46 ].
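The bootstrap procedure can be sketched as follows; scikit-learn's roc_auc_score is used here for illustration, and the resampling details are assumptions consistent with the description above rather than the authors' code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_resamples=2000, seed=42):
    """95% CI for AUC-ROC via bootstrapping (2.5% and 97.5% percentiles)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # skip degenerate resamples
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, 2.5), np.percentile(aucs, 97.5)
```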

In the ranking experiments, the model predicts a ranked list of sub-subclasses according to their probabilities for a given input document. Thus, we use standard information retrieval metrics to report our results. The precision at rank k (@k) is the precision across the first k sub-subclasses returned by our classifiers. As it is a multi-class problem, each document belongs to only one true class; thus, the theoretical maximum precision@k is equal to 1/k. By analogy, recall@k is computed across the first k sub-subclasses. Conversely to precision, as k increases, recall@k approaches 1. As there are 22 sub-subclasses, by definition, recall@22 is equal to 1. Finally, the mean average precision (MAP) @k is the mean of all the average precisions (AP) @k, which is defined as follows:

\(AP@k = \frac{1}{N_{Relevant}} \sum_{i=1}^{k} P(i) \cdot rel(i)\)

where P(i) is the precision at position i, rel(i) is a function equal to 1 if the i-th returned document is relevant and 0 otherwise, and \(N_{Relevant}\) is the number of documents relevant for a given query. As our classification problem is mutually exclusive, \(N_{Relevant}\) is equal to 1 and P@1 = R@1 = MAP@1. Compared to traditional classification metrics, which only consider the top model prediction, the ranking metrics help us to understand how good the top-k classification predictions are.
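For a mutually exclusive problem with a single relevant label per document, these ranking metrics reduce to simple rank checks, as in the sketch below (illustrative helper functions, not the evaluation code used in the study).

```python
def recall_at_k(ranked_labels, true_label, k):
    """1 if the true sub-subclass appears in the top-k ranked predictions, else 0."""
    return 1.0 if true_label in ranked_labels[:k] else 0.0

def average_precision_at_k(ranked_labels, true_label, k):
    # With a single relevant label, AP@k equals 1/rank when the label is in the top k.
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0

def map_at_k(rankings, true_labels, k):
    aps = [average_precision_at_k(r, t, k) for r, t in zip(rankings, true_labels)]
    return sum(aps) / len(aps)
```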

Classification performance

Tables 3 , 4 , 5 show the performance of the different models using the F1-score metric at the class, subclass, and sub-subclass levels, respectively. The ensemble significantly outperformed the best standalone model, with a micro F1-score of 89% (Table 3 ). PubMedBERT obtained the best F1-score among the standalone models for all the classes. When comparing the standalone models to each other, there is no significant difference. Although the improvement of the ensemble with respect to the PubMedBERT model is statistically significant, it accounts for less than a point for both the micro and macro F1-scores. At the subclass level (Table 4 ), similarly to the class level, the ensemble significantly outperformed all single models, in this case by more than a percentage point for both micro and macro F1-scores (86% vs. 85% micro F1-score and 84% vs. 83% macro F1-score), and it is also consistently the best-performing model across all the subclasses. PubMedBERT was again the overall best standalone model at the subclass level, with micro and macro F1-scores of 85% and 83%, respectively. At the sub-subclass level (Table 5 ), the ensemble achieved the best micro and macro average F1-scores (70% and 55%), a statistically significant improvement, and had the highest F1-score for 10 sub-subclasses, with 3 of these improvements being statistically significant. For the standalone models, PubMedBERT had the best micro F1-score (67%), while RoBERTa-large presented the best macro F1-score (53%). The relevant gap between the aggregated scores (micro and macro F1-scores) of Tables 4 and 5 suggests that there were more intra-level than inter-level misclassifications. In other words, misclassified sub-subclasses were often confused with sub-subclasses belonging to the same subclass. Finally, Table 6 shows the AUC-ROC performance and the respective 95% CIs for each level. Here, the ensemble systematically reports higher performance than any standalone model. When compared to BioBERT, the best standalone model on this metric, there is no CI overlap at any level, confirming the statistically significant improvement of the ensemble model.

The worst-performing sub-subclasses (F1-score < 30.00), namely EPI: Other , BASIC: Basic research review , BASIC: Within-host modelling , and OTHER: Other , are all underrepresented in the dataset, accounting for only 2.0%, 2.1%, 0.5%, and 2.2%, respectively. The poor performance for these classes had a negative impact on the macro average F1-score, which is below the micro average for all the models. In contrast, the best-performing sub-subclasses (F1-score > 70.00), namely EPI: Case report , EPI: Modelling study , EPI: Review , BASIC: Animal experiment , BASIC: Sequencing and phylogenetics , and OTHER: Comment, editorial, …, non-original , accounted for 3.8%, 12.7%, 11.4%, 0.7%, 3.8%, and 27.6% of the dataset, respectively. These 6 sub-subclasses (about 30% of the sub-subclasses) account for about 60% of the collection, yet with a high variance in their distribution. These results suggest that the number of training examples alone is not enough to explain the model performance, and that textual features in the title + abstract + source fields and/or the category definition make some classes easier to learn.

Analyses of the ensemble model

In Fig. 3 , we analyzed major aspects of the ensemble outcomes. In Fig. 3 A, the ensemble precision/recall curve is plotted against the curves for the RoBERTa base and large models for the ORIGINAL class. As we can notice, the ensemble curve is consistently above both RoBERTa models, which shows the robustness of using a probability sum strategy for assembling models. The precision/recall curve obtained by the ensemble model for the 22 sub-subclasses is presented in Fig. 3 B. The same under-performing sub-subclasses as previously spotted in the strict classification results can be distinguished, in particular EPI: Other , BASIC: Basic research review , BASIC: Within-host modelling , and OTHER : Other (as in Table 3 ). This demonstrates that the low performance obtained for these categories is not a result of the classification threshold tuning. Despite their poor performance, they are well above a random classifier baseline, which would have a theoretical constant precision of about 0.05 (1/22 sub-subclasses).

figure 3

A Precision/recall curves of the ORIGINAL class for the RoBERTa base/large and the ensemble. B Precision/recall curves obtained by the ensemble model for the sub-subclasses. Well-represented sub-subclasses usually perform better than underrepresented ones

Figure 4  shows the confusion matrix for the different classification levels obtained by the ensemble model. As we can see from Fig. 4 A and B, the ensemble tends to predict EPI subclass when misclassifying a document. When switching from Fig. 4 A to B, the EPI confusion is split from the BASIC class into both BASIC and OTHER . For the sub-subclass level (Fig. 4 C), the EPI: Review class [ 13 ] was consistently confused with the BASIC: Basic research review [ 20 ]. This confusion is expected considering that both sub-subclasses refer to review documents. Moreover, the ensemble tends to get confused for some of the EPI: … study sub-subclasses, predicting often Cohort [ 4 ] instead of Case–control [ 3 ], Cross-sectional [ 5 ] instead of Qualitative [ 12 ], Modelling [ 9 ] instead of Ecological [ 7 ], and others. There is also a clear confusion cluster when the ensemble predicts Biochemical/protein structure studies [ 17 ] and Sequencing and phylogenetics [ 18 ], as these documents are often confused with some of the BASIC sub-subclasses (in particular from 15 to 19). These observations reinforce our previous hypothesis that sub-subclasses were often misclassified inside the same subclass. It becomes more evident if we focus on the sub-subclass confusion matrix by square segments as highlighted in Fig. 4 C (horizontal and vertical gray lines): from index 1 to 14 \(\to\) EPI , from index 15 to 20 \(\to\) BASIC , and for index 21 and 22 \(\to\) OTHER . All shady squares inside this perimeter (the majority) are intra-subclass misclassifications, while the ones outside are inter-subclass misclassifications. Lastly, a vertical line of confusion can also be observed for the OTHER: Comment, editorial, …, non-original sub-subclass predictions, which the ensemble tends to predict for a wide variety of documents (more precisely 8, 10–13, 20–21). The broad definition of this category is likely the reason for its confusion with so many other sub-subclasses.

figure 4

Confusion matrix for class ( A ), subclass ( B ), and sub-subclass ( C ). The ensemble has a higher probability of confusing sub-subclasses inside their nested subclasses and classes which is why performances tend to be higher at those higher levels

Ranking analysis

Table 7 shows the ranking performance for the standalone models and the ensemble. BioBERT performed better than all the other standalone models on the ranking metrics, whereas PubMedBERT tended to perform best from the strict classification perspective. However, in both perspectives, the ensemble achieves the highest performance across all models. In fact, the ensemble returns the right sub-subclass in the top-1 position in 71% of cases, with a precision@3 of 30% (theoretical maximum of 33%) and a recall@3 of 89%. This means that in almost 9 out of 10 document classifications, the ensemble returned the correct sub-subclass in the top 3. Moreover, the ensemble obtained a MAP@3 of 79%, an improvement of more than 2.5 points with respect to the best standalone model ( BioBERT ).

k-vote analysis

In Fig. 5 , we show the strict classification performance for the ORIGINAL class using the ensemble for different voting thresholds. The threshold for the number of votes ( t ) corresponds to the minimal number of votes for a category required for the ensemble to trigger a classification decision. In contrast, the probability threshold per vote ( t v ) refers to the probability a single model needs to reach to vote for a given category. When this probability threshold is not met, the model is not allowed to vote. Such voting strategies make unknown predictions possible, reducing the size of the classified set. In addition to static voting thresholds (t = 3, 4, or 5), a dynamic threshold, for majority and unanimity, is introduced, in which the total number of votes can change depending on unknown predictions from individual classifiers. This means that if 2 classifiers (out of 5) were to predict unknown for a publication, the dynamic majority and unanimity thresholds would be set at 2 and 3, respectively.
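A minimal sketch of these dynamic rules is given below; it assumes per-model probability dictionaries and is meant only to make the abstention and threshold logic concrete, not to reproduce the authors' implementation.

```python
def dynamic_vote(per_model_probs, t_v=0.5, rule="unanimity"):
    """A model abstains when its top probability is below t_v; the majority or
    unanimity threshold is then computed from the number of models that voted."""
    votes = {}
    for probs in per_model_probs:
        label = max(probs, key=probs.get)
        if probs[label] >= t_v:
            votes[label] = votes.get(label, 0) + 1
    n_voters = sum(votes.values())
    if n_voters == 0:
        return "unknown"
    needed = n_voters if rule == "unanimity" else n_voters // 2 + 1
    top_label, top_count = max(votes.items(), key=lambda kv: kv[1])
    return top_label if top_count >= needed else "unknown"
```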

figure 5

F1-score ( A )/precision ( B )/recall ( C ) for the ORIGINAL class with respect to a probability threshold per vote when using the voting strategy across the predictions on the class level. Using different thresholds improves considerably performance while reducing the number of predicted publications

The behavior of the ORIGINAL class prediction in terms of F1-score is presented in Fig. 5 A. As it is a binary problem, setting a dynamic majority and a static one ( t  =  3 ) while t v  = 0.5 produced the same results, shown as a full-size dot at around 92%. This happens because there is always a predicted class with a probability above t v  = 0.5; hence, all the models end up voting. Overall, there is an average of about 93% F1-score on most of the dataset across all the t v values when using majority voting rules and 97% F1-score on a subset of about 80% of the dataset when using the static unanimity voting rule. In other words, for the ORIGINAL class, more confident results can be obtained (about 4 points of F1-score growth) on a subset representing about 80% of the collection when switching from a majority to a static unanimity voting rule. The respective performance in terms of precision and recall is shown in Fig. 5 B and C. We can notice that recall is consistently higher than precision, which means that this ensemble strategy is better at retrieving ORIGINAL articles than at refining the selection. The observed trend is similar to the F1-score performance, where we trade a 100% dataset classification and a precision of about 91.5% for a precision of about 96% on about 80% of the dataset with a fixed t v  = 0.5 when switching from a majority to a static unanimity voting rule. A recall of about 99% and a F1-score of about 98.5% are achieved on 50% of the dataset when setting t v  = 0.99 and t  = 5, enabling the classification of half of the publications with almost no mistakes.

Figures 6 A to C show the top 20 positive impact words for the EPI , BASIC , and OTHER subclasses. When taking a close look at some lexical fields, in the EPI subclass for instance, documents containing “modeling,” “mathematical,” “modelling,” “simulation,” “simulated,” and “equation” are all related to the EPI: Modelling study sub-subclass. Indeed, in the subset of 38 documents containing at least one of those words, 37 were classified by the model as EPI: Modelling study . In BASIC , the same applies to the “seq” and “sequence” lexicons, where 27 publications out of 28 were classified by the model as either BASIC: Sequencing and phylogenetics or BASIC: Biochemical/protein structure studies . In other words, the model clearly seems to retain high-importance words at the sub-subclass level, which makes sense as it is the level the model was fine-tuned on. As for OTHER , the classifier seems to attribute a lot of credit to the word “viewpoint” for OTHER: Comment, editorial, …, non-original publications, with 7 out of 7 publications containing the word classified as such.

Figures 6 D to F depict three publications highlighted using their integrated gradient scores. The publication in Fig. 6 D (Footnote 1) was chosen because it illustrates the usage of the top BASIC impact words, whereas the publications in Fig. 6 E (Footnote 2) and F (Footnote 3) were selected because they emphasize the highest EPI impact words while giving an example of a negative impact word. In Fig. 6 D, the model predicts the BASIC label with 98% probability, and the impact words seem to focus on the “sequence analysis” part, with “sequence” being the top impact word on average for that subclass. A look at the sub-subclass prediction level gives a probability of about 95% for the BASIC: Sequencing and phylogenetics sub-subclass. In Fig. 6 E, there is an example of a “sectional” occurrence, the reported most important word for the EPI subclass. In our set, the word appears in 7 documents, each time along with the words “cross” and “study.” This publication is classified in the EPI: Cross-sectional study sub-subclass with a probability of 96%. Interestingly, all 7 documents were classified as EPI: Cross-sectional study except for the publication of Fig. 6 F, which was classified as EPI: Cohort study with 74% probability, and for which the classifier seems to give more importance to the word “retrospective” in the methods section than to “sectional” in the design section. As both sub-subclasses are nested in the same subclass, the publication is still classified in the EPI subclass with a high probability of 98%.

figure 6

A , B , and C Top 20 positive impact words for either EPI ( A ), BASIC ( B ), or OTHER ( C ) subclasses when taking the integrated gradient on a never-seen set of about 600 documents. D , E , and F Classification examples with a focus on passages with impact word scores

In this article, we introduce an efficient methodology to assist epidemiologists and biomedical curators to screen articles for inclusion in living systematic reviews by providing a COVID-19 literature triage solution based on deep learning methods. Supported by an existing manually classified collection, we proposed a classification method that automatically assigns categories from a living evidence knowledge base to scientific documents using BERT-like language models, based on which we proposed two methods to combine individual model predictions (probability sum and voting). The results demonstrate that the ensemble performs consistently better than any standalone model, statistically improving upon the best standalone baseline on both strict classification and ranking tasks.

Error analyses for the living evidence dataset used in our experiments showed that classification confusion often happens at the intra-category level. This helps to explain the difference in performance observed when zooming out from the sub-subclass to the class level, for which the micro F1-score goes from almost 70% to almost 90%. We believe that, in this case, there are important patterns within categories that the machine learning models can identify and exploit to provide correct predictions at the class and subclass levels. On the other hand, at the sub-subclass level, we expect that documents are often related to more than one category, that is, they are mostly within one category but may also contain information associated with another category, which could lead to confusion of the classifier when assigning the sub-subclass, a phenomenon which also occurs during human annotation. Hence, we believe that a multi-label assignment strategy at the sub-subclass level could be an interesting alternative in the original annotation protocol.

Given the strong performance of the proposed classifier, it could be used to support the annotation of scientific articles and help to speed up, augment, and scale up epidemiological reviews and biomedical curation. When looking at the problem from a ranking perspective, in which the system suggests a list of sub-subclasses for a given article, the ensemble returned the right category in its top 3 suggestions for almost 90% of the cases. Such a robust performance could help augment the annotation process, for example, by enabling human annotators to double the number of screened articles, replacing one annotator with the machine annotation in the standard double annotation process. In this setting, if the category proposed by the human annotator matched one of the top 3 categories proposed by the automatic classifier, this category would be deemed validated. Otherwise, it would be sent to a senior annotator for a final decision on the remaining 10% of the cases. Considering that a typical inter-annotator agreement in the health and life sciences field is around 80% [ 47 ], this setup could reduce the number of human resources required by at least 50% while maintaining the high quality of the annotations. Alternatively, when using a voting strategy with a confidence threshold, we showed that our method achieved robust and superior performance on a subset of the collection at the class level (about 98.5% F1-score on 50% of the dataset). This approach could be used, for example, in the triage process, when a large batch of articles needs to be classified, thus scaling up the classification process.
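The proposed validation rule can be made concrete with a small sketch; the function and field names below are illustrative, not part of an existing annotation tool.

```python
def validate_annotation(human_label, model_top3):
    """Keep the human label if it appears in the classifier's top-3 suggestions;
    otherwise escalate the record to a senior annotator."""
    if human_label in model_top3:
        return {"label": human_label, "status": "validated"}
    return {"label": None, "status": "senior review required"}

# Example:
# validate_annotation("EPI: Cohort study",
#                     ["EPI: Cohort study", "EPI: Case-control study", "EPI: Other"])
# -> {'label': 'EPI: Cohort study', 'status': 'validated'}
```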

The interpretability analysis showed that the model is not a complete black box, as is often the case in deep learning applications. Using the integrated gradients method helped to understand why the model classified a publication into one sub-subclass instead of another. These results could additionally be used by annotation experts as a tool to highlight documents during the curation process. It would also be interesting to investigate the results of this analysis at the subclass level, which we believe could lead to a lexicon defining each subclass. Such approaches could then be combined to get multiple views by category level, which could be further assembled to obtain better publication insights and perhaps better screening results. We leave this investigation for future work.

A main limitation of the study is that it uses a dataset from only one living evidence knowledge base to train and evaluate the models. Thus, it is unclear how the proposed methodology will generalize to corpora and categories used in other reviews and living evidence knowledge bases. That said, given the strong performance obtained on other corpus types by a similar methodology [ 34 ], we believe that it should generalize well. Second, in our experiments, we did not explore the full contents of the articles. This is due to the unavailability of the full text for a large portion of the collection, caused by either paywalls or restrictions by publishers on processing full text with NLP pipelines. Additionally, as the time complexity of the models used is quadratic in the number of words, the computation time becomes prohibitive as we move from abstract to full-text content. Nevertheless, we believe that valuable information supporting the classification can sometimes only be found in the full text of the manuscripts. An extended version of the approach could investigate such corpora.

Conclusions

In this work, we described an effective methodology to perform automatic classification of COVID-19-related literature to support creation of systematic living reviews and living evidence knowledge bases. The proposed ensemble model provided strong (semi-)automatic classification performance, significantly outperforming standalone methods, and enabled the categorization of a subset of the collection with improved accuracy. Hence, this approach could serve as an alternative assistant to professionals dealing with the COVID-19 pandemic literature outbreak. Ultimately, our method provides a performant and generic procedure, enabling efficient annotation of important volumes of scientific literature, which could be leveraged to assist experts in different literature classification tasks and extended to different types of review methodologies.

Availability of data and materials

The datasets used and analyzed during the current study are available in the COAP living evidence database: https://zika.ispm.unibe.ch/assets/data/pub/ncov/ . The training, testing, and ensemble source codes are available under https://github.com/ds4dh/CovidReview .

https://www.biorxiv.org/content/10.1101/2020.04.07.029488v1.full

https://pubmed.ncbi.nlm.nih.gov/32237161/

Abbreviations

BERT: Bidirectional encoder representations from transformers

NLP: Natural language processing

COAP: COVID-19 Open Access Project

AUC-ROC: Area under the receiver operating characteristic curve

CI: Confidence interval

MAP: Mean average precision

Chen Q, Allot A, Lu Z. LitCovid: an open database of COVID-19 literature. Nucleic Acids Res. 2021;49(D1):D1534–40.


Ipekci AM, Buitrago-Garcia D, Meili KW, Krauer F, Prajapati N, Thapa S, et al. Outbreaks of publications about emerging infectious diseases: the case of SARS-CoV-2 and Zika virus. BMC Med Res Methodol. 2021;50–50.

Lu Wang L, Lo K, Chandrasekhar Y, Reas R, Yang J, Eide D, et al. CORD-19: the Covid-19 Open Research Dataset. 2020 Available from: https://search.bvsalud.org/global-literature-on-novel-coronavirus-2019-ncov/resource/en/ppcovidwho-2130 . [Cited 29 Jun 2022].

Counotte M, Imeri H, Leonie H, Ipekci M, Low N. Living evidence on COVID-19. 2020 Available from: https://ispmbern.github.io/covid-19/living-review/ . [Cited 29 Jun 2022].

The COVID-NMA initiative. Available from: https://covid-nma.com/ . [Cited 29 Jun 2022].

National COVID-19 Clinical Evidence Taskforce. Available from: https://covid19evidence.net.au/ . [Cited 29 Jun 2022].

COVID-19: living systematic map of the evidence. Available from: http://eppi.ioe.ac.uk/cms/Projects/DepartmentofHealthandSocialCare/Publishedreviews/COVID-19Livingsystematicmapoftheevidence/tabid/3765/Default.aspx/ . [Cited 29 Jun 2022].

Elliott JH, Turner T, Clavisi O, Thomas J, Higgins JPT, Mavergames C, et al. Living systematic reviews: an emerging opportunity to narrow the evidence-practice gap. PLOS Med. 2014;11(2): e1001603.


Tendal B, Vogel JP, McDonald S, Norris S, Cumpston M, White H, et al. Weekly updates of national living evidence-based guidelines: methods for the Australian living guidelines for care of people with COVID-19. J Clin Epidemiol. 2021;1(131):11–21.


Baumgartner WA, Cohen KB, Fox LM, Acquaah-Mensah G, Hunter L. Manual curation is not sufficient for annotation of genomic databases. Bioinforma Oxf Engl. 2007;23(13):i41–8.


Bourne PE, Lorsch JR, Green ED. Perspective: sustaining the big-data ecosystem. Nature. 2015;527(7576):S16-17.

Chai KEK, Lines RLJ, Gucciardi DF, Ng L. Research Screener: a machine learning tool to semi-automate abstract screening for systematic reviews. Syst Rev. 2021;10(1):93.

Garcia-Pelaez J, Rodriguez D, Medina-Molina R, Garcia-Rivas G, Jerjes-Sánchez C, Trevino V. PubTerm: a web tool for organizing, annotating and curating genes, diseases, molecules and other concepts from PubMed records. Database J Biol Databases Curation. 2019;8:2019.


Hirschman L, Burns GAPC, Krallinger M, Arighi C, Cohen KB, Valencia A, et al. Text mining for the biocuration workflow. Database. 2012;2012:bas020.

Lee K, Famiglietti ML, McMahon A, Wei CH, MacArthur JAL, Poux S, et al. Scaling up data curation using deep learning: an application to literature triage in genomic variation resources. PLOS Comput Biol. 2018;14(8): e1006390.

Müller HM, Van Auken KM, Li Y, Sternberg PW. Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinformatics. 2018;19(1):94.

O’Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4(1):5.

Van Auken K, Fey P, Berardini TZ, Dodson R, Cooper L, Li D, et al. Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database J Biol Databases Curation. 2012;2012:bas040.

Wei CH, Allot A, Leaman R, Lu Z. PubTator Central: automated concept annotation for biomedical full text articles. Nucleic Acids Res. 2019;47(W1):W587–93.


Gobeill J, Caucheteur D, Michel PA, Mottin L, Pasche E, Ruch P. SIB literature services: RESTful customizable search engines in biomedical literature, enriched with automatically mapped biomedical concepts. Nucleic Acids Res. 2020;48(W1):W12–6.

Pasche E, Mottaz A, Caucheteur D, Gobeill J, Michel PA, Ruch P. Variomes: a high recall search engine to support the curation of genomic variants. Bioinformatics. 2022;38(9):2595–601.

Mottaz A, Pasche E, Michel PAA, Mottin L, Teodoro D, Ruch P. Designing an optimal expansion method to improve the recall of a genomic variant curation-support service. Stud Health Technol Inform. 2022;294:839–43.


Dhar A, Mukherjee H, Dash NS, Roy K. Text categorization: past and present. Artif Intell Rev. 2021;54(4):3007–54.

Sebastiani F. Machine learning in automated text categorization. ACM Comput Surv. 2002;34(1):1–47.

Teodoro D, Knafou J, Naderi N, Pasche E, Gobeill J, Arighi CN, et al. UPCLASS: a deep learning-based classifier for UniProtKB entry publications. Database. 2020;2020:baaa026.

Manning C, Schütze H. Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press; 1999. p. 718.

Teodoro D, Gobeill J, Pasche E, Ruch P, Vishnyakova D, Lovis C. Automatic IPC encoding and novelty tracking for effective patent mining. Tokyo, Japan; 2010. p. 309–17.

Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. Springer; 2009. Available from: https://web.stanford.edu/~hastie/Papers/ESLII.pdf

Peters ME, Ammar W, Bhagavatula C, Power R. Semi-supervised sequence tagging with bidirectional language models. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics; 2017. p. 1756–65. Available from: https://aclanthology.org/P17-1161 . [Cited 29 Jun 2022].

Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. ArXiv181004805 Cs. 2019 May 24 [cited 2020 May 1]; Available from: http://arxiv.org/abs/1810.04805

Aum S, Choe S. srBERT: automatic article classification model for systematic review using BERT. Syst Rev. 2021;10(1):285.

Knafou J, Naderi N, Copara J, Teodoro D, Ruch P. BiTeM at WNUT 2020 Shared Task-1: named entity recognition over wet lab protocols using an ensemble of contextual language models. In: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020). Online: Association for Computational Linguistics; 2020. p. 305–13. Available from: https://aclanthology.org/2020.wnut-1.40 . [cited 29 Jun 2022].

Copara J, Naderi N, Knafou J, Ruch P, Teodoro D. Named entity recognition in chemical patents using ensemble of contextual language models [Internet]. arXiv; 2020 [cited 2022 Jun 29]. Available from: http://arxiv.org/abs/2007.12569

Naderi N, Knafou J, Copara J, Ruch P, Teodoro D. Ensemble of deep masked language models for effective named entity recognition in Health and Life Science Corpora. Front Res Metr Anal. 2021;6. https://doi.org/10.3389/frma.2021.689803

Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: a robustly optimized BERT pretraining approach. ArXiv190711692 Cs. 2019 Jul 26 [cited 2020 Apr 30]; Available from: http://arxiv.org/abs/1907.11692

Müller M, Salathé M, Kummervold PE. COVID-Twitter-BERT: a natural language processing model to analyse COVID-19 content on Twitter. arXiv; 2020 [cited 2022 Jun 29]. Available from: http://arxiv.org/abs/2005.07503

Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40.

Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc. 2022;3(1):1–23.

Gage P. A new algorithm for data compression. C Users J. 1994;12(2).

Schuster M, Nakajima K. Japanese and Korean voice search. In: International Conference on Acoustics, Speech and Signal Processing. 2012. p. 5149–52.

Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units [Internet]. arXiv; 2016 [cited 2022 Jun 29]. Available from: http://arxiv.org/abs/1508.07909

Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, et al. Google’s Neural Machine Translation system: bridging the gap between human and machine translation. arXiv; 2016 [cited 2022 Jun 29]. Available from: http://arxiv.org/abs/1609.08144

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. ArXiv170603762 Cs. 2017 Dec 5 [cited 2020 Feb 8]; Available from: http://arxiv.org/abs/1706.03762

Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning. PMLR; 2017 [cited 2022 Jun 29]. p. 3319–28. Available from: https://proceedings.mlr.press/v70/sundararajan17a.html

Captum · model interpretability for PyTorch. [cited 2022 Jun 29]. Available from: https://captum.ai/

McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947;12(2):153–7.

Wilbur WJ, Rzhetsky A, Shatkay H. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics. 2006;7:356.


Acknowledgements

Lucia Araujo-Chaveron, Ingrid Arevalo-Rodriguez, Muge Cevik, Agustín Ciapponi, Muhammad Irfanul Alam, Kaspar Meili, Eric A. Meyerowitz, Nirmala Prajapati, Xueting Qiu, Aaron Richterman, William Gildardo Robles-Rodríguez, Shabnam Thapa, and Ivan Zhelyazkov annotated records in the COVID-19 Open Access Project living evidence database.

Open access funding provided by University of Geneva. This project has been supported by CINECA (UE H2020 Grant No. 825775 and Canadian Institute of Health Research (CIHR) Grant No. 404896), Innosuisse project funding number 41013.1 IP-ICT, Swiss National Science Foundation (project number 176233), and European Union Horizon 2020 research and innovation program — project EpiPose (grant agreement number 101003688).

Author information

Authors and affiliations.

University of Applied Sciences and Arts of Western Switzerland (HES-SO), Rue de la Tambourine 17, 1227, Geneva, Switzerland

Julien Knafou, Nikolay Borissov & Douglas Teodoro

Risklick AG, Bern, Switzerland

Quentin Haas & Poorya Amini

CTU Bern, University of Bern, Bern, Switzerland

Nikolay Borissov & Poorya Amini

Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland

Michel Counotte, Nicola Low, Hira Imeri, Aziz Mert Ipekci, Diana Buitrago-Garcia & Leonie Heron

Wageningen Bioveterinary Research, Wageningen University & Research, Wageningen, The Netherlands

Michel Counotte

Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland

Douglas Teodoro


Contributions

JK designed and implemented the models and ran the experiments and analyses. JK, DT, and QH wrote the manuscript draft. NB created the benchmark dataset. DT, PA, and NL conceived the experiments. MC, HI, and LH programmed and maintained the COVID-19 Open Access Project living evidence database. DBG and AMI organized the annotation of study design in the study records. All authors reviewed and approved the manuscript.

Corresponding authors

Correspondence to Julien Knafou or Douglas Teodoro .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Knafou, J., Haas, Q., Borissov, N. et al. Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature. Syst Rev 12 , 94 (2023). https://doi.org/10.1186/s13643-023-02247-9

Download citation

Received : 25 July 2022

Accepted : 24 April 2023

Published : 05 June 2023

DOI : https://doi.org/10.1186/s13643-023-02247-9


  • Living systematic review
  • Literature screening
  • Text classification
  • Language model
  • Deep learning
  • Transfer learning

Systematic Reviews

ISSN: 2046-4053



Deep learning: systematic review, models, challenges, and research directions

  • Open access
  • Published: 07 September 2023
  • Volume 35 , pages 23103–23124, ( 2023 )


  • Tala Talaei Khoei   ORCID: orcid.org/0000-0002-7630-9034 1 ,
  • Hadjar Ould Slimane 1 &
  • Naima Kaabouch 1  


The current development in deep learning is witnessing an exponential transition into automation applications. This automation transition can provide a promising framework for higher performance and lower complexity. This ongoing transition undergoes several rapid changes, resulting in the processing of the data by several studies, while it may lead to time-consuming and costly models. Thus, to address these challenges, several studies have been conducted to investigate deep learning techniques; however, they mostly focused on specific learning approaches, such as supervised deep learning. In addition, these studies did not comprehensively investigate other deep learning techniques, such as deep unsupervised and deep reinforcement learning techniques. Moreover, the majority of these studies neglected to discuss some main methodologies in deep learning, such as transfer learning, federated learning, and online learning. Therefore, motivated by the limitations of the existing studies, this study summarizes the deep learning techniques into supervised, unsupervised, reinforcement, and hybrid learning-based models. In addition, for each category, a brief description of the category and its models is provided. Some of the critical topics in deep learning, namely, transfer, federated, and online learning models, are explored and discussed in detail. Finally, challenges and future directions are outlined to provide wider outlooks for future researchers.


1 Introduction

The main concept of artificial neural networks (ANN) was proposed and introduced as a mathematical model of an artificial neuron in 1943 [ 1 , 2 , 3 ]. In 2006, the concept of deep learning (DL) was proposed as an ANN model with several layers, which has significant learning capacity. In recent years, DL models have seen tremendous progress in addressing and solving challenges, such as anomaly detection, object detection, disease diagnosis, semantic segmentation, social network analysis, and video recommendations [ 4 , 5 , 6 , 7 ].

Several studies have been conducted to discuss and investigate the importance of DL models in different applications, as illustrated in Table 1. For instance, the authors of [8] reviewed supervised, unsupervised, and reinforcement DL-based models. In [9], the authors outlined DL-based models, platforms, applications, and future directions. Another survey [10] provided a comprehensive review of the existing models in the literature across different applications, such as natural language processing, social network analysis, and audio. In that study, the authors reviewed recent advancements in DL applications and elaborated on some of the existing challenges these applications face. In [11], the authors highlighted different DL-based models, such as deep neural networks, convolutional neural networks, recurrent neural networks, and auto-encoders. They also covered their frameworks, benchmarks, and software development requirements. In [12], the authors discussed the main concepts of deep learning and neural networks. They also provided several applications of DL in a variety of areas.

Other studies covered particular challenges of DL models. For instance, the authors of [13] explored the impact of class-imbalanced datasets on the performance of DL models, as well as the strengths and weaknesses of the methods proposed in the literature for handling class-imbalanced data. Another study [14] explored the challenges that DL faces in data mining, big data, and information processing due to the huge volume, velocity, and variety of the data. In [15], the authors analyzed the complexity of DL-based models and provided a review of the existing studies on this topic. In [16], the authors focused on the activation functions of DL. They introduced these functions as a strategy in DL to transform nonlinearly separable input into more linearly separable data by applying a hierarchy of layers, and they presented the most common activation functions and their characteristics.

In [17], the authors outlined the applications of DL in cybersecurity. They provided a comprehensive literature review of DL models in this field and discussed different types of DL models, such as convolutional neural networks, auto-encoders, and generative adversarial networks. They also covered DL applications for different attack categories, such as malware, spam, insider threats, network intrusions, false data injection, and other malicious activity. In another study [18], the authors focused on detecting tiny objects using DL and analyzed the performance of different DL models in detecting these objects. In [19], the authors reviewed DL models in building and construction industry applications and discussed several important key factors of using DL models in manufacturing and construction, such as progress monitoring and automation systems. Another study [20] focused on using different strategies in the domain of artificial intelligence (AI), including DL, in smart grids. In that study, the authors introduced the main AI applications in smart grids while exploring different DL models in depth. In [7], the authors discussed the current progress of DL in medical areas and gave clear definitions of DL models and their theoretical concepts and architectures. In [21], the authors analyzed DL applications in the biology, medicine, and engineering domains. They also provided an overview of this field of study and major DL applications and illustrated the main characteristics of several frameworks, including molecular shuttles.

Although existing surveys in the field of DL focus on providing a comprehensive overview of these techniques in different domains, the increasing number of applications and the limitations of the current studies motivated us to investigate this topic in depth. In general, recent studies in the literature mostly discuss specific learning strategies, such as supervised models, without covering the other learning strategies and comparing them with each other. In addition, the majority of the existing surveys exclude newer strategies, such as online learning or federated learning. Moreover, these surveys mostly explore specific applications of DL, such as the Internet of Things, smart grids, or construction, whereas this field of study requires formulation and generalization across different applications. In fact, limited information, discussion, and investigation in this domain may hinder development and progress in DL-based applications. To fill these gaps, this paper provides a comprehensive survey of four types of DL models, namely, supervised, unsupervised, reinforcement, and hybrid learning. It also presents the major DL models in each category and describes the main learning strategies, such as online, transfer, and federated learning. Finally, a detailed discussion of future directions and challenges is provided to support future studies. In short, the main contributions of this paper are as follows:

  • Classification and in-depth description of supervised, unsupervised, reinforcement, and hybrid models,
  • Description and discussion of learning strategies, such as online, federated, and transfer learning,
  • Comparison of the different classes of learning strategies, along with their advantages and disadvantages,
  • Current challenges and future directions in the domain of deep learning.

The remainder of this paper is organized as follows: Sect. 2 provides descriptions of the supervised, unsupervised, reinforcement, and hybrid learning models, along with a brief description of the models in each category. Sect. 3 presents the evaluation metrics commonly used to assess DL models. Sect. 4 highlights the main learning approaches that are used in deep learning. Sect. 5 discusses the challenges and future directions in the field of deep learning. The conclusion is summarized in Sect. 6.

2 Categories of deep learning models

DL models can be classified into four categories, namely, deep supervised, unsupervised, reinforcement learning, and hybrid models. Figure  1 depicts the main categories of DL along with examples of models in each category. In the following, short descriptions of these categories are provided. In addition, Table 2 provides the most common techniques in every category.

Figure 1: Schematic review of the models in deep learning

2.1 Deep supervised learning

Deep supervised learning-based models are one of the main categories of deep learning models and are trained on labeled datasets. These models measure their error through a loss function and adjust the weights until the error has been sufficiently minimized. Within the supervised deep learning category, three important model families can be identified, namely, deep neural networks, convolutional neural networks, and recurrent neural network-based models, as illustrated in Fig. 2. Artificial neural networks (ANN), also known as neural networks or neural nets, are computing systems inspired by biological neural networks. ANN models are collections of connected nodes (artificial neurons) that model the neurons in a biological brain. One of the simplest ANN models is the deep neural network (DNN) [22, 23, 24, 25, 26, 27, 28, 29]. DNN models consist of a hierarchical architecture with input, output, and hidden layers, each of which contains nonlinear information processing units, as illustrated in Fig. 2A. Built on the architecture of neural networks, a DNN represents functions of increasing complexity as the number of layers and of units per layer grows. Some known instances of DNN models, as highlighted in Table 2, are the multi-layer perceptron, shallow neural network, operational neural network, self-operational neural network, and iterative residual blocks neural network.
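To make the DNN description above concrete, the following is a minimal sketch of a multi-layer perceptron and one supervised training step in PyTorch; the framework choice, layer sizes, activations, and optimizer are illustrative assumptions rather than details taken from the surveyed works.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """A small multi-layer perceptron: input, two hidden layers, output."""
    def __init__(self, n_inputs: int, n_hidden: int, n_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),   # input -> hidden layer 1
            nn.ReLU(),                       # nonlinear information processing unit
            nn.Linear(n_hidden, n_hidden),   # hidden layer 2
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),  # hidden -> output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Supervised training step: measure the loss on labeled data and adjust the weights.
model = MLP(n_inputs=20, n_hidden=64, n_classes=3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 20)            # a batch of 8 labeled examples
y = torch.randint(0, 3, (8,))     # their class labels
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()                   # backpropagate the error
optimizer.step()                  # update weights to reduce the loss
```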

Figure 2: Inner architecture of deep supervised models

The second type of deep supervised model is the convolutional neural network (CNN), one of the important DL models, which captures the semantic correlations of underlying spatial features among slice-wise representations through convolution operations on multi-dimensional data [25]. A simple architecture of CNN-based models is shown in Fig. 2B. In these models, the feature mapping uses k filters that are partitioned spatially into several channels. The pooling function shrinks the width and height of the feature maps, while the convolutional layer applies a filter to an input to generate a feature map that summarizes the detected features. The convolutional layers are followed by one or more fully connected layers connected to all the neurons of the previous layer. A CNN typically analyzes hidden patterns by using pooling layers for scaling, sharing weights to reduce memory, and filtering the semantic correlations captured by the convolutional operations. CNN architectures therefore have strong potential for spatial features; however, CNN models suffer from an inability to capture certain kinds of features. Some known examples of this network are presented in Table 2 [7, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47].
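The convolution-pooling-fully-connected pipeline described above can be sketched as follows, again assuming PyTorch; the filter counts and the 28×28 single-channel input size are illustrative placeholders.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Convolution -> pooling -> fully connected classifier (sizes are illustrative)."""
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 16 filters -> 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling shrinks width and height
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)  # fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                 # weight-shared convolutional filters
        return self.classifier(x.flatten(1))

logits = SmallCNN()(torch.randn(4, 1, 28, 28))  # e.g. a batch of 28x28 grayscale images
```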

The other type of supervised DL is the recurrent neural network (RNN), which is designed for sequential, time-series data in which the output is fed back to the input, as shown in Fig. 2C [27]. RNN-based models are widely used to memorize previous inputs and handle sequential data [42]. In RNN models, the recursive process uses hidden layers with loops that carry information about previous states. In traditional neural networks, the given inputs and outputs are independent of one another, whereas the recurrent layers of an RNN have a memory that retains what has been computed so far [48]. In an RNN, the same parameters are applied to every input to construct the network and estimate the outputs. The key principle of RNN-based models is to model sequences of time samples, so that specific patterns can be predicted to depend on previous ones [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]. Table 2 lists instances of RNN-based models, such as the simple recurrent neural network, long short-term memory, gated recurrent unit neural network, bidirectional gated recurrent unit neural network, bidirectional long short-term memory, and residual gated recurrent neural network [64, 65, 66]. Table 3 shows the advantages and disadvantages of supervised DL models.
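A minimal sketch of an RNN-based sequence classifier, here using an LSTM layer whose hidden state carries the memory of previous time steps; PyTorch is assumed and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """An LSTM keeps a hidden state (memory) across the time steps of a sequence."""
    def __init__(self, n_features: int, n_hidden: int, n_classes: int):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True)
        self.head = nn.Linear(n_hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time steps, features); the final hidden state summarises the sequence
        _, (h_last, _) = self.lstm(x)
        return self.head(h_last[-1])

model = SequenceClassifier(n_features=6, n_hidden=32, n_classes=2)
out = model(torch.randn(4, 50, 6))   # 4 time series of 50 steps each
```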

2.2 Deep unsupervised learning

Deep unsupervised models have gained significant interest as a mainstream of viable deep learning models. These models are widely used to build systems that can be trained from unlabeled samples [24]. The models can be classified into auto-encoders, restricted Boltzmann machines, deep belief neural networks, and generative adversarial networks. An auto-encoder (AE) is a type of auto-associative feed-forward neural network that can learn effective representations from the given input in an unsupervised manner [29]. Figure 3A shows the basic architecture of an AE, which has three elements: an encoder, a latent space, and a decoder. The input first passes through the encoder, which is usually a fully connected ANN that produces the code. The decoder, whose architecture is similar to that of the encoder, then generates the output from this code. The aim of the encoder-decoder pair is to reproduce the input at the output, so the dimensionality of the input and output has to match. Real-world data usually suffer from redundancy and high dimensionality, which lowers computational efficiency and hinders modeling of the representation; the latent space addresses this issue by holding a compressed representation of the data, learning its features, and facilitating the discovery of patterns. As shown in Table 2, the AE family comprises several known models, namely, stacked, variational, and convolutional AEs [30, 43]. The advantages and disadvantages of these models are presented in Table 4.
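The encoder, latent space, and decoder structure of an AE can be sketched as follows; note that the input and output dimensionality match and that training needs no labels. PyTorch is assumed and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Encoder -> latent space -> decoder; input and output dimensionality must match."""
    def __init__(self, n_inputs: int = 784, n_latent: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(),
                                     nn.Linear(128, n_latent))    # compressed code
        self.decoder = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(),
                                     nn.Linear(128, n_inputs))    # reconstruction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Unsupervised training: the target is the input itself, so no labels are needed.
model = AutoEncoder()
x = torch.rand(16, 784)
loss = nn.MSELoss()(model(x), x)
loss.backward()
```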

Figure 3: Inner architecture of deep unsupervised models

The restricted Boltzmann machine (RBM), which is based on the Gibbs (Boltzmann) distribution, is a network of neurons that are connected to each other, as shown in Fig. 3B. The RBM network consists of two layers, namely, the input or visible layer and the hidden layer; there is no output layer. Boltzmann machines are stochastic, generative neural networks that can solve combinatorial problems. Some common RBMs are presented in Table 2, such as shallow restricted Boltzmann machines and convolutional restricted Boltzmann machines. The deep belief network (DBN) is another unsupervised deep neural network that operates much like a deep feed-forward neural network, with inputs and multiple computational layers known as hidden layers, as illustrated in Fig. 3C. Training a DBN involves two main phases, a pre-training phase and a fine-tuning phase: pre-training builds up the stack of hidden layers, whereas fine-tuning treats the network as a feed-forward neural network to train and classify the data. In addition, a DBN has multiple layers of units, with connections between the layers but not between the units within a layer [31]. Table 2 reviews some known DBN models, namely, shallow deep belief neural networks and conditional deep belief neural networks [44, 45].

The generative adversarial network (GAN) is another type of unsupervised deep learning model that uses a generator network (GN) and a discriminator network (DN) to generate synthetic data that follow a distribution similar to that of the original data, as presented in Fig. 3D. The GN mimics the distribution of the given data from noise vectors, trying to fool the DN in its attempt to classify samples as fake or real. The DN, in turn, is trained to differentiate the fake samples produced by the GN from the original, real samples. In general, the GN learns to create plausible data, whereas the DN learns to identify the generator's fake data from the real ones and penalizes the generator for producing implausible data [32, 54]. Known types of GAN are presented in Table 2, such as generative adversarial networks, signal augmented self-taught learning, and Wasserstein generative adversarial networks. To conclude this discussion, Table 4 provides the main advantages and disadvantages of the unsupervised DL categories [56].
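A minimal sketch of one GAN training step, with the discriminator learning to separate real from generated samples and the generator learning to fool it; PyTorch is assumed, and the network sizes, data dimensions, and learning rates are illustrative.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))  # generator
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))           # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, data_dim)      # a batch of real samples
noise = torch.randn(32, latent_dim)   # noise vectors fed to the generator

# Discriminator step: learn to separate real samples from generated (fake) ones.
fake = G(noise).detach()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: the generator is penalised when the discriminator spots its fakes.
g_loss = bce(D(G(noise)), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```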

2.3 Deep reinforcement learning

Reinforcement learning (RL) is the science of learning the optimal behavior in an environment, through decision making, so as to maximize reward. The optimal behavior is acquired through interactions with the environment: the agent makes decisions, monitors the results, and adjusts its strategy to approach the optimal policy [75, 76]. In particular, RL is applied to help an agent learn the optimal policy when the agent has no prior information about the surrounding environment. Initially, the agent observes the current state, takes an action, and receives a reward together with its new state. The immediate reward and the new state are then used to adjust the agent's policy, and this process is repeated until the agent's policy approaches the optimal policy. Notably, RL does not need a detailed mathematical model of the system to reach optimal control [77]; instead, the agent treats the target system as the environment and optimizes its control policy by interacting with it. At every step, the agent selects an action based on its current policy, and the environment feeds back a reward and moves to the next state [78, 79, 80]. Through this process the agent adjusts its policy by relating states, actions, and rewards, and it can eventually determine an optimal policy that maximizes the cumulative reward. The RL problem can be modeled as a Markov decision process (MDP) [78]; when the state and action spaces are finite, the process is known as a finite MDP. However, the RL learning approach may take a huge amount of time to reach the best policy and discover the knowledge of a whole system; hence, plain RL is often inappropriate for large-scale networks [81].

In the past few years, deep reinforcement learning (DRL) has been proposed as an advanced form of RL in which DL is applied as an effective tool to enhance the learning rate of RL models. The experiences gathered during the real-time learning process are stored and then used as data for training and validating the neural networks [82]. The trained neural network then assists the agent in making optimal decisions in real-time scenarios. DRL overcomes the main shortcomings of RL, such as the long processing time needed to reach an optimal policy, thus opening a new horizon for its adoption [83]. In general, as shown in Fig. 4, DRL uses the characteristics of deep neural networks to train the learning process, thereby increasing the speed and improving the performance of the algorithms. In DRL, within the environment-agent interactions, the deep neural network holds the internal policy of the agent, which indicates the next action to take according to the current state of the environment.

Figure 4: Inner architecture of deep reinforcement learning

DRL can be divided into three method families: value-based, policy-based, and model-based methods. Value-based DRL mainly represents and finds the value functions and their optima. In such methods, the agent learns the state or state-action value and acts according to the best action in each state; exploring the environment is a necessary step. Some known instances of value-based DRL are deep Q-learning, double deep Q-learning, and dueling deep Q-learning [83, 84, 85]. Policy-based DRL, in contrast, finds an optimal policy, stochastic or deterministic, which converges better in high-dimensional or continuous action spaces. These methods are essentially optimization techniques that search for the policy maximizing the objective function. Some examples of policy-based DRL are the deep deterministic policy gradient and the asynchronous advantage actor-critic [86]. The third category, model-based DRL, aims at learning the functionality and dynamics of the environment from previous observations and then planning a solution using that model. When a model is available, these methods can find the best policy efficiently, although the process may fail when the state space is huge; the model is frequently updated and the process replanned. Instances of model-based DRL are imagination-augmented agents, model-based priors for model-free methods, and model-based value expansion. Table 5 illustrates the important advantages and disadvantages of these categories [87, 88, 89].
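As an illustration of the value-based family, the following sketches a single deep Q-learning update from one observed transition; the temporal-difference target combines the immediate reward with the discounted best next-state value. PyTorch is assumed, and the state/action dimensions and the transition itself are placeholders rather than output of a real environment.

```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One transition (state, action, reward, next_state) observed by the agent.
state = torch.randn(1, n_states)
action = torch.tensor([0])
reward = torch.tensor([1.0])
next_state = torch.randn(1, n_states)

# Temporal-difference target: immediate reward plus discounted best future value.
with torch.no_grad():
    target = reward + gamma * q_net(next_state).max(dim=1).values

q_value = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_value, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```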

2.4 Hybrid deep learning

Deep learning models have strengths and weaknesses, for instance in terms of hyperparameter tuning and data exploration [45], and these weaknesses can prevent a single model from performing well across different applications. Every DL model also has characteristics that make it efficient for specific applications; hence, hybrid DL models, built from individual DL models, have been proposed to overcome these shortcomings for specific applications [79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]. Figure 5 shows the popular hybrid DL models used in the literature. It can be observed that convolutional neural networks and recurrent neural networks are widely used in existing studies and have high applicability and potential compared to other DL models.

Figure 5: Review of popular hybrid models

3 Evaluation metrics

In any classification task, metrics are required to evaluate DL models. It is worth mentioning that different metrics are used in different fields of study; the metrics used in medical analysis, for example, often differ from those used in other domains, such as cybersecurity or computer vision. For this reason, we provide short descriptions and the mathematical definitions of the most common metrics across domains, as follows:

Accuracy: It is mainly used in classification problems to indicate the proportion of correct predictions made by a DL model. It is calculated as shown in Eq. (1), where T_P is the number of true positives, T_N the true negatives, F_P the false positives, and F_N the false negatives:

Accuracy = (T_P + T_N) / (T_P + T_N + F_P + F_N)    (1)

Precision: It is the number of true positives divided by the total number of positive predictions, i.e., true positives plus false positives:

Precision = T_P / (T_P + F_P)    (2)

Recall (detection rate): It measures the number of positive samples that are classified correctly relative to the total number of positive samples, indicating the model's ability to recognize positive samples among the others:

Recall = T_P / (T_P + F_N)    (3)

F1-Score: It is calculated from the precision (Eq. (2)) and the recall (Eq. (3)) as their harmonic mean:

F1 = 2 · Precision · Recall / (Precision + Recall)    (4)

Area under the receiver operating characteristic curve (AUC): AUC is one of the important metrics in classification problems. The receiver operating characteristic (ROC) curve visualizes the tradeoff between sensitivity and specificity by plotting the true-positive rate (TPR) against the false-positive rate (FPR); a good DL model has an AUC value close to 1. The AUC is the area under this curve:

AUC = ∫₀¹ TPR d(FPR)    (5)

False alarm rate: Also known as the false-positive rate, it is the probability that a false alarm is raised, i.e., that a positive result is given when the true value is negative:

FAR = F_P / (F_P + T_N)    (6)

Misdetection rate: It is the percentage of samples that are not detected, i.e., positive samples misclassified as negative:

MDR = F_N / (F_N + T_P)    (7)
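The metrics above can all be computed from the four confusion-matrix counts; the following is a small self-contained example, assuming NumPy and using toy labels purely for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Binary-classification counts: true/false positives and negatives."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
accuracy     = (tp + tn) / (tp + tn + fp + fn)          # Eq. (1)
precision    = tp / (tp + fp)                           # Eq. (2)
recall       = tp / (tp + fn)                           # Eq. (3), detection rate
f1_score     = 2 * precision * recall / (precision + recall)  # Eq. (4)
false_alarm  = fp / (fp + tn)                           # Eq. (6), false-positive rate
misdetection = fn / (fn + tp)                           # Eq. (7), missed positives
```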

4 Learning classification in deep learning models

Learning strategies, as shown in Fig.  6 , include online learning, transfer learning, and federated learning. In this section, these learning strategies are discussed in brief.

Figure 6: Review of learning classification in deep learning models

4.1 Online learning

Conventional machine learning models mostly employ batch learning, in which a collection of training data is provided to the model in advance. This learning method requires the whole training dataset to be accessible before training, which leads to high memory usage and poor scalability. Online learning, in contrast, is a machine learning setting in which data are processed in sequential order and the model is updated accordingly [90]. The purpose of online learning is to maximize the accuracy of the prediction model using the ground truth of previous predictions [91]. Unlike batch or offline machine learning approaches, which require the complete training dataset to be available for training [92], online learning models use a sequential stream of data and update their parameters after each data instance. Online learning is most appropriate when the entire dataset is unavailable or the environment is changing dynamically [92, 93, 94, 95, 96]. Batch learning, on the other hand, is easier to maintain and less complex, but it requires all the data to be available for training and does not update its model afterwards. Table 6 shows the advantages and disadvantages of batch learning and online learning.

An online model aims to learn a hypothesis H: X → Y, where X is the input space and Y is the output space. At each time step t, a new data instance x_t ∈ X is received, and a prediction ŷ_t is generated using the mapping H(x_t, w_t) = ŷ_t, where w_t is the weight vector of the online model at time step t. The true class label y_t is then used to calculate the loss and update the model weights to w_{t+1}, as illustrated in Fig. 7 [97].

Figure 7: Online machine learning process

The number of mistakes committed by the online model across T time steps, i.e., the number of steps for which ŷ_t ≠ y_t, is denoted M_T [55]. The goal of an online learning model is to minimize its regret, that is, the total loss of the online model compared to the best fixed model in hindsight [35]:

R_T = Σ_{t=1}^{T} ℓ(H(x_t, w_t), y_t) − min_w Σ_{t=1}^{T} ℓ(H(x_t, w), y_t)

where the first term is the sum of the loss function at each time step t, and the second term is the loss of the best model chosen after seeing all the instances [98, 99]. While training the online model, different approaches can be adopted regarding the data that the model has already been trained on: full memory, in which the model preserves all training data instances; partial memory, in which the model retains only some of the training data instances; and no memory, in which it remembers none of them. Two main techniques are used to remove training data instances: passive forgetting and active forgetting [107, 108, 109].

Passive forgetting only considers the amount of time that has passed since the training data instances were received by the model, which implies that the significance of data diminishes over time.

Active forgetting , on the other hand, requires additional information from the utilized training data in order to determine which objects to remove. The density-based forgetting and error-based forgetting are two active forgetting techniques.

Online learning techniques can be classified into three categories: online learning with full feedback, online learning with partial feedback, and online learning with no feedback. In online learning with full feedback, every training data instance x_t has a corresponding true label y_t that is always disclosed to the model at the end of each online learning round. In online learning with partial feedback, the model only receives partial feedback indicating whether the prediction is correct or not, rather than the true label explicitly. In this category, the online learning model has to make online updates while maintaining a balance between the exploitation of revealed knowledge and the exploration of unknown information in the environment [2]. Finally, in online learning with no feedback, only the training data are fed to the model, without ground truth or feedback; this category includes online clustering and dimension reduction [99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111].
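A minimal sketch of full-feedback online learning: a logistic model processes one (x_t, y_t) pair at a time, predicts, counts its mistakes M_T, and updates its weights after every instance. NumPy is assumed and the data stream is synthetic, purely for illustration.

```python
import numpy as np

def online_logistic_regression(stream, n_features, lr=0.1):
    """Process (x_t, y_t) pairs one at a time and update the weights after each instance."""
    w = np.zeros(n_features)
    mistakes = 0
    for x_t, y_t in stream:                      # data arrive in sequential order
        p_t = 1.0 / (1.0 + np.exp(-w @ x_t))     # prediction from the current model
        y_hat = int(p_t >= 0.5)
        mistakes += int(y_hat != y_t)            # M_T, the cumulative mistake count
        w -= lr * (p_t - y_t) * x_t              # full-feedback update using the true label
    return w, mistakes

rng = np.random.default_rng(0)
stream = [(rng.normal(size=5), int(rng.integers(0, 2))) for _ in range(100)]
weights, n_mistakes = online_logistic_regression(stream, n_features=5)
```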

4.2 Deep transfer learning

Training deep learning models from scratch requires extensive computational and memory resources and large amounts of labeled data. However, in some scenarios, huge annotated datasets are not available, and developing such datasets takes a great deal of time and is a costly operation. Transfer learning (TL) has been proposed as an alternative for training deep learning models [112]. In TL, the knowledge obtained in one domain can be transferred to another target classification problem. TL saves computing resources and increases efficiency when training new deep learning models, and it also makes it possible to train deep learning models on available annotated datasets before applying them to unlabeled data [113, 114]. Figure 8 gives a simple visualization of deep transfer learning, which transfers valuable knowledge by reusing the learning ability of neural networks.

Figure 8: Visualization of deep transfer learning

In this survey, the deep transfer learning techniques are classified based on the generalization viewpoints between deep learning models and domains into four categories, namely, instance, feature representation, model parameter, and relational knowledge-based techniques. In the following, we briefly discuss these categories with their categorizations, as illustrated in Fig.  9 .

Figure 9: Categories of deep transfer learning

4.2.1 Instance-based

Instance-based TL techniques are performed by selecting instances or by assigning different weights to instances. In such techniques, TL aims at training a more accurate model under a transfer scenario in which the difference between a source and a target comes from different marginal probability distributions or conditional probability distributions [62]. Instance-based TL addresses the setting in which the labeled samples available to train a classification model in the target domain are limited. Directly merging the source data into the target data can decrease the target model performance and cause negative transfer during training [109, 110, 111]. The main goal of instance-based TL is therefore to single out the instances in the source domains that have a positive impact on training the target model, augmenting the target data through particular weighting techniques. A viable solution is to learn the weights of the source-domain instances automatically through an objective function of the form

min Σ_i W_i C_i^s + ϑ*

where W_i is the weighting coefficient of the given source instance, C_i^s represents the risk function of the selected source instance, and ϑ* is a second risk function related to the target task or the parameter regularization.

The weighting coefficient of the given source instance can be computed as the ratio of the marginal probability distribution between source and target domains. Instance-based TL can be categorized into two subcategories, weight estimation and heuristic re-weighting-based techniques [ 63 ]. A weight estimation method can focus on scenarios in which there are limited labeled instances in the target domain, converting the instance transfer problem into the weight estimation problem using kernel embedding techniques. In contrast, a heuristic re-weighting technique is more effective for developing deep TL tasks that have labeled instances and are available in the target domains [ 64 ]. This technique aims at detecting negative source instances by applying instance re-weighting approaches in a heuristic manner. One of the known instance re-weighting approaches is the transfer adaptive boosting algorithm, in which the weights of the source and target instances are updated via several iterations [ 116 ].

4.2.2 Feature representation-based

Feature representation-based TL models share or learn a common feature representation between a target and a source domain. This category uses models that transfer knowledge by learning similar representations at the feature-space level. Its main aim is to learn a mapping function that acts as a bridge to transfer raw data in the source and target domains from their various feature spaces into a latent feature space [109]. From a general perspective, feature representation-based TL covers two transfer styles, with or without adaptation to the target domain [110]. Techniques without adaptation extract representations that are used directly as inputs for the target models, whereas techniques with adaptation extract feature representations across various domains via domain adaptation techniques [112]. In general, techniques that adapt to the target domain are harder to implement, but they rely on weaker assumptions that can be justified in most cases; conversely, techniques without adaptation to the target domain are easier to implement, but their assumptions can be strong in different scenarios [111].

One important challenge in feature representation-based TL with domain adaptation is estimating the representation invariance between the source and target domains. There are three families of techniques to build representation invariance: discrepancy-based, adversarial-based, and reconstruction-based. Discrepancy-based techniques improve the learning of transferable representations by decreasing the discrepancy, measured through distance metrics, between a given source and target, while adversarial-based techniques are inspired by GANs and give the neural network the ability to learn domain-invariant representations. In reconstruction-based techniques, auto-encoder neural networks are combined with task-specific classifiers to optimize a shared encoder that takes domain-specific representations and learns representations across different domains [113].

4.2.3 Model parameter-based

Model parameter-based TL shares the neural network architecture and parameters between the target and source domains, conveying the assumptions that the source and target domains have in common. In such techniques, the transferable knowledge is embedded in a pre-trained source model, whose architecture and some of whose parameters are reused in the target model [99]. The aim of this process is to use part of the model pre-trained in the source domain to improve the learning process in the target domain. These techniques rely on the assumption that labeled instances in the target domain are available during the training of the target model [99, 100, 101, 102, 103]. Model parameter-based TL is divided into two categories, sequential and joint training. In sequential training, the target deep model is established by pre-training a model on an auxiliary domain, whereas joint training develops the source and target tasks at the same time. There are two methods to perform joint training [104]. The first is hard parameter sharing, which shares the hidden layers directly while keeping the task-specific layers independent [99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118]. The second is soft parameter sharing, which changes the weight coefficients of the source and target tasks and adds regularization to the risk function. Table 7 shows the advantages and disadvantages of the three categories: instance-based, feature representation-based, and model parameter-based.
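A common realization of model parameter-based transfer (sequential training) is to reuse a pre-trained source network, freeze the transferred parameters, and train only a new task-specific head on the target data. The sketch below assumes that torchvision and its ImageNet-pretrained ResNet-18 weights are available, and the 5-class target task is a placeholder.

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse the architecture and weights of a pre-trained source model.
backbone = models.resnet18(weights="IMAGENET1K_V1")   # assumed available for download

for p in backbone.parameters():        # keep the transferred parameters fixed...
    p.requires_grad = False

backbone.fc = nn.Linear(backbone.fc.in_features, 5)   # ...and learn only the new target-task head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 3, 224, 224)        # a small batch from the target domain
y = torch.randint(0, 5, (4,))
loss = loss_fn(backbone(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```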

4.3 Deep federated learning

In traditional centralized DL, the data collected on local devices, such as personal computers, have to be gathered and stored on a central server and used there for training and testing purposes, as illustrated in Fig. 10A [74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87]. This process suffers from several shortcomings, such as high computational demands and limited security and privacy. In such models, the efficiency and accuracy of the models depend heavily on the computational power of the central server and on the training process applied to the collected data. As a result, centralized DL models not only offer low privacy and high risks of data leakage but also place high demands on the storage and computing capacities of the machines that train the models in parallel. Federated learning (FL) was therefore proposed as an emerging technology to address such challenges [104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119].

Figure 10: Centralized and federated learning process flow

FL preserves users' privacy by keeping the data decentralized on the devices rather than on a central server, while still enabling artificial intelligence (AI) methods to learn from those data. Figure 10B summarizes the main process in an FL model. In particular, coping with the unavailability of sufficient centralized data, reducing the required computational power, and improving the limited privacy of local data are three major benefits of federated AI over centralized AI [115, 116, 117, 118, 119]. To this end, FL models aim at training a global model on data distributed over several devices while protecting those data. FL finds an optimal global model, denoted θ, that minimizes the aggregation of the local loss functions f_k(θ^k), as shown in Eq. (10):

min_θ f(θ) = Σ_{k=1}^{C·K} (n_k / n) f_k(θ^k),  with  f_k(θ^k) = (1/n_k) Σ_{i=1}^{n_k} l(X_i, y_i; θ^k)  and  n = Σ_{k=1}^{C·K} n_k    (10)

where X denotes the data features, y the data labels, n_k the local data size, C the ratio of local clients that participate in each round of model updates, l the loss function, k the client index, K the number of clients, and Σ_{k=1}^{C·K} n_k the total number of sample pairs. FL can be classified, based on the characteristics of the data distribution among the clients, into two types, namely, horizontal and vertical FL models, as discussed in the following:

4.3.1 Horizontal federated learning

Horizontal FL, also known as homogeneous FL, covers the cases in which the training data of the participating clients share a similar feature space but have different sample spaces [76]. For example, client one and client two hold data rows with the same features, while each row corresponds to a different user. A typical algorithm, federated averaging (FedAvg), is commonly used for horizontal FL. FedAvg is one of the most efficient algorithms for training on data distributed over multiple clients: the clients keep their data local to protect their privacy, while a central server is used to exchange model parameters between the different clients [69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122].
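A minimal sketch of one FedAvg communication round for horizontal FL: each client trains locally on its own data, and the server averages the returned weights in proportion to the clients' local data sizes. PyTorch is assumed, and the two-client toy data and model are synthetic placeholders.

```python
import copy
import torch
import torch.nn as nn

def local_update(model, data, epochs=1, lr=0.01):
    """Each client trains on its own data; only the resulting weights leave the device."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict(), sum(len(y) for _, y in data)

def fed_avg(global_model, client_datasets):
    """One communication round: average client weights, weighted by local data size."""
    updates = [local_update(global_model, d) for d in client_datasets]
    total = sum(n for _, n in updates)
    new_state = {k: sum(sd[k].float() * (n / total) for sd, n in updates)
                 for k in updates[0][0]}
    global_model.load_state_dict(new_state)
    return global_model

# Toy setup: two clients with the same feature space but different samples (horizontal FL).
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
clients = [[(torch.randn(8, 10), torch.randint(0, 2, (8,)))] for _ in range(2)]
model = fed_avg(model, clients)
```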

In addition, horizontal FL provides efficient protection against leaking private local data: only the global and local model parameters are exchanged between the server and the clients, while all the training data remain stored on the client devices without being accessed by any other party [14, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133]. Despite these advantages, the constant downloading and uploading in horizontal FL may consume huge amounts of communication resources, and the situation worsens for deep learning models because of their large computation and memory requirements. To address these issues, several studies have proposed methods to improve the efficiency of horizontal FL and reduce its communication costs using multi-objective evolutionary algorithms, model quantization, and sub-sampling techniques [134]. In these studies, however, even though no private data can be accessed directly by any third party, the uploaded model parameters or gradients may still leak information about each client's data [135].

4.3.2 Vertical federated learning

Vertical FL, also known as heterogeneous FL, is a type of FL in which the users' training data share the same sample space but have different feature spaces. For example, client one and client two hold the same data samples described by different feature spaces, each client keeps its own local data, and it is usually assumed that only one client holds the data labels. Clients with data labels are known as guest or active parties, and clients without labels are known as host parties [136]. In vertical FL, the data shared between otherwise unrelated domains are mainly used to train global DL models [137]. In this context, participants may rely on an intermediate third party to provide the encryption logic that keeps the data protected. Although using a third party is not strictly necessary, studies have demonstrated that vertical FL models that rely on third parties and encryption techniques provide more acceptable results [14, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138].

In contrast with horizontal FL, training parametric models in vertical FL has two benefits. Firstly, trained models in vertical FL have a similar performance as centralized models. As a matter of fact, the computed loss function in vertical FL is the same as the loss function in centralized models. Secondly, vertical FL often consumes fewer communication resources compared to horizontal FL [ 138 ]. Vertical FL only consumes more communication resources than horizontal FL if and only if the data size is huge. In vertical FL, privacy preservation is the main challenge. For this purpose, several studies have been conducted to investigate privacy preservation in vertical FL, using identity resolution schemes, protocols, and vertical decision learning schemes. Although these approaches improve the vertical FL models, there are still some main slight differences between horizontal and vertical FL [ 100 , 101 , 102 , 103 , 104 , 105 , 106 , 107 , 108 , 109 , 110 , 111 , 112 , 113 , 114 , 115 , 116 , 117 , 118 , 119 , 120 , 121 , 122 , 123 , 124 , 125 , 126 , 127 , 128 , 129 , 130 , 131 , 132 , 133 , 134 , 135 , 136 , 137 , 138 , 139 , 140 , 141 , 142 , 143 ].

Horizontal FL includes a server that aggregates the global model, whereas vertical FL does not have a central server or a single global model [14, 122, 123, 124, 125, 126, 127, 128, 129, 130]. Instead, the aggregation of the local models' outputs is performed at the guest client in order to build a proper loss function. Another difference concerns what is exchanged: in horizontal FL, model parameters or gradients are exchanged between the server and the clients, whereas in vertical FL the local model parameters depend on the local data feature spaces and the guest client receives model outputs from the connected host clients [143]. In this process, the intermediate gradient values are sent back to update the local models [105]. Finally, the server and the clients communicate with one another once per communication round in horizontal FL, whereas the guest and host clients have to send and receive data several times within a communication round in vertical FL [14, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]. Table 8 summarizes the main advantages and disadvantages of vertical and horizontal FL and compares these FL categories with centralized learning.

5 Challenges and future directions

Deep learning models, while powerful and versatile, face several significant challenges. Addressing them requires a multidisciplinary approach involving data collection and preprocessing techniques, algorithmic enhancements, fairness-aware model training, interpretability methods, safe learning, models that are robust to adversarial attacks, and collaboration with domain experts and affected communities to push the boundaries of deep learning and realize its full potential. A brief description of each of these challenges is given below.

5.1 Data availability and quality

Deep learning models require large amounts of labeled training data to learn effectively. However, obtaining sufficient, high-quality labeled data can be expensive, time-consuming, or difficult, particularly in specialized domains or when dealing with sensitive data, as in cybersecurity. Although there are several approaches, such as data augmentation, for generating large amounts of data, it can still be cumbersome to produce enough training data to satisfy the requirements of DL models. In addition, a small dataset may lead to overfitting, where DL models perform well on the training data but fail to generalize to unseen data. Balancing model complexity and regularization techniques to avoid overfitting while achieving good generalization is a challenge in deep learning, and exploring techniques to improve data efficiency, such as few-shot learning, active learning, or semi-supervised learning, remains an active area of research.

5.2 Ethics and fairness

The challenge of ethics and fairness in deep learning underscores the critical need to address biases, discrimination, and social implications embedded within these models. Deep learning systems learn patterns from vast and potentially biased datasets, which can perpetuate and amplify societal prejudices, leading to unfair or unjust outcomes. The ethical dilemma lies in the potential for these models to unintentionally marginalize certain groups or reinforce systemic disparities. As deep learning is increasingly integrated into decision-making processes across domains such as hiring, lending, and criminal justice, ensuring fairness and transparency becomes paramount. Striving for ethical deep learning involves not only detecting and mitigating biases but also establishing guidelines and standards that prioritize equitable treatment, encompassing a multidisciplinary effort to foster responsible AI innovation for the betterment of society.

5.3 Interpretability and explainability

Interpretability and explainability of deep learning pose significant challenges in understanding the inner workings of complex models. As deep neural networks become more intricate, with numerous layers and parameters, their decision-making processes often resemble “black boxes,” making it difficult to discern how and why specific predictions are made. This lack of transparency hinders the trust and adoption of these models, especially in high-stakes applications like health care and finance. Striking a balance between model performance and comprehensibility is crucial to ensure that stakeholders, including researchers, regulators, and end-users, can gain meaningful insights into the model's reasoning, enabling informed decisions and accountability while navigating the intricate landscape of modern deep learning.

5.4 Robustness to adversarial attacks

Deep learning models are susceptible to adversarial attacks, a concerning vulnerability that highlights the fragility of their decision boundaries. Adversarial attacks involve making small, carefully crafted perturbations to input data, often imperceptible to humans, which can lead to misclassification or erroneous outputs from the model. These attacks exploit the model's sensitivity to subtle changes in its input space, revealing a lack of robustness in real-world scenarios. Adversarial attacks not only challenge the reliability of deep learning systems in critical applications such as autonomous vehicles and security systems but also underscore the need for developing advanced defense mechanisms and more resilient models that can withstand these intentional manipulations. Therefore, developing robust models that can withstand such attacks and maintaining model security and data is of high importance.

5.5 Catastrophic forgetting

Catastrophic forgetting, or catastrophic interference, is a phenomenon that can occur in online deep learning, where a model forgets or loses previously learned information when it learns new information. This can lead to a degradation in performance on tasks that were previously well-learned as the model adjusts to new data. This catastrophic forgetting is particularly problematic because deep neural networks often have a large number of parameters and complex representations. When a neural network is trained on new data, the optimization process may adjust the weights and connections in a way that erases the knowledge the network had about previous tasks. Therefore, there is a need for models that address this phenomenon.

5.6 Safe learning

Safe deep learning models are designed and trained with a focus on ensuring safety, reliability, and robustness. These models are built to minimize risks associated with uncertainty, hazards, errors, and other potential failures that can arise in the deployment and operation of artificial intelligence systems. DL models without safety and risks considerations in ground or aerial robots can lead to unsafe outcomes, serious damage, and even casualties. The safety properties include estimating risks, dealing with uncertainty in data, and detecting abnormal system behaviors and unforeseen events to ensure safety and avoid catastrophic failures and hazards. The research in this area is still at a very early stage.

5.7 Transfer learning and adaptation

Transfer learning and adaptation present complex challenges in the realm of deep learning. While pretraining models on large datasets can capture valuable features and representations, effectively transferring this knowledge to new tasks or domains requires overcoming hurdles related to differences in data distributions, semantic gaps, and contextual variations. Adapting pre-trained models to specific target tasks demands careful fine-tuning, domain adaptation, or designing novel architectures that can accommodate varying input modalities and semantics. The challenge lies in striking a balance between leveraging the knowledge gained from pretraining and tailoring the model to extract meaningful insights from the new data, ensuring that the transferred representations are both relevant and accurate. Successfully addressing the intricacies of transfer learning and adaptation in deep learning holds the key to unlocking the full potential of AI across diverse applications and domains.

6 Conclusions

In recent years, deep learning has emerged as a prominent data-driven approach across diverse fields. Its significance lies in its capacity to reshape entire industries and tackle complex problems that were once challenging or insurmountable. While numerous surveys have been published on deep learning, its models, and applications, a notable proportion of these surveys has predominantly focused on supervised techniques and their potential use cases. In contrast, there has been a relative lack of emphasis on deep unsupervised and deep reinforcement learning methods. Motivated by these gaps, this survey offers a comprehensive exploration of key learning paradigms, encompassing supervised, unsupervised, reinforcement, and hybrid learning, while also describing prominent models within each category. Furthermore, it delves into cutting-edge facets of deep learning, including transfer learning, online learning, and federated learning. The survey finishes by outlining critical challenges and charting prospective pathways, thereby illuminating forthcoming research trends across diverse domains.

Data availability

Not applicable.

Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C (2018) A survey on deep transfer learning. In: International conference on artificial neural networks, Springer, Berlin; p 270–279

Tang B, Chen Z, Hefferman G, Pei S, Wei T, He H, Yang Q (2017) Incorporating intelligence in fog computing for big data analysis in smart cities. IEEE Trans Ind Informatics 13:2140–2150

Khoei TT, Aissou G, Al Shamaileh K, Devabhaktuni VK, Kaabouch N (2023) Supervised deep learning models for detecting GPS spoofing attacks on unmanned aerial vehicles. In: 2023 IEEE international conference on electro information technology (eIT), Romeoville, IL, USA, pp 340–346. https://doi.org/10.1109/eIT57321.2023.10187274

Nguyen TT, Nguyen QVH, Nguyen DT, Nguyen DT, Huynh-The T, Nahavandi S, Nguyen TT, Pham QV, Nguyen CM (2022) Deep learning for deepfakes creation and detection: a survey. Comput Vis Image Underst 223:103525

Dong S, Wang P, Abbas K (2021) A survey on deep learning and its applications. Comput Scie Rev 40:100379

Ni J, Young T, Pandelea V, Xue F, Cambria E (2022) Recent advances in deep learning based dialogue systems: a systematic survey. Artif Intell Rev 56:1–101

Piccialli F, Di Somma V, Giampaolo F, Cuomo S, Fortino G (2021) A survey on deep learning in medicine: why, how and when? Inf Fus 66:111–137

Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117

Hatcher WG, Yu W (2018) A survey of deep learning: platforms, applications and emerging research trends. IEEE Access 6:24411–24432. https://doi.org/10.1109/ACCESS.2018.2830661

Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP, Shyu ML, Chen SC, Iyengar SS (2018) A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR) 51(5):1–36

Alom MZ et al (2019) A state-of-the-art survey on deep learning theory and architectures. Electronics 8(3):292. https://doi.org/10.3390/electronics8030292

Dargan S, Kumar M, Ayyagari MR, Kumar G (2020) A survey of deep learning and its applications: a new paradigm to machine learning. Arch of Computat Methods Eng 27(4):1071–1092

Johnson JM, Khoshgoftaar TM (2019) Survey on deep learning with class imbalance. J Big Data 6(1):1–54

Zhang Q, Yang LT, Chen Z, Li P (2018) A survey on deep learning for big data. Inf Fus 42:146–157

Hu X, Chu L, Pei J, Liu W, Bian J (2021) Model complexity of deep learning: a survey. Knowl Inf Syst 63(10):2585–2619

Dubey SR, Singh SK, Chaudhuri BB (2022) Activation functions in deep learning: a comprehensive survey and benchmark. Neurocomputing 503:92–108

Berman D, Buczak A, Chavis J, Corbett C (2019) A survey of deep learning methods for cyber security. Information 10(4):122. https://doi.org/10.3390/info10040122

Tong K, Wu Y (2022) Deep learning-based detection from the perspective of small or tiny objects: a survey. Image Vis Comput 123:104471

Baduge SK, Thilakarathna S, Perera JS, Arashpour M, Sharafi P, Teodosio B, Shringi A, Mendis P (2022) Artificial intelligence and smart vision for building and construction 4.0: machine and deep learning methods and applications. Autom Constr 141:104440

Omitaomu OA, Niu H (2021) Artificial intelligence techniques in smart grid: a survey. Smart Cities 4(2):548–568. https://doi.org/10.3390/smartcities4020029

Akay A, Hess H (2019) Deep learning: current and emerging applications in medicine and technology. IEEE J Biomed Health Inform 23(3):906–920. https://doi.org/10.1109/JBHI.2019.2894713

Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE (2017) A survey of deep neural network architectures and their applications. Neurocomputing 234:11–26

Srinidhi CL, Ciga O, Martel AL (2021) Deep neural network models for computational histopathology: a survey. Med Image Anal 67:101813

Kattenborn T, Leitloff J, Schiefer F, Hinz S (2021) Review on convolutional neural networks (CNN) in vegetation remote sensing. ISPRS J Photogramm Remote Sens 173:24–49

Tugrul B, Elfatimi E, Eryigit R (2022) Convolutional neural networks in detection of plant leaf diseases: a review. Agriculture 12(8):1192

Yadav SP, Zaidi S, Mishra A, Yadav V (2022) Survey on machine learning in speech emotion recognition and vision systems using a recurrent neural network (RNN). Arch Computat Methods Eng 29(3):1753–1770

Mai HT, Lieu QX, Kang J, Lee J (2022) A novel deep unsupervised learning-based framework for optimization of truss structures. Eng Comput 39:1–24

Jiang H, Peng M, Zhong Y, Xie H, Hao Z, Lin J, Ma X, Hu X (2022) A survey on deep learning-based change detection from high-resolution remote sensing images. Remote Sens 14(7):1552

Mousavi SM, Beroza GC (2022) Deep-learning seismology. Science 377(6607):eabm4470

Song X, Li J, Cai T, Yang S, Yang T, Liu C (2022) A survey on deep learning based knowledge tracing. Knowl-Based Syst 258:110036

Wang J, Biljecki F (2022) Unsupervised machine learning in urban studies: a systematic review of applications. Cities 129:103925

Li Y (2022) Research and application of deep learning in image recognition. In: 2022 IEEE 2nd international conference on power, electronics and computer applications (ICPECA), p 994–999

Borowiec ML, Dikow RB, Frandsen PB, McKeeken A, Valentini G, White AE (2022) Deep learning as a tool for ecology and evolution. Methods Ecol Evol 13(8):1640–1660

Wang X et al (2022) Deep reinforcement learning: a survey. IEEE Trans on Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3207346

Pateria S, Subagdja B, Tan AH, Quek C (2021) Hierarchical reinforcement learning: A comprehensive survey. ACM Comput Surv (CSUR) 54(5):1–35

Amroune M (2019) Machine learning techniques applied to on-line voltage stability assessment: a review. Arch Comput Methods Eng 28:273–287

Liu S, Shi R, Huang Y, Li X, Li Z, Wang L, Mao D, Liu L, Liao S, Zhang M et al (2021) A data-driven and data-based framework for online voltage stability assessment using partial mutual information and iterated random forest. Energies 14:715

Ahmad A, Saraswat D, El Gamal A (2023) A survey on using deep learning techniques for plant disease diagnosis and recommendations for development of appropriate tools. Smart Agric Technol 3:100083

Khan A, Khan SH, Saif M, Batool A, Sohail A, Waleed Khan M (2023) A survey of deep learning techniques for the analysis of COVID-19 and their usability for detecting omicron. J Exp Theor Artif Intell. https://doi.org/10.1080/0952813X.2023.2165724

Wang C, Gong L, Wang A, Li X, Hung PCK, Xuehai Z (2017) SOLAR: services-oriented deep learning architectures. IEEE Trans Services Comput 14(1):262–273

Moshayedi AJ, Roy AS, Kolahdooz A, Shuxin Y (2022) Deep learning application pros and cons over algorithm deep learning application pros and cons over algorithm. EAI Endorsed Trans AI Robotics 1(1):1–13

Huang L, Luo R, Liu X, Hao X (2022) Spectral imaging with deep learning. Light: Sci Appl 11(1):61

Bhangale KB, Kothandaraman M (2022) Survey of deep learning paradigms for speech processing. Wireless Pers Commun 125(2):1913–1949

Khojaste-Sarakhsi M, Haghighi SS, Ghomi SF, Marchiori E (2022) Deep learning for Alzheimer’s disease diagnosis: a survey. Artif Intell Med 130:102332

Fu G, Jin Y, Sun S, Yuan Z, Butler D (2022) The role of deep learning in urban water management: a critical review. Water Res 223:118973

Kim L-W (2018) DeepX: deep learning accelerator for restricted Boltzmann machine artificial neural networks. IEEE Trans Neural Netw Learn Syst 29(5):1441–1453

Wang C, Gong L, Yu Q, Li X, Xie Y, Zhou X (2017) DLAU: a scalable deep learning accelerator unit on FPGA. IEEE Trans Comput-Aided Design Integr Circuits Syst 36(3):513–517

Dundar A, Jin J, Martini B, Culurciello E (2017) Embedded streaming deep neural networks accelerator with applications. IEEE Trans Neural Netw Learn Syst 28(7):1572–1583

De Mauro A, Greco M, Grimaldi M, Nobili G (2016) Beyond data scientists: a review of big data skills and job families. In: Proceedings of IFKAD, p 1844–1857

Lin S-B (2019) Generalization and expressivity for deep nets. IEEE Trans Neural Netw Learn Syst 30(5):1392–1406

Gopinath M, Sethuraman SC (2023) A comprehensive survey on deep learning based malware detection techniques. Comp Sci Rev 47:100529

MATH   Google Scholar  

Khalifa NE, Loey M, Mirjalili S (2022) A comprehensive survey of recent trends in deep learning for digital images augmentation. Artif Intell Rev 55:1–27

Peng S, Cao L, Zhou Y, Ouyang Z, Yang A, Li X, Jia W, Yu S (2022) A survey on deep learning for textual emotion analysis in social networks. Digital Commun Netw 8(5):745–762

Tao X, Gong X, Zhang X, Yan S, Adak C (2022) Deep learning for unsupervised anomaly localization in industrial images: a survey. IEEE Trans Instrum Meas 71:1–21. https://doi.org/10.1109/TIM.2022.3196436

Sharifani K, Amini M (2023) Machine learning and deep learning: a review of methods and applications. World Inf Technol Eng J 10(07):3897–3904

Li Q, Peng H, Li J, Xia C, Yang R, Sun L, Yu PS, He L (2022) A survey on text classification: from traditional to deep learning. ACM Trans Intell Syst Technol (TIST) 13(2):1–41

Zhou Z, Xiang Y, Hao Xu, Yi Z, Shi Di, Wang Z (2021) A novel transfer learning-based intelligent nonintrusive load-monitoring with limited measurements. IEEE Trans Instrum Meas 70:1–8

Akram MW, Li G, Jin Y, Chen X, Zhu C, Ahmad A (2020) Automatic detection of photovoltaic module defects in infrared images with isolated and develop-model transfer deep learning. Sol Energy 198:175–186

Karimipour H, Dehghantanha A, Parizi RM, Choo K-KR, Leung H (2019) A deep and scalable unsupervised machine learning system for cyber-attack detection in large-scale smart grids. IEEE Access 7:80778–80788

Moonesar IA, Dass R (2021) Artificial intelligence in health policy—a global perspective. Global J Comput Sci Technol 1:1–7

Mo Y, Wu Y, Yang X, Liu F, Liao Y (2022) Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing 493:626–646

Subramanian N, Elharrouss O, Al-Maadeed S, Chowdhury M (2022) A review of deep learning-based detection methods for COVID-19. Comput Biol Med 143:105233

Tsuneki M (2022) Deep learning models in medical image analysis. J Oral Biosci 64(3):312–320

Pan X, Lin X, Cao D, Zeng X, Yu PS, He L, Nussinov R, Cheng F (2022) Deep learning for drug repurposing: Methods, databases, and applications. Wiley Interdiscip Rev: Computat Mol Sci 12(4):e1597

Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S (2023) Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat Rev Genet 24(2):125–137

Fan Y, Tao B, Zheng Y, Jang S-S (2020) A data-driven soft sensor based on multilayer perceptron neural network with a double LASSO approach. IEEE Trans Instrum Meas 69(7):3972–3979

Menghani G (2023) Efficient deep learning: a survey on making deep learning models smaller, faster, and better. ACM Comput Surv 55(12):1–37

Mehrish A, Majumder N, Bharadwaj R, Mihalcea R, Poria S (2023) A review of deep learning techniques for speech processing. Inf Fus 99:101869

Mohammed A, Kora R (2023) A comprehensive review on ensemble deep learning: opportunities and challenges. J King Saud Univ-Comput Inf Sci 35:757–774

Alzubaidi L, Bai J, Al-Sabaawi A, Santamaría J, Albahri AS, Al-dabbagh BSN, Fadhel MA, Manoufali M, Zhang J, Al-Timemy AH, Duan Y (2023) A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications. J Big Data 10(1):46

Katsogiannis-Meimarakis G, Koutrika G (2023) A survey on deep learning approaches for text-to-SQL. The VLDB J. https://doi.org/10.1007/s00778-022-00776-8

Soori M, Arezoo B, Dastres R (2023) Artificial intelligence, machine learning and deep learning in advanced robotics a review. Cognitive Robotics 3:57–70

Mijwil M, Salem IE, Ismaeel MM (2023) The significance of machine learning and deep learning techniques in cybersecurity: a comprehensive review. Iraqi J Comput Sci Math 4(1):87–101

de Oliveira RA, Bollen MH (2023) Deep learning for power quality. Electr Power Syst Res 214:108887

Yin L, Gao Qi, Zhao L, Zhang B, Wang T, Li S, Liu H (2020) A review of machine learning for new generation smart dispatch in power systems. Eng Appl Artif Intell 88:103372

Luong NC et al. (2019) Applications of deep reinforcement learning in communications and networking: a survey. In: IEEE communications surveys & tutorials, vol 21, no 4, p 3133–3174, https://doi.org/10.1109/COMST.2019.2916583

Kiran BR et al (2022) Deep reinforcement learning for autonomous driving: a survey. IEEE Trans Intell Transp Syst 23(6):4909–4926. https://doi.org/10.1109/TITS.2021.3054625

Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA (2017) Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process Mag 34(6):26–38. https://doi.org/10.1109/MSP.2017.2743240

Levine S, Kumar A, Tucker G, Fu J (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643

Vinuesa R, Azizpour H, Leite I, Balaam M, Dignum V, Domisch S, Felländer A, Langhans SD, Tegmark M, Nerini FF (2020) The role of artificial intelligence in achieving the sustainable development goals. Nature Commun. https://doi.org/10.1038/s41467-019-14108-y

Khoei TT, Kaabouch N (2023) ACapsule Q-learning based reinforcement model for intrusion detection system on smart grid. In: 2023 IEEE international conference on electro information technology (eIT), Romeoville, IL, USA, pp 333–339. https://doi.org/10.1109/eIT57321.2023.10187374

Hoi SC, Sahoo D, Lu J, Zhao P (2021) Online learning: a comprehensive survey. Neurocomputing 459:249–289

Celard P, Iglesias EL, Sorribes-Fdez JM, Romero R, Vieira AS, Borrajo L (2023) A survey on deep learning applied to medical images: from simple artificial neural networks to generative models. Neural Comput Appl 35(3):2291–2323

Mohammad-Rahimi H, Rokhshad R, Bencharit S, Krois J, Schwendicke F (2023) Deep learning: a primer for dentists and dental researchers. J Dent 130:104430

Liu Z, Tong L, Chen L, Jiang Z, Zhou F, Zhang Q, Zhang X, Jin Y, Zhou H (2023) Deep learning based brain tumor segmentation: a survey. Complex Intell Syst 9(1):1001–1026

Zheng Y, Xu Z, Xiao A (2023) Deep learning in economics: a systematic and critical review. Artif Intell Rev 4:1–43

Jia T, Kapelan Z, de Vries R, Vriend P, Peereboom EC, Okkerman I, Taormina R (2023) Deep learning for detecting macroplastic litter in water bodies: a review. Water Res 231:119632

Newbury R, Gu M, Chumbley L, Mousavian A, Eppner C, Leitner J, Bohg J, Morales A, Asfour T, Kragic D, Fox D (2023) Deep learning approaches to grasp synthesis: a review. IEEE Trans Robotics. https://doi.org/10.1109/TRO.2023.3280597

Shafay M, Ahmad RW, Salah K, Yaqoob I, Jayaraman R, Omar M (2023) Blockchain for deep learning: review and open challenges. Clust Comput 26(1):197–221

Benczúr AA., Kocsis L, Pálovics R (2018) Online machine learning in big data streams. arXiv preprint arXiv:1802.05872

Shalev-Shwartz S (2011) Online learning and online convex optimization. Found Trends® Mach Learn 4(2):107–194

Millán Giraldo M, Sánchez Garreta JS (2008) A comparative study of simple online learning strategies for streaming data. WSEAS Trans Circuits Syst 7(10):900–910

Pinto G, Wang Z, Roy A, Hong T, Capozzoli A (2022) Transfer learning for smart buildings: a critical review of algorithms, applications, and future perspectives. Adv Appl Energy 5:100084

Sayed AN, Himeur Y, Bensaali F (2022) Deep and transfer learning for building occupancy detection: a review and comparative analysis. Eng Appl Artif Intell 115:105254

Li C, Zhang S, Qin Y, Estupinan E (2020) A systematic review of deep transfer learning for machinery fault diagnosis. Neurocomputing 407:121–135

Li W, Huang R, Li J, Liao Y, Chen Z, He G, Yan R, Gryllias K (2022) A perspective survey on deep transfer learning for fault diagnosis in industrial scenarios: theories, applications and challenges. Mech Syst Signal Process 167:108487

Wan Z, Yang R, Huang M, Zeng N, Liu X (2021) A review on transfer learning in EEG signal analysis. Neurocomputing 421:1–14

Tan C, Sun F, Kong T (2018) A survey on deep transfer learning.In: Proceedings of international conference on artificial neural networks. p 270–279

Qian F, Gao W, Yang Y, Yu D et al (2020) Potential analysis of the transfer learning model in short and medium-term forecasting of building HVAC energy consumption. Energy 193:116724

Weber M, Doblander C, Mandl P, (2020b). Towards the detection of building occupancy with synthetic environmental data. arXiv preprint arXiv:2010.04209

Zhu H, Xu J, Liu S, Jin Y (2021) Federated learning on non-IID data: a survey. Neurocomputing 465:371–390

Ouadrhiri AE, Abdelhadi A (2022) Differential privacy for deep and federated learning: a survey. IEEE Access 10:22359–22380. https://doi.org/10.1109/ACCESS.2022.3151670

Zhang C, Xie Y, Bai H, Yu B, Li W, Gao Y (2021) A survey on federated learning. Knowl-Based Syst 216:106775

Banabilah S, Aloqaily M, Alsayed E, Malik N, Jararweh Y (2022) Federated learning review: fundamentals, enabling technologies, and future applications. Inf Process Manag 59(6):103061

Mothukuri V, Parizi RM, Pouriyeh S, Huang Y, Dehghantanha A, Srivastava G (2021) A survey on security and privacy of federated learning. Futur Gener Comput Syst 115:619–640

McMahan HB, Moore E, Ramage D, Hampson S, Arcas BA (2017) Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th international conference on artificial intelligence and statistics, AISTATS

Hardy S, Henecka W, Ivey-Law H, Nock R, Patrini G, Smith G, Thorne B (2017) Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677

Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H et al (2015) Xgboost: extreme gradient boosting. R Package Vers 1:4–2

Heng K, Fan T, Jin Y, Liu Y, Chen T, Yang Q (2019) Secureboost: a lossless federated learning framework. arXiv preprint arXiv:1901.08755

Konečný J, McMahan HB, Yu FX, Richtárik P, Suresh AT, Bacon D (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492

Hamedani L, Liu R, Atat J, Wu Y (2017) Reservoir computing meets smart grids: attack detection using delayed feedback networks. IEEE Trans Industr Inf 14(2):734–743

Yuan X, Xie L, Abouelenien M (2018) A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data. Pattern Recogn 77:160–172

Xiao B, Xiong J, Shi Y (2016) Novel applications of deep learning hidden features for adaptive testing. In: Proceedings of the 21st Asia and South Pacifc design automation conference, p 743–748

Zhong SH, Li Y, Le B (2015) Query oriented unsupervised multi document summarization via deep learning. Expert Syst Appl 42:1–10

Vincent P et al (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408

Alom MZ et al. (2017) Object recognition using cellular simultaneous recurrent networks and convolutional neural network. In: Neural networks (IJCNN), international joint conference on IEEE

Quang W, Stokes JW (2016) MtNet: a multi-task neural network for dynamic malware classification. in: proceedings of the international conference detection of intrusions and malware, and vulnerability assessment, Donostia-San Sebastián, Spain, 7–8 July, p 399–418

Kamilaris A, Prenafeta-Boldú FX (2018) Deep learning in agriculture: a survey. Comput Electron Agric 147:70–90

Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Van Der Laak JA, Van Ginneken B, Sánchez CI (2017) A survey on deep learning in medical image analysis. Med Image Anal 42:60–88

Gheisari M, Ebrahimzadeh F, Rahimi M, Moazzamigodarzi M, Liu Y, Dutta Pramanik PK, Heravi MA, Mehbodniya A, Ghaderzadeh M, Feylizadeh MR, Kosari S (2023) Deep learning: applications, architectures, models, tools, and frameworks: a comprehensive survey. CAAI Trans Intell Technol. https://doi.org/10.1049/cit2.12180

Pichler M, Hartig F (2023) Machine learning and deep learning—a review for ecologists. Methods Ecol Evolut 14(4):994–1016

Wang N, Chen T, Liu S, Wang R, Karimi HR, Lin Y (2023) Deep learning-based visual detection of marine organisms: a survey. Neurocomputing 532:1–32

Lee M (2023) The geometry of feature space in deep learning models: a holistic perspective and comprehensive review. Mathematics 11(10):2375

Xu M, Yoon S, Fuentes A, Park DS (2023) A comprehensive survey of image augmentation techniques for deep learning. Pattern Recogn 137:109347

Minaee S, Abdolrashidi A, Su H, Bennamoun M, Zhang D (2023) Biometrics recognition using deep learning: a survey. Artif Intell Rev 56:1–49

Xiang H, Zou Q, Nawaz MA, Huang X, Zhang F, Yu H (2023) Deep learning for image inpainting: a survey. Pattern Recogn 134:109046

Chakraborty S, Mali K (2022) An overview of biomedical image analysis from the deep learning perspective. Research anthology on improving medical imaging techniques for analysis and intervention. IGI Global, Hershey, pp 43–59

Lestari, N.I., Hussain, W., Merigo, J.M. and Bekhit, M., 2023, January. A Survey of Trendy Financial Sector Applications of Machine and Deep Learning. In: Application of big data, blockchain, and internet of things for education informatization: second EAI international conference, BigIoT-EDU 2022, Virtual Event, July 29–31, 2022, Proceedings, Part III, Springer Nature, Cham, p. 619–633

Chaddad A, Peng J, Xu J, Bouridane A (2023) Survey of explainable AI techniques in healthcare. Sensors 23(2):634

Grumiaux PA, Kitić S, Girin L, Guérin A (2022) A survey of sound source localization with deep learning methods. J Acoust Soc Am 152(1):107–151

Zaidi SSA, Ansari MS, Aslam A, Kanwal N, Asghar M, Lee B (2022) A survey of modern deep learning based object detection models. Digital Signal Process 126:103514

Dong J, Zhao M, Liu Y, Su Y, Zeng X (2022) Deep learning in retrosynthesis planning: datasets, models and tools. Brief Bioinf 23(1):391

Zhan ZH, Li JY, Zhang J (2022) Evolutionary deep learning: a survey. Neurocomputing 483:42–58

Matsubara Y, Levorato M, Restuccia F (2022) Split computing and early exiting for deep learning applications: survey and research challenges. ACM Comput Surv 55(5):1–30

Zhang B, Rong Y, Yong R, Qin D, Li M, Zou G, Pan J (2022) Deep learning for air pollutant concentration prediction: a review. Atmos Environ 290:119347

Yu X, Zhou Q, Wang S, Zhang YD (2022) A systematic survey of deep learning in breast cancer. Int J Intell Syst 37(1):152–216

Behrad F, Abadeh MS (2022) An overview of deep learning methods for multimodal medical data mining. Expert Syst Appl 200:117006

Mittal S, Srivastava S, Jayanth JP (2022) A survey of deep learning techniques for underwater image classification. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3143887

Tercan H, Meisen T (2022) Machine learning and deep learning based predictive quality in manufacturing: a systematic review. J Intell Manuf 33(7):1879–1905

Stefanini M, Cornia M, Baraldi L, Cascianelli S, Fiameni G, Cucchiara R (2022) From show to tell: a survey on deep learning-based image captioning. IEEE Trans Pattern Anal Mach Intell 45(1):539–559

Caldas S, Konečný J, McMahan HB, Talwalkar A (2018) Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210

Chen Y, Sun X, Jin Y (2019) Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation. IEEE Trans Neural Netw Learn Syst 31:4229–4238

Zhu H, Jin Y (2019) Multi-objective evolutionary federated learning. IEEE Trans Neural Netw Learn Syst 31:1310–1322

Download references

Acknowledgements

The authors acknowledge the support of the National Science Foundation (NSF), Award Number 2006674.

Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, University of North Dakota, Grand Forks, ND, 58202, USA

Tala Talaei Khoei, Hadjar Ould Slimane & Naima Kaabouch

Corresponding author

Correspondence to Tala Talaei Khoei.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest relevant to this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Talaei Khoei, T., Ould Slimane, H. & Kaabouch, N. Deep learning: systematic review, models, challenges, and research directions. Neural Comput & Applic 35, 23103–23124 (2023). https://doi.org/10.1007/s00521-023-08957-4

Received: 31 May 2023

Accepted: 15 August 2023

Published: 07 September 2023

Issue Date: November 2023

DOI: https://doi.org/10.1007/s00521-023-08957-4

Keywords

  • Artificial intelligence
  • Neural networks
  • Deep learning
  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning
  • Online learning
  • Federated learning
  • Transfer learning