Transformer: A Novel Neural Network Architecture for Language Understanding

Neural networks, in particular recurrent neural networks (RNNs), are now at the core of the leading approaches to language understanding tasks such as language modeling, machine translation, and question answering. In “Attention Is All You Need”, we introduce the Transformer, a novel neural network architecture based on a self-attention mechanism that we believe to be particularly well suited for language understanding.

In our paper, we show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks. On top of higher translation quality, the Transformer requires less computation to train and is a much better fit for modern machine learning hardware, speeding up training by up to an order of magnitude.

Accuracy and Efficiency in Language Understanding

Neural networks usually process language by generating fixed- or variable-length vector-space representations. After starting with representations of individual words or even pieces of words, they aggregate information from surrounding words to determine the meaning of a given bit of language in context. For example, deciding on the most likely meaning and appropriate representation of the word “bank” in the sentence “I arrived at the bank after crossing the…” requires knowing if the sentence ends in “... road.” or “... river.”

RNNs have in recent years become the typical network architecture for translation, processing language sequentially in a left-to-right or right-to-left fashion. Because they read one word at a time, RNNs must perform multiple steps to make decisions that depend on words far away from each other. Processing the example above, an RNN could only determine that “bank” is likely to refer to the bank of a river after reading each word between “bank” and “river” step by step. Prior research has shown that, roughly speaking, the more such steps a decision requires, the harder it is for a recurrent network to learn how to make it.

The sequential nature of RNNs also makes it more difficult to fully take advantage of modern fast computing devices such as TPUs and GPUs, which excel at parallel and not sequential processing. Convolutional neural networks (CNNs) are much less sequential than RNNs, but in CNN architectures like ByteNet or ConvS2S the number of steps required to combine information from distant parts of the input still grows with increasing distance.

The Transformer

In contrast, the Transformer only performs a small, constant number of steps (chosen empirically). In each step, it applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position. In the earlier example “I arrived at the bank after crossing the river”, to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately attend to the word “river” and make this decision in a single step. In fact, in our English-French translation model we observe exactly this behavior.

More specifically, to compute the next representation for a given word, “bank” for example, the Transformer compares it to every other word in the sentence. The result of these comparisons is an attention score for every other word in the sentence. These attention scores determine how much each of the other words should contribute to the next representation of “bank”. In the example, the disambiguating “river” could receive a high attention score when computing a new representation for “bank”. The attention scores are then used as weights for a weighted average of all words’ representations, which is fed into a fully-connected network to generate a new representation for “bank”, reflecting that the sentence is talking about a river bank.
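As a rough sketch of this computation (hypothetical random embeddings, and without the learned projections and scaling the full model uses), the following Python snippet scores one word against every word in the sentence and returns the softmax-weighted average:

import numpy as np

def attend(word_vecs, query_index):
    # Toy self-attention for one word: score it against every word in the
    # sentence, normalize the scores with a softmax, and return the weighted
    # average of all word vectors. The real model adds learned projections
    # and a scaling factor; this is only the conceptual skeleton.
    query = word_vecs[query_index]              # e.g., the vector for "bank"
    scores = word_vecs @ query                  # one attention score per word
    weights = np.exp(scores - scores.max())     # softmax (numerically stable)
    weights /= weights.sum()
    return weights @ word_vecs                  # context-informed representation

# Hypothetical sentence of 4 words with random 8-dimensional embeddings:
vecs = np.random.randn(4, 8)
new_bank = attend(vecs, query_index=2)          # new representation for word 3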

The animation below illustrates how we apply the Transformer to machine translation. Neural networks for machine translation typically contain an encoder reading the input sentence and generating a representation of it. A decoder then generates the output sentence word by word while consulting the representation generated by the encoder. The Transformer starts by generating initial representations, or embeddings, for each word. These are represented by the unfilled circles. Then, using self-attention, it aggregates information from all of the other words, generating a new representation per word informed by the entire context, represented by the filled circles. This step is then repeated multiple times in parallel for all words, successively generating new representations.

The decoder operates similarly, but generates one word at a time, from left to right. It attends not only to the other previously generated words, but also to the final representations generated by the encoder.
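A minimal sketch of that left-to-right loop, where decoder_step is a hypothetical stand-in for one pass through the decoder stack:

def greedy_decode(encoder_states, decoder_step, bos_id, eos_id, max_len=50):
    # Generate the output sentence one word at a time, left to right. At each
    # step, the (hypothetical) decoder_step function attends to the words
    # generated so far and to the encoder's final representations, and
    # returns a score for every word in the vocabulary.
    output = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(output, encoder_states)
        next_id = int(logits.argmax())          # pick the most likely next word
        output.append(next_id)
        if next_id == eos_id:                   # stop at end-of-sentence token
            break
    return output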

Flow of Information

Beyond computational performance and higher accuracy, another intriguing aspect of the Transformer is that we can visualize what other parts of a sentence the network attends to when processing or translating a given word, thus gaining insights into how information travels through the network.

To illustrate this, we chose an example involving a phenomenon that is notoriously challenging for machine translation systems: coreference resolution. Consider the following sentences and their French translations:

[Figure: the sentence pairs “The animal didn’t cross the street because it was too tired.” and “The animal didn’t cross the street because it was too wide.”, with their French translations]

It is obvious to most that in the first sentence pair “it” refers to the animal, and in the second to the street. When translating these sentences to French or German, the translation for “it” depends on the gender of the noun it refers to, and in French “animal” and “street” have different genders. In contrast to the current Google Translate model, the Transformer translates both of these sentences to French correctly. Visualizing what words the encoder attended to when computing the final representation for the word “it” sheds some light on how the network made the decision. In one of its steps, the Transformer clearly identified the two nouns “it” could refer to, and the respective amount of attention reflects its choice in the different contexts.

Given this insight, it might not be that surprising that the Transformer also performs very well on the classic language analysis task of syntactic constituency parsing, a task the natural language processing community has attacked with highly specialized systems for decades.

In fact, with little adaptation, the same network we used for English to German translation outperformed all but one of the previously proposed approaches to constituency parsing.

We are very excited about the future potential of the Transformer and have already started applying it to other problems involving not only natural language but also very different inputs and outputs, such as images and video. Our ongoing experiments are accelerated immensely by the Tensor2Tensor library, which we recently open-sourced. In fact, after downloading the library you can train your own Transformer networks for translation and parsing by invoking just a few commands. We hope you’ll give it a try, and look forward to seeing what the community can do with the Transformer.

Acknowledgements

This research was conducted by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin. Additional thanks go to David Chenell for creating the animation above.


‘You Transformed the World,’ NVIDIA CEO Tells Researchers Behind Landmark AI Paper

Of GTC’s 900+ sessions, the most wildly popular was a conversation hosted by NVIDIA founder and CEO Jensen Huang with seven of the authors of the legendary research paper that introduced the aptly named transformer — a neural network architecture that went on to change the deep learning landscape and enable today’s era of generative AI.

“Everything that we’re enjoying today can be traced back to that moment,” Huang said to a packed room with hundreds of attendees, who heard him speak with the authors of “Attention Is All You Need.”

Sharing the stage for the first time, the research luminaries reflected on the factors that led to their original paper, which has been cited more than 100,000 times since it was first published and presented at the NeurIPS AI conference. They also discussed their latest projects and offered insights into future directions for the field of generative AI.

While they started as Google researchers, the collaborators are now spread across the industry, most as founders of their own AI companies.

“We have a whole industry that is grateful for the work that you guys did,” Huang said.


Origins of the Transformer Model

The research team initially sought to overcome the limitations of recurrent neural networks, or RNNs, which were then the state of the art for processing language data.

Noam Shazeer, cofounder and CEO of Character.AI, compared RNNs to the steam engine and transformers to the improved efficiency of internal combustion.

“We could have done the industrial revolution on the steam engine, but it would just have been a pain,” he said. “Things went way, way better with internal combustion.”

“Now we’re just waiting for the fusion,” quipped Illia Polosukhin, cofounder of blockchain company NEAR Protocol.

The paper’s title came from a realization that attention mechanisms — an element of neural networks that enable them to determine the relationship between different parts of input data — were the most critical component of their model’s performance.

“We had very recently started throwing bits of the model away, just to see how much worse it would get. And to our surprise it started getting better,” said Llion Jones, cofounder and chief technology officer at Sakana AI.

Having a name as general as “transformers” spoke to the team’s ambitions to build AI models that could process and transform every data type — including text, images, audio, tensors and biological data.

“That North Star, it was there on day zero, and so it’s been really exciting and gratifying to watch that come to fruition,” said Aidan Gomez, cofounder and CEO of Cohere. “We’re actually seeing it happen now.”


Envisioning the Road Ahead 

Adaptive computation, where a model adjusts how much computing power is used based on the complexity of a given problem, is a key factor the researchers see improving in future AI models.

“It’s really about spending the right amount of effort and ultimately energy on a given problem,” said Jakob Uszkoreit, cofounder and CEO of biological software company Inceptive. “You don’t want to spend too much on a problem that’s easy or too little on a problem that’s hard.”

A math problem like two plus two, for example, shouldn’t be run through a trillion-parameter transformer model — it should run on a basic calculator, the group agreed.

They’re also looking forward to the next generation of AI models.

“I think the world needs something better than the transformer,” said Gomez. “I think all of us here hope it gets succeeded by something that will carry us to a new plateau of performance.”

“You don’t want to miss these next 10 years,” Huang said. “Unbelievable new capabilities will be invented.”

The conversation concluded with Huang presenting each researcher with a framed cover plate of the NVIDIA DGX-1 AI supercomputer, signed with the message, “You transformed the world.”


There’s still time to catch the session replay by registering for a virtual GTC pass — it’s free.

To discover the latest in generative AI, watch Huang’s GTC keynote address.


Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review

Associated Data

No new data were created or analyzed in this study.

Simple Summary

The rapidly advancing field of deep learning, specifically transformer-based architectures and attention mechanisms, has found substantial applicability in bioinformatics and genome data analysis. Given the analogous nature of genome sequences to language texts, these techniques initially successful in natural language processing have been applied to genomic data. This review provides an in-depth analysis of the most recent advancements and applications of these techniques to genome data, critically evaluating their advantages and limitations. By investigating studies from 2019 to 2023, this review identifies potential future research areas, thereby encouraging further advancements in the field.

Abstract

The emergence and rapid development of deep learning, specifically transformer-based architectures and attention mechanisms, have had transformative implications across several domains, including bioinformatics and genome data analysis. The analogous nature of genome sequences to language texts has enabled the application of techniques that have exhibited success in fields ranging from natural language processing to genomic data. This review provides a comprehensive analysis of the most recent advancements in the application of transformer architectures and attention mechanisms to genome and transcriptome data. The focus of this review is on the critical evaluation of these techniques, discussing their advantages and limitations in the context of genome data analysis. With the swift pace of development in deep learning methodologies, it becomes vital to continually assess and reflect on the current standing and future direction of the research. Therefore, this review aims to serve as a timely resource for both seasoned researchers and newcomers, offering a panoramic view of the recent advancements and elucidating the state-of-the-art applications in the field. Furthermore, this review paper serves to highlight potential areas of future investigation by critically evaluating studies from 2019 to 2023, thereby acting as a stepping-stone for further research endeavors.

1. Introduction

The revolution of deep learning methodologies has invigorated the field of bioinformatics and genome data analysis, establishing a foundation for ground-breaking advancements and novel insights [ 1 , 2 , 3 , 4 , 5 , 6 ]. Recently, the development and application of transformer-based architectures and attention mechanisms have demonstrated superior performance and capabilities in handling the inherent complexity of genome data. Deep learning techniques, particularly those utilizing transformer architectures and attention mechanisms, have shown remarkable success in various domains such as natural language processing (NLP) [ 7 ] and computer vision [ 8 , 9 , 10 ]. These accomplishments have motivated their rapid adoption into bioinformatics, given the similar nature of genome sequences to language texts. Genome sequences can be interpreted as the language of biology, and thus, tools proficient in handling language data can potentially decipher the hidden patterns within these sequences.

The attention mechanism, first introduced in sequence-to-sequence models [ 11 ], has revolutionized how deep learning models handle and interpret data [ 12 , 13 , 14 , 15 , 16 , 17 , 18 ]. This technique was designed to circumvent the limitations of traditional recurrent models by providing a mechanism to attend to different parts of the input sequence when generating the output. In the context of genome data, this implies the ability to consider different genomic regions and their relations dynamically during the interpretation process. The attention mechanism computes a weighted sum of input features, where the weights, also known as attention scores, are dynamically determined based on the input data. This mechanism allows the model to focus more on essential or relevant features and less on irrelevant or less important ones.

Inspired by the success of attention mechanisms, the transformer model was proposed as a complete shift from the sequential processing nature of recurrent neural networks (RNNs) and their variants [ 19 , 20 , 21 , 22 ]. The transformer model leverages attention mechanisms to process the input data in parallel, allowing for faster and more efficient computations. The architecture of the transformer model is composed of a stack of identical transformer modules, each with two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Using this architecture, transformer models can capture the dependencies between inputs and outputs without regard for their distance in the sequence.
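As an illustrative sketch (not any specific published implementation), one such module can be written in PyTorch as follows, using the base configuration's dimensions and the residual-plus-layer-normalization wrapping of the original architecture:

import torch.nn as nn

class TransformerBlock(nn.Module):
    # One transformer module as described above: a multi-head self-attention
    # sub-layer followed by a position-wise feed-forward sub-layer. Each
    # sub-layer is wrapped in a residual connection and layer normalization.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)        # every position attends to all
        x = self.norm1(x + attn_out)            # residual + normalization
        return self.norm2(x + self.ff(x))       # feed-forward sub-layer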

The potential of transformer-based architectures and attention mechanisms in genome data analysis is vast and largely unexplored. They present a promising solution to tackle the massive scale and intricate nature of genomic data. The ability to capture long-range dependencies between genomic positions, consider multiple relevant genomic regions simultaneously, and adaptively focus on salient features makes these methods uniquely suited for genomic applications. This review paper seeks to highlight and investigate the innovative applications of these methods in genome data analysis, critically assess their advantages and limitations, and provide future research directions.

The surge of research in this domain has led to a voluminous influx of studies and publications, each contributing new findings, methods, and perspectives. While this rapid proliferation of research is a testament to the field’s dynamism, it also poses a challenge for researchers to keep pace with the advancements. Hence, the necessity for comprehensive review papers that curate, synthesize, and cohesively present these findings is paramount.

This review paper aims to provide a rigorous and up-to-date synthesis of the proliferating literature in this field. Given the swift pace of development in deep learning methodologies, it is critical to continually assess and reflect on the current standing and future direction of the research. This review will serve as a timely resource for both seasoned researchers and newcomers to the field, offering a panoramic view of the recent advancements and elucidating the state-of-the-art applications of transformer architectures and attention mechanisms in genome data analysis.

This review undertakes a systematic and critical assessment of the most recent studies spanning 2019 to 2023. Thoroughly examining these publications aims to provide novel perspectives, detect existing research gaps, and propose avenues for future investigation. Moreover, this review aims to highlight the far-reaching implications and potential benefits associated with the application of advanced deep learning techniques in the analysis of genome data. By investigating these advances, it seeks to inspire and stimulate further research endeavors and technological breakthroughs in the dynamic field of bioinformatics.

2. Deep Learning with Transformers and Attention Mechanism

2.1. Conventional Architectures of Deep Learning

In recent years, the field of biomedicine has observed a significant upsurge in the application of machine learning and, more particularly, deep learning methods. These advanced techniques have been instrumental in unearthing insights from complex biomedical datasets, enabling progress in disease diagnosis, drug discovery, and genetic research.

Deep learning, or deep neural networks (DNNs), employs artificial neural networks with multiple layers, a feature that makes it remarkably capable of learning complex patterns from large datasets [ 23 ]. One of the simplest forms of a neural network is the multilayer perceptron (MLP), which contains an input layer, one or more hidden layers, and an output layer. MLPs are proficient at handling datasets where inputs and outputs share a linear or non-linear relationship. However, they are less effective when dealing with spatial or temporal data, a limitation overcome by more sophisticated deep learning models such as convolutional neural networks (CNNs) [ 24 ] and RNNs [ 25 ].
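For concreteness, a minimal MLP of the kind described might look like the following sketch, with arbitrary placeholder layer sizes:

import torch.nn as nn

# A minimal MLP: an input layer, one hidden layer with a non-linear
# activation (which lets the network fit non-linear relationships), and an
# output layer. The sizes here are illustrative placeholders only.
mlp = nn.Sequential(
    nn.Linear(100, 64),   # input features -> hidden units
    nn.ReLU(),
    nn.Linear(64, 2),     # hidden units -> outputs (e.g., two classes)
)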

CNNs are exceptionally efficient at processing spatial data, such as images, due to their ability to capture local dependencies in data using convolutional layers. In biomedicine, CNNs have proved instrumental in tasks like medical image analysis and tissue phenotyping.

RNNs, including their advanced variant, long short-term memory (LSTM) networks, are designed to handle sequential data by incorporating a memory-like mechanism, allowing them to learn from previous inputs in the sequence. This property makes them valuable in predicting protein sequences or understanding genetic sequences in bioinformatics.

Generative adversarial networks (GANs), a game-changer in the field, consist of two neural networks, the generator and the discriminator, that compete [ 26 , 27 , 28 , 29 , 30 , 31 , 32 ]. This unique architecture enables the generation of new, synthetic data instances that resemble the training data, a feature that holds promise in drug discovery and personalized medicine.

Several other variants of deep learning techniques also exist. For instance, graph attention leverages the attention mechanism to weigh the influence of nodes in a graph, playing a crucial role in molecular biology for structure recognition. Residual networks (ResNets) use shortcut connections to solve the problem of vanishing gradients in deep networks, a feature that can be valuable in medical image analysis. AdaBoost, a boosting algorithm, works by combining multiple weak classifiers to create a strong classifier. Seq2Vec is an approach for sequence data processing where the sequence is converted into a fixed-length vector representation. Finally, variational autoencoders (VAE) are generative models that can learn a latent representation of the input data, offering significant potential in tasks like anomaly detection or dimensionality reduction in complex biomedical data.

2.2. Transformers and Attention Mechanism

The transformer model represents a watershed moment in the evolution of deep learning models [ 33 ]. Distinct from conventional sequence transduction models, which typically involve recurrent or convolutional layers, the transformer model solely harnesses attention mechanisms, setting a new precedent in tasks such as machine translation and natural language processing (NLP).

The principal component of a transformer model is the attention mechanism, and it comes in two forms: self-attention (also referred to as intra-attention) and multi-head attention. The attention mechanism’s core function is to model interactions between different elements in a sequence, thereby capturing the dependencies among them without regard to their positions in the sequence. In essence, it determines the extent to which to pay attention to various parts of the input when producing a particular output.

Self-attention mechanisms operate by creating a representation of each element in a sequence that captures the impact of all other elements in the sequence. This is achieved by computing a score for each pair of elements, applying a softmax function to obtain weights, and then using these weights to form a weighted sum of the original element representations. Consequently, it allows each element in the sequence to interact with all other elements, providing a more holistic picture of the entire sequence.

The multi-head attention mechanism, on the other hand, is essentially multiple self-attention mechanisms, or heads, operating in parallel. Each head independently computes a different learned linear transformation of the input, and their outputs are concatenated and linearly transformed to result in the final output. This enables the model to capture various types of relationships and dependencies in the data.
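A sketch of this computation, assuming the per-head projection matrices Wq, Wk, Wv and the output matrix Wo are given (in practice they are learned):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # Each head applies its own projections and runs self-attention
    # independently; the head outputs are then concatenated and passed
    # through a final linear transformation. Wq, Wk, and Wv are lists with
    # one projection matrix per head.
    d_k = Wk[0].shape[1]
    heads = []
    for q_proj, k_proj, v_proj in zip(Wq, Wk, Wv):
        Q, K, V = X @ q_proj, X @ k_proj, X @ v_proj
        scores = softmax(Q @ K.T / np.sqrt(d_k))   # scaled dot-product scores
        heads.append(scores @ V)                   # per-head weighted sum
    return np.concatenate(heads, axis=-1) @ Wo     # concatenate + final linear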

In addition to the self-attention mechanism, another critical aspect of the transformer architecture is the incorporation of positional encoding. Given that the model itself is permutation-invariant (i.e., it does not have any inherent notion of the order of the input elements), there is a necessity for some method to incorporate information about the position of the elements within the sequence. Positional encoding serves this purpose.

Positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. These embeddings are learned or fixed, and their purpose is to inject information about the relative or absolute positions of the words in the sequence. The addition of positional encodings enables the model to make use of the order of the sequence, which is critical for understanding structured data like language.

One common approach to positional encoding is to use sine and cosine functions of different frequencies. With this approach, each dimension of the positional encoding corresponds to a sine or cosine function, and the wavelengths form a geometric progression from 2π to 10,000 × 2π.
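A common implementation of this scheme, assuming an even model dimension, looks like the following sketch:

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding: each pair of dimensions uses sine and
    # cosine at one frequency, and the wavelengths form a geometric
    # progression from 2*pi up to 10,000 * 2*pi. Assumes an even d_model.
    positions = np.arange(seq_len)[:, None]            # 0 .. seq_len - 1
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                 # even dims: sine
    encoding[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return encoding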

One of the key advantages of the transformer model is its ability to handle long-range dependencies in the data, an aspect where traditional RNNs struggle due to their sequential processing and CNNs due to their local receptive fields. By allowing all elements in the sequence to interact simultaneously, transformers alleviate the need for compressing all information into a fixed-size hidden state, which often leads to information loss in long sequences.

Positional encoding, as noted above, counters the absence of inherent positional information in attention mechanisms. This is crucial, especially in tasks where the order of the elements carries significant information.

The transformer’s self-attention mechanism involves three crucial components: the query (Q), key (K), and value (V). These components originate from the input representations and are created by multiplying the input by the respective learned weight matrices. Each of these components carries a unique significance in the attention mechanism.

In detail, the query corresponds to the element for which we are trying to compute the context-dependent representation. The key relates to the elements that we are comparing the query against to determine the weights. Finally, the value is the element that gets weighted by the attention score (resulting from the comparison of the query with the key) to generate the final output.

The self-attention mechanism operates by calculating an attention score for each query–key pair. It does so by taking their dot product and then applying a softmax function to ensure that the weights fall between zero and one and sum to one. This provides a normalized measure of importance, or attention, that the model assigns to each element when encoding a particular element.

Following the calculation of attention scores, the model computes a weighted sum of the value vectors, where the weights are given by the attention scores. This operation results in the context-sensitive encoding of each element, where the context depends on all other elements in the sequence. Such encodings are then used as inputs to the next layer in the transformer model.
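Collecting these steps in matrix form gives the scaled dot-product attention formula of the original transformer paper, in which the dot products are additionally divided by the square root of the key dimension d_k to keep the softmax inputs in a well-behaved range:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Here Q, K, and V are the matrices obtained by multiplying the input representations by the learned weight matrices described above.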

The use of the Q, K, and V matrices allows the model to learn to focus on different aspects of the input data and enables it to discern which pieces of information are critical when encoding a particular element. As such, the transformer’s attention mechanism brings a significant degree of flexibility and power to the model, allowing it to handle a wide variety of tasks in an efficient and effective manner. The structures of the transformer architecture and the attention mechanism are depicted in Figure 1.

Figure 1. Illustration of the transformer architecture and the attention mechanism. (A) Transformer structure; (B) Attention mechanism.

3.1. Publication Selection Process

The paper selection process was designed to ensure the inclusion of high-quality research contributions pertinent to our review’s focal area. To achieve this, we primarily leveraged algorithmic approaches relying on the academic search engine, Web of Science (WOS). We carefully chose our search keywords, focusing on core terminologies such as “deep learning transformer”, “attention method”, “RNAs”, and “genome data”. This meticulous selection of search keywords was instrumental in identifying relevant articles for inclusion in our review.

To present a structured overview of the transformer architecture and the attention mechanism in the context of genomic data, we classified the selected papers according to research topic, i.e., the specific application of transformers and attention methods. This classification aims to contribute to a comprehensive understanding of the intersection of deep learning with transformers and attention mechanisms for genomic data. It accentuates the comprehension of various methodologies employed within the field. For a concise summary of the reviewed papers, refer to Table 1 . We acknowledge that several papers could fit multiple categories, but for the purpose of this review, each paper was classified under a single category that best captures the paper’s core theme.

Table 1. Overview of Applications of Transformer Architecture and Attention Mechanism for Genome Data.

Our review focuses solely on peer-reviewed journal articles, deliberately excluding preprints and conference papers despite their abundance in the field. We enforced this criterion to uphold the reliability and validity of the review, thereby ensuring that only studies subjected to rigorous peer-review scrutiny were included. To maintain the novelty and originality of the review, we also intentionally excluded certain types of articles, such as review articles and perspectives, emphasizing primary research-based studies as per our review’s intent.

We limited our review’s temporal span to articles published from 2019 to 2023. This constraint ensures that our review remains current and relevant, providing a comprehensive understanding of the most recent advancements and trends in the field of deep learning for genomic data. As noted above, the decision to include only peer-reviewed journal articles was driven by two main factors: first, the peer review process is a crucial mechanism for upholding the quality and reliability of the scientific literature by subjecting research to rigorous examination by domain experts; second, peer-reviewed journals are traditionally deemed reliable and trusted sources for publishing scientifically sound and influential research.

We carried out data collection for 2023 up until May, aligning with our current schedule, thereby ensuring that the review’s currency aligns with the field’s latest developments. During the data collection process, we compiled information on the number of citations and publication logs for each selected article. These data were paramount for evaluating the scope, impact, and acceptance of the research within the scientific community. We shall analyze these data in the subsequent sections of this review.

Certain studies were excluded based on specific criteria. Review articles were not considered due to our focus on primary research. Studies employing only machine learning methodologies without deep learning elements were also excluded. Furthermore, papers that did not directly relate to genomic data, such as those focusing on image segmentation, were left out, despite the general applicability of the attention mechanism to image data. Hence, such image-related studies were manually removed from our review.

3.2. Journals of Published Papers

Table 2 illustrates the distribution of published articles focusing on the application of the transformer architecture and the attention mechanism for genome data, across a variety of scientific journals.

Table 2. Distribution of Published Articles across Different Journals.

It is evident from Table 2 that ‘Briefings in Bioinformatics’ has the highest number of publications (20), constituting 16.1% of the total studies in this domain. The ‘Bioinformatics’, ‘BMC Bioinformatics’, and ‘Frontiers in Genetics’ journals follow closely, each contributing 7.3% of the total publications. Journals such as ‘PLOS Computational Biology’, ‘Nature Communications’, and ‘Interdisciplinary Sciences-Computational Life Sciences’ account for about 3.2% each.

Furthermore, there is a considerable portion of articles (27.4%) distributed in various other journals, each contributing fewer than two publications. These results exhibit a wide dissemination of research on this topic across various journals, suggesting cross-disciplinary interest and influence of transformer architecture and attention mechanism applications within the field of genome data analysis.

3.3. Year-Wise Analysis of Publications

As illustrated in Figure 2A, the trend of publications on the use of the transformer architecture and the attention mechanism for genome data shows a significant increase over the past few years. The year-over-year growth in publications illustrates the fast-emerging interest and intensive research activity in this field.

Figure 2. Distribution Patterns of Publication Years and Citation Frequencies. (A) Distribution of Publication Years; (B) Distribution of Citation Frequencies; (C) Relationship between Citations and Publication Year.

In 2019, the number of publications was relatively small, with only four documented studies, indicating the nascent stage of research in this area. However, the number more than doubled to nine in 2020, reflecting the field’s emerging development as more research communities began recognizing the transformative potential of transformer architectures and attention mechanisms for genome data analysis.

The year 2021 marked a significant breakthrough in this field, with 32 publications, more than three times as many as in 2020. This sudden surge can be attributed to the maturation of the methodologies and the growing acknowledgment of their utility and effectiveness in genome data interpretation.

In 2022, the research activity peaked with a record high of 59 publications, indicating a major turning point and signifying the field’s transition into a more mature phase. The proliferation of these techniques in genome data analysis could be attributed to their profound ability to handle large genomic datasets and generate meaningful biological insights.

In 2023, up until May, there have already been 20 publications, indicating a continued strong interest in the field. Despite being only partway through the year, the number of publications has reached approximately one-third of the total for 2022, suggesting that the momentum of research in this area is expected to continue.

The upward trend in the number of publications over the years signifies the growing acknowledgment and adoption of transformer architecture and attention mechanism techniques in genome data analysis. It underscores the importance of further research to leverage these promising deep learning methodologies for more advanced, precise, and insightful interpretation of complex genomic data.

3.4. Analysis of Citation Distribution

The citation distribution of the reviewed papers provides insightful data about their scholarly impact and recognition within the academic community. As depicted in Figure 2B, which illustrates the histogram of citations, and Figure 2C, which represents the correlation between the number of citations and the publication year of papers, there is a notable pattern in citation distribution.

The median number of citations is 2, and the mean is 9.7, suggesting a positively skewed distribution of citations. This skewness indicates that while most papers receive few citations, a minority of papers are highly cited, which considerably raises the mean. It is noteworthy that a large number of studies have not been cited yet, primarily because they have been recently published and have not had adequate time for review. This scenario underscores the significance of the present review, which aims to provide a thorough examination of these studies.

Considering the incomplete citation data for 2023, it is apparent that almost every paper published this year has not been cited yet, with a median citation count of zero. This observation aligns with the expected academic trend where newer publications generally have fewer citations due to the time lag inherent in the citation process.

However, earlier publications exhibit a higher citation count, signifying their broader impact and established status in the field. For instance, the median citation count for the papers published in 2019 and 2020 is 42 and 17, respectively. This shows a substantial scholarly impact, demonstrating that the topic reviewed here is of considerable interest and value to the research community.

In this regard, a few highly cited papers have made a particularly significant impact on the field. For example, the work by Armenteros et al. [ 141 ], which introduced TargetP 2.0, a state-of-the-art method to identify N-terminal sorting signals in proteins using deep learning, has garnered significant attention, with 333 citations to date. The attention layer of their deep learning model highlighted that the second residue in the protein, following the initial methionine, has a strong influence on classification, a feature not previously emphasized. This highlights how deep learning methods can generate novel insights into biological systems.

Another influential paper is the work by Manica et al. [ 137 ], which proposed a novel architecture for interpretable prediction of anticancer compound sensitivity using a multi-modal attention-based convolutional encoder. This work received 56 citations, and its predictive model significantly outperformed the previous state-of-the-art model for drug sensitivity prediction. The authors also provided a comprehensive analysis of the attention weights, further demonstrating the interpretability of the approach.

Lastly, the study by Angenent-Mari et al. [ 74 ], which used deep learning to predict the behavior of engineered RNA elements known as toehold switches, also stands out. With 50 citations, this work showed that DNNs trained on nucleotide sequences vastly outperformed previous models based on thermodynamics and kinetics.

These highly cited works underscore the transformative potential of deep learning methods, particularly those leveraging the transformer architecture and attention mechanisms, in enhancing our understanding of biological systems and in advancing predictive modeling in biomedicine. The citation distribution reflects the temporal dynamics of the field’s influence and the increasing recognition of deep learning with transformer architecture and attention mechanism techniques in genome data analysis. Further reviews and analyses of recent papers are required to stimulate discussion and increase their visibility and impact within the academic community.

4. Overview of Recent Studies in Transformer Architectures and Attention Mechanisms for Genome Data

4.1. Sequence and Site Prediction

In pre-miRNA prediction, Raad et al. [ 34 ] introduced miRe2e, a deep learning model based on transformers. The model demonstrated a ten-fold improvement in performance compared to existing algorithms when validated using the human genome. Similarly, Zeng et al. [ 38 ] introduced 4mCPred-MTL, a multi-task learning model coupled with a transformer for predicting 4mC sites across multiple species. The model demonstrated a strong feature learning ability, capturing better characteristics of 4mC sites than existing feature descriptors.

Several studies have leveraged deep learning for RNA–protein binding preference prediction. Shen et al. [ 35 ] developed a model based on a hierarchical LSTM and attention network which outperformed other methods. Du et al. [ 42 ] proposed a deep multi-scale attention network (DeepMSA) based on CNNs to predict the sequence-binding preferences of RNA-binding proteins (RBPs). Pan et al. [ 43 ] developed a deep learning model, CRMSNet, that combined CNN, ResNet, and multi-head self-attention blocks to predict RBPs for RNA sequences.

The work by Sun et al. [ 68 ] presents a deep learning tool known as PrismNet, designed for predicting RBP interactions, which are integral to RNA function and cellular regulation. This tool stands out as it was built to reflect the dynamic and condition-dependent nature of RBP–RNA interactions, in contrast to existing tools that primarily rely on RNA sequences or predicted RNA structures. The study proposed PrismNet by integrating experimental in vivo RNA structure data with RBP binding data from seven different cell types. This method enables accurate prediction of dynamic RBP binding across diverse cellular conditions.

An important aspect that distinguishes PrismNet is the application of an attention mechanism that identifies specific RBP-binding nucleotides computationally. The study found enrichment of structure-changing variants (termed riboSNitches) among these dynamic RBP-binding sites, potentially offering new insights into genetic diseases associated with dysregulated RBP bindings. Thus, PrismNet provides a method to access previously inaccessible layers of cell-type-specific RBP–RNA interactions, potentially contributing to our understanding and treatment of human diseases. Despite its merits, PrismNet also has potential limitations. For example, the effectiveness of PrismNet relies heavily on the quality and quantity of experimental in vivo RNA structure data and RBP-binding data. This dependence could limit its usefulness in scenarios where such extensive datasets are not available or are incomplete. Furthermore, while PrismNet uses an attention mechanism to identify exact RBP-binding nucleotides, interpreting these attention scores in the biological context may not be straightforward, requiring additional investigation or expertise.

Li et al. [ 36 ] proposed an ensemble deep learning model called m6A-BERT-Stacking to detect m6A sites in various tissues of three species. The experimental results demonstrated that m6A-BERT-Stacking outperformed most existing methods based on the same independent datasets. Similarly, Tang et al. [ 41 ] presented Deep6mAPred, a deep learning method based on CNN and Bi-LSTM for predicting DNA N6-methyladenosine sites across plant species.

For promoter recognition, Ma et al. [ 37 ] proposed a deep learning algorithm, DeeProPre. The model demonstrated high accuracy in identifying the promoter region of eukaryotes. Mai et al. [ 39 ] employed and compared the performance of popular NLP models, including XLNET, BERT, and DNABERT, for promoter prediction in freshwater cyanobacterium Synechocystis sp. PCC 6803 and Synechococcus elongatus sp. UTEX 2973.

In predicting RNA solvent accessibility, Huang et al. [ 45 ] proposed a sequence-based model using only primary sequence data. The model employed modified attention layers with different receptive fields to conform to the stem-loop structure of RNA chains. Fan et al. [ 62 ] proposed a novel computational method called M(2)pred for accurately predicting the solvent accessibility of RNA. The model utilized a multi-shot neural network with a multi-scale context feature extraction strategy.

To predict transcription factor binding sites, Bhukya et al. [ 56 ] proposed two models, PCLAtt and TranAtt. The model outperformed other state-of-the-art methods like DeepSEA, DanQ, TBiNet, and DeepATT in the prediction of binding sites between transcription factors and DNA sequences. Cao et al. [ 51 ] proposed DeepARC, an attention-based hybrid approach that combines a CNN and an RNN for predicting transcription factor binding sites.

Muneer et al. [ 57 ] proposed two deep hybrid neural network models, namely GCN_GRU and GCN_CNN, for predicting RNA degradation from RNA sequences. Addressing the same task, He et al. [ 52 ] introduced RNAdegformer, a model architecture that outperformed previous best methods at predicting degradation properties at nucleotide resolution for COVID-19 mRNA vaccines.

In the identification of pseudouridine (psi) sites, Zhuang et al. [ 44 ] developed PseUdeep, a deep learning framework for identifying psi sites in three species: H. sapiens, S. cerevisiae, and M. musculus. The model uses a modified attention mechanism with different receptive fields to conform to the stem-loop structure of RNA chains.

Zhang et al. [ 54 ] developed the Deep Attentive Encoder–Decoder Neural Network (D-AEDNet) to identify the location of transcription factor binding sites (TFBSs) in DNA sequences. In the prediction of miRNA–disease associations (MDAs), Xie et al. [ 61 ] presented PATMDA, a new computational method based on positive point-wise mutual information (PPMI) and an attention network. Liang et al. [ 59 ] developed a deep learning model, DeepEBV, to predict Epstein–Barr virus (EBV) integration sites; the model leverages an attention-based mechanism to learn local genomic features automatically.

Recent studies have shown a growing interest in utilizing attention mechanisms for analyzing genome data. Attention-based models have gained popularity due to their ability to capture informative patterns and long-range dependencies in genomic sequences. These models have been applied to various tasks, including sequence and site prediction, RNA-protein binding preference prediction, survival prediction, and identification of functional elements in the genome. The use of attention mechanisms in these studies has demonstrated improved performance and accuracy, highlighting the effectiveness of this approach in extracting meaningful information from genome data.

4.2. Gene Expression and Phenotype Prediction

Deep learning models have been extensively employed to predict gene expression and phenotypes, demonstrating significant improvements over traditional methods. These models have been particularly effective in capturing complex gene–gene and gene–environment interactions and integrating diverse types of genomic and epigenomic data.

A particularly noteworthy study in gene expression and phenotype prediction is that of Angenent-Mari et al. [ 74 ]. Their work explores the application of DNNs for the prediction of the function of toehold switches, which serve as a vital model in synthetic biology. These switches, engineered RNA elements, can detect small molecules, proteins, and nucleic acids. However, the prediction of their behavior has posed a considerable challenge—a situation that Angenent-Mari and colleagues sought to address through enhanced pattern recognition from deep learning.

The methodology employed by the authors involved the synthesis and characterization of a dataset comprising 91,534 toehold switches, spanning 23 viral genomes and 906 human transcription factors. The DNNs trained on these nucleotide sequences notably outperformed prior state-of-the-art thermodynamic and kinetic models in the prediction of the toehold switch function. Further, the authors introduced human-understandable attention-visualizations (VIS4Map) which facilitated the identification of successful and failure modes. The network architecture comprised MLP, CNN, and LSTM networks trained on various inputs, including one-hot encoded sequences and rational features. An ensemble MLP model was also proposed, incorporating both the one-hot encoded sequences and rational features.

The advantages of this method are manifold. The authors leveraged deep learning to predict the function of toehold switches, a task that had previously presented considerable challenges. The outperformance of prior state-of-the-art models is a testament to the efficacy of the proposed approach. Furthermore, the inclusion of VIS4Map attention-visualizations enhances the interpretability of the model, providing valuable insights into the model’s workings and facilitating the identification of areas of success and those that need improvement. Despite these significant strides, the methodology also bears certain limitations. The training process is computationally demanding, necessitating high-capacity hardware and graphic processing units which may not be accessible to all researchers. Furthermore, as with any model, the generalizability of this approach to other classes of RNA or DNA elements remains to be validated. It is also worth noting that while the model outperforms previous models, there is still considerable room for improvement, as the highest R-squared value achieved was 0.70, indicating that the model could explain 70% of the variability in the data.

A key area of focus has been the prediction of gene expression based on histone modifications. Lee et al. [ 70 ] developed Chromoformer, a transformer-based deep learning architecture considering large genomic windows and three-dimensional chromatin interactions. Similarly, Chen et al. [ 71 ] introduced TransferChrome, a model that uses a densely connected convolutional network and self-attention layers to aggregate global features of histone modification data. Liao et al. [ 73 ] also proposed a hybrid convolutional and bi-directional long short-term memory network with an attention mechanism for this task. These models have demonstrated their ability to predict gene expression levels based on histone modification signals accurately.

Several studies have also focused on predicting gene expression and phenotypes based on other genomic and epigenomic data types. For instance, Zhang et al. [ 69 ] developed T-GEM, an interpretable deep learning model for gene-expression-based phenotype predictions. Kang et al. [ 72 ] proposed a multi-attention-based deep learning model that integrates multiple markers to characterize complex gene regulation mechanisms. These models have shown their ability to integrate diverse data types and capture complex interactions, leading to improved prediction performance.

Several studies have also focused on the prediction of specific types of phenotypes. For instance, Lee et al. [ 79 ] proposed BP-GAN, a model that uses generative adversarial networks (GANs) combined with an attention mechanism for predicting RNA Branchpoints (BPs). These studies have shown the potential of deep learning models in predicting specific types of phenotypes.

Recent studies have focused on utilizing deep learning models with attention mechanisms to predict gene expression and phenotypes based on diverse genomic and epigenomic data. These models have shown improvements over traditional methods by capturing complex gene–gene and gene–environment interactions and integrating various data types. Specifically, attention-based models have been employed to predict gene expression levels using histone modification data, such as Chromoformer [ 70 ], TransferChrome [ 71 ], and a hybrid convolutional and bi-directional LSTM network [ 73 ]. Additionally, researchers have explored the prediction of specific phenotypes, such as toehold switch functions [ 74 ] and RNA Branchpoints [ 79 ], showcasing the versatility and potential of deep learning with attention mechanisms in gene expression and phenotype prediction.

4.3. ncRNA and circRNA Studies

The application of deep learning models, particularly those incorporating transformer architectures and attention mechanisms, has been extensively explored in the study of non-coding RNAs (ncRNAs) and circular RNAs (circRNAs). These models have shown promising results in predicting ncRNA-disease associations, lncRNA–protein interactions, and circRNA-RBP interactions, among other tasks.

Yang et al. [ 93 ] presented a novel computational method called iCircRBP-DHN that leverages a deep hierarchical network to distinguish circRNA–RBP-binding sites. The core of this approach is a combination of a deep multi-scale residual network and bidirectional gated recurrent units (BiGRUs) equipped with a self-attention mechanism. This architecture simultaneously extracts local and global contextual information from circRNA sequences. The study proposed two novel encoding schemes to enrich the feature representations. The first, KNFP (K-tuple Nucleotide Frequency Pattern), is designed to capture local contextual features at various scales, effectively addressing the information insufficiency issue inherent in conventional one-hot representation. The second, CircRNA2Vec, is based on the Doc2Vec algorithm and aims to capture global contextual features by modeling long-range dependencies in circRNA sequences. This method treats sequences as a language and maps subsequences (words) into distributed vectors, which contribute to capturing the semantics and syntax of these sequences. The effectiveness of iCircRBP-DHN was validated on multiple circRNAs and linear RNAs datasets, and it showed superior performance over state-of-the-art algorithms.

While iCircRBP-DHN exhibits several advantages, it also presents potential limitations. The method’s strengths include its ability to model both local and global contexts within sequences, its robustness against numerical instability, and its scalability, demonstrated by the performance on extensive datasets. However, the method’s performance is heavily reliant on the quality of sequence data and the effectiveness of the CircRNA2Vec and KNFP encoding schemes, which might not capture all nuances of circRNA–RBP interactions. While the self-attention mechanism can provide some insights into what the model deems important, it might not provide a full explanation of the reasoning behind the model’s predictions.

Several studies have focused on predicting lncRNA–disease associations. Liu et al. [ 83 ] developed a dual attention network model, which uses two attention layers, for this task, outperforming several latest methods. Similarly, Gao and Shang [ 89 ] proposed a new computational model, DeepLDA, which used DNNs and graph attention mechanisms to learn lncRNA and drug embeddings for predicting potential relationships between lncRNAs and drug resistance. Fan et al. [ 97 ] proposed GCRFLDA, a novel lncRNA–disease association prediction method based on graph convolutional matrix completion. Sheng et al. [ 98 ] developed VADLP, a model designed to predict lncRNA–disease associations using an attention mechanism. These models have demonstrated their ability to accurately predict lncRNA–disease associations, providing valuable insights into the roles of lncRNAs in disease development and progression.

In addition to predicting lncRNA–disease associations, deep learning models have also been used to predict lncRNA–protein interactions. Song et al. [ 84 ] presented an ensemble learning framework, RLF-LPI, for predicting lncRNA–protein interactions. Wekesa et al. [ 85 ] developed a graph representation learning method, GPLPI, for predicting plant lncRNA–protein interactions (LPIs) from sequence and structural information. These models have shown their ability to capture dependencies between sequences and structures, leading to improved prediction performance.

In the task of distinguishing circular RNA (circRNA) from other long non-coding RNA (lncRNA), Liu et al. [ 101 ] proposed an attention-based multi-instance learning (MIL) network. The model outperformed state-of-the-art models in this task.

Several studies have also focused on the prediction of circRNA–RBP interactions. Wu et al. [ 86 ] proposed an RBP-specific method, iDeepC, for predicting RBP-binding sites on circRNAs from sequences. Yuan and Yang [ 90 ] developed a deep learning method, DeCban, to identify circRNA–RBP interactions. Niu et al. [ 99 ] proposed CRBPDL, a calculation model that employs an Adaboost integrated deep hierarchical network to identify binding sites of circular RNA–RBP. These models have demonstrated their ability to accurately predict circRNA–RBP interactions, providing valuable insights into the roles of circRNAs in post-transcriptional regulation. Guo et al. [ 102 ] proposed a deep learning model, circ2CBA, for predicting circRNA–RBP-binding sites. The model achieved an AUC value of 0.8987, outperforming other methods in predicting the binding sites between circRNAs and RBPs.

In addition to predicting interactions, deep learning models have also been used to predict and interpret post-transcriptional RNA modifications and ncRNA families. Song et al. [ 91 ] presented MultiRM, a method for the integrated prediction and interpretation of post-transcriptional RNA modifications from RNA sequences. Chen et al. [ 92 ] developed ncDENSE, a deep-learning-model-based method for predicting and interpreting non-coding RNAs families from RNA sequences. These models have shown their ability to accurately predict and interpret RNA modifications and ncRNA families, providing valuable insights into the roles of these modifications and families in gene regulation.

Several studies have also focused on predicting circRNA–disease associations. Li et al. [ 96 ] proposed a method called GATGCN that utilizes a graph attention network and a convolutional graph network (GCN) to explore human circRNA–disease associations based on multi-source data. Wang et al. [ 95 ] proposed CDA-SKAG, a deep learning model for predicting circRNA–disease associations. Li et al. [ 94 ] introduced a deep learning model, GGAECDA, to predict circRNA–disease associations. These models have demonstrated their ability to accurately predict circRNA–disease associations, providing valuable insights into the roles of circRNAs in disease development and progression.

Recent studies have focused on utilizing deep learning models with transformer architectures and attention mechanisms for the analysis of ncRNAs and circRNAs. These models have shown promise in various tasks, including the prediction of ncRNA–disease associations, lncRNA–protein interactions, circRNA–RBP interactions, and the identification of RNA modifications and ncRNA families. The integration of attention mechanisms in these models has improved prediction accuracy and facilitated the interpretation of complex interactions and patterns in genomic data.

4.4. Transcription Process Insights

In recent years, deep learning, specifically attention mechanisms and transformer models, has been employed extensively in decoding the transcription process from genome data. Clauwaert et al. [ 103 ], Park et al. [ 108 ], and Han et al. [ 105 ] have proposed transformative models centered on transcription factor (TF)-binding site prediction and characterization.

As a specific example, Yan et al. [ 109 ] introduced an innovative deep learning framework for circRNA–RBP-binding site discrimination, referred to as iCircRBP-DHN (Integrative Circular RNA–RBP-binding sites Discrimination by Hierarchical Networks). They addressed common issues with previous computational models, such as poor scalability and numerical instability, and developed a method that amalgamates local and global contextual information via a deep multi-scale residual network and BiGRUs with a self-attention mechanism.

One of the key advantages of this approach is the fusion of two encoding schemes, CircRNA2Vec and the K-tuple nucleotide frequency pattern, which allows for the representation of different degrees of nucleotide dependencies, enhancing the discriminative power of feature representations. The robustness and superior performance of this method were evidenced through extensive testing on 37 circRNA datasets and 31 linear RNA datasets, where it outperformed other state-of-the-art algorithms.
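For intuition about the convolutional component, here is a minimal PyTorch sketch of a multi-scale residual block: parallel 1D convolutions at several kernel sizes whose outputs are projected back and added to the input. Channel counts and kernel sizes are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class MultiScaleResidualBlock(nn.Module):
    """Parallel 1D convolutions at several receptive-field sizes,
    fused by a 1x1 projection and a residual connection."""
    def __init__(self, channels=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        )
        self.project = nn.Conv1d(channels * len(kernel_sizes), channels, 1)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, length)
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(x + self.project(multi))

block = MultiScaleResidualBlock()
print(block(torch.randn(2, 64, 101)).shape)  # torch.Size([2, 64, 101])
```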

Clauwaert et al. [ 103 ] used a transformer-based neural network framework for prokaryotic genome annotation, primarily focusing on Escherichia coli. The study emphasized that a substantial part of the model’s subunits or attention heads were attuned to identify transcription factors and characterize their binding sites and consensus sequences. This method opened the door to understanding well-known and possibly novel elements involved in transcription initiation. Furthering the area of TF-binding site prediction, Park et al. [ 108 ] introduced TBiNet, an attention-based deep neural network model that quantitatively outperformed state-of-the-art methods and demonstrated increased efficiency in discovering known TF-binding motifs. This study aimed to augment the interpretability of TF-binding site prediction models, an aspect critical to comprehending gene regulatory mechanisms and identifying disease-associated variations in non-coding regions. Han et al. [ 105 ] proposed MAResNet, a deep learning method combining bottom-up and top-down attention mechanisms and a ResNet to predict TF-binding sites. The model’s robust performance on a vast test dataset reaffirmed the potency of attention mechanisms in capturing complex patterns in genomic sequences.

Another interesting application of deep learning is seen in the study by Feng et al. [ 104 ], where they developed a model, PEPMAN, that predicts RNA polymerase II pausing sites based on NET-seq data, which are data from a high-throughput technique used to precisely map and quantify nascent transcriptional activity across the genome. PEPMAN utilized attention mechanisms to decipher critical sequence features underlying the pausing of Pol II. Their model’s predictions, in association with various epigenetic features, delivered enlightening insights into the transcription elongation process.

Regarding RNA localization, Asim et al. [ 107 ] developed EL-RMLocNet, an explainable LSTM network for RNA-associated multi-compartment localization prediction, utilizing a novel GeneticSeq2Vec statistical representation learning scheme and an attention mechanism. This model surpassed the existing state-of-the-art predictor for subcellular localization prediction.

In predicting RBP-binding sites, Song et al. [ 110 ] proposed AC-Caps, an attention-based capsule network. The model achieved high performance, with an average AUC of 0.967 and an average accuracy of 92.5%, surpassing existing deep-learning models and proving effective in processing large-scale RBP-binding site data.

Tao et al. [ 106 ] presented a novel application in oncology; they developed an interpretable deep learning model, CITRUS, which inferred transcriptional programs driven by somatic alterations across different cancers. CITRUS utilized a self-attention mechanism to model the contextual impact of somatic alterations on TFs and downstream transcriptional programs. It revealed relationships between somatic alterations and TFs, promoting personalized therapeutic decisions in precision oncology.

Deep learning models with attention mechanisms and transformer architectures have emerged as powerful tools for gaining insights into the transcription process and decoding genome data. These models have been applied to various tasks, such as TF-binding site prediction and characterization. Many studies have proposed transformative models that utilize attention mechanisms to identify TFs, characterize their binding sites, and understand gene regulatory mechanisms. Additionally, deep learning models have been employed to predict RNA polymerase II pausing sites, RNA localization, RBP-binding sites, and transcriptional programs driven by somatic alterations in cancer. These studies highlight the effectiveness of attention mechanisms in capturing complex patterns in genomic sequences and providing valuable insights into the transcription process and gene regulation.

4.5. Multi-Omics/Modal Tasks

Exploring and integrating multi-omics and multi-modal data are substantial tasks in understanding complex biological systems. Deep learning methods, particularly attention mechanisms and transformer models, have seen profound advancements and deployments in this regard. Studies by Gong et al. [ 111 ], Kayikci and Khoshgoftaar [ 112 ], Ye et al. [ 113 ], and Wang et al. [ 115 ] have extensively utilized such methods for biomedical data classification and disease prediction.

In the study by Kang et al. [ 114 ], a comprehensive ensemble deep learning model for plant miRNA–lncRNA interaction prediction is proposed, namely PmliPEMG. This method introduces a fusion of complex features, multi-scale convolutional long short-term memory (ConvLSTM) networks, and attention mechanisms. Complex features, built using non-linear transformations of sequence and structure features, enhance the sample information at the feature level. By forming a matrix from the complex feature vector, the ConvLSTM models are used as the base model, which is beneficial due to their ability to extract and memorize features over time. Notably, the models are trained on three matrices with different scales, thus enhancing sample information at the scale level.

An attention mechanism layer is incorporated into each base model, assigning different weights to the output of the LSTM layer. This attentional layer allows the model to focus on crucial information during training. Finally, an ensemble method based on a greedy fuzzy decision strategy is implemented to integrate the three base models, improving efficiency and generalization ability. This approach exhibits considerable advantages. Firstly, the use of multi-level information enhancement ensures a more comprehensive understanding of the underlying data, increasing the robustness of the method. The greedy fuzzy decision enhances the model’s efficiency and overall generalization ability. Furthermore, the application of attention mechanisms allows the model to focus on the most informative features, improving predictive accuracy.
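The attention layer described above is essentially additive attention pooling over the recurrent outputs; a hedged sketch follows, with the scoring network and dimensions chosen for illustration rather than taken from PmliPEMG.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Score each time step of a recurrent output, softmax the scores,
    and return the attention-weighted sum as a sequence summary."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h):  # h: (batch, time, hidden)
        weights = torch.softmax(self.score(h), dim=1)  # (batch, time, 1)
        return (weights * h).sum(dim=1), weights

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
outputs, _ = lstm(torch.randn(4, 50, 16))
summary, weights = AttentionPooling(32)(outputs)
print(summary.shape)  # torch.Size([4, 32])
```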

Gong et al. [ 111 ] proposed MOADLN, a multi-omics attention deep learning network, which is adept at exploring correlations within and across different omics datasets for biomedical data classification. This methodology showcased its effectiveness in deep-learning-based classification tasks. Kayikci and Khoshgoftaar [ 112 ] proposed a gated attentive multi-modal deep learning model for predicting breast cancer by integrating clinical, copy number alteration, and gene expression data. It demonstrated the potential for significant improvements in breast cancer detection and diagnosis, suggesting better patient outcomes. Ye et al. [ 113 ] implemented a novel gene prediction method using a Siamese neural network, a deep learning architecture that employs twin branches with shared weights to compare and distinguish input samples, with a lightweight attention module for identifying ovarian cancer causal genes. This approach outperformed others in accuracy and effectiveness. Similarly, Wang et al. [ 115 ] proposed a deep neural network model that integrates multi-omics data to predict cellular responses to known anti-cancer drugs. It employs a novel graph embedding layer and an attention layer that efficiently combines different omics features, accounting for their interactions.

Chan et al. [ 116 ] proposed a deep neural network architecture combining structural and functional connectome data, which refers to the comprehensive mapping and analysis of neural connections within the brain, with multi-omics data for disease classification. They utilized graph convolution layers for the simultaneous modeling of functional Magnetic Resonance Imaging (fMRI) and Diffusion Tensor Imaging (DTI) data, which are neuroimaging techniques used to, respectively, measure blood flow changes and diffusion patterns within the brain; and separate graph convolution layers for modeling multi-omics datasets. An attention mechanism was used to fuse these outputs, highlighting which omics data contributed the most to the classification decision. This approach demonstrated a high efficacy in Parkinson’s disease classification using various combinations of multi-modal imaging data and multi-omics data.
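The fusion step can be sketched as learned attention weights over per-modality embeddings; the weights are inspectable, which is what lets such a model report which data type drove the classification. The dimensions and scoring function below are hypothetical.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Fuse embeddings from several modalities (e.g. fMRI, DTI, omics)
    with softmax attention weights, returning the weights as well."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, embeddings):  # (batch, n_modalities, dim)
        weights = torch.softmax(self.score(embeddings), dim=1)
        fused = (weights * embeddings).sum(dim=1)
        return fused, weights.squeeze(-1)

fusion = ModalityAttentionFusion(dim=128)
fused, weights = fusion(torch.randn(8, 3, 128))  # three modalities
print(weights[0])  # per-modality contribution for one sample
```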

These studies highlight the potential of attention mechanisms and transformer models in decoding complex biological systems and addressing multi-omics and multi-modal challenges in genomics research.

4.6. CRISPR Efficacy and Outcome Prediction

The efficacy and outcome prediction of CRISPR-Cas9 gene editing have significantly improved due to the development of sophisticated deep learning models. Several studies, including Liu et al. [ 118 ], Wan and Jiang [ 119 ], Xiao et al. [ 120 ], Mathis et al. [ 121 ], Zhang et al. [ 122 ], and Zhang et al. [ 123 ], have extensively used such models to predict CRISPR-Cas9 editing outcomes, single guide RNAs (sgRNAs) knockout efficacy, and off-target activities, enhancing the precision of gene editing technologies.

The research by Zhang et al. [ 123 ] introduced a novel method for predicting on-target and off-target activities of CRISPR/Cas9 sgRNAs. They proposed two deep learning models, CRISPR-ONT and CRISPR-OFFT, which incorporate an attention-based CNN to focus on the sequence elements most decisive for sgRNA efficacy. These models offer several key advantages. First, they utilize an embedding layer that applies k-mer encoding to transform sgRNA sequences into numerical values, allowing the CNN to extract feature maps; this technique has been demonstrated to outperform other methods in sequence analysis. Second, these models use attention mechanisms to improve both prediction power and interpretability, focusing on the elements of the input sequence most relevant to the output. This mirrors how RNA-guided Cas9 nucleases scan the genome, enhancing the realism of the model.
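The k-mer encoding step might look like the following: overlapping k-mers are mapped to integer ids that index a trainable embedding table feeding the CNN. The k value, embedding size, and toy guide sequence are assumptions for illustration.

```python
import torch
import torch.nn as nn

def kmer_ids(seq, k=3):
    """Encode a sequence as overlapping k-mer token ids in base 4."""
    base = {"A": 0, "C": 1, "G": 2, "T": 3}
    ids = []
    for i in range(len(seq) - k + 1):
        idx = 0
        for ch in seq[i:i + k]:
            idx = idx * 4 + base[ch]
        ids.append(idx)
    return torch.tensor(ids)

embed = nn.Embedding(num_embeddings=4 ** 3, embedding_dim=32)
tokens = kmer_ids("GACGATCGATCGGACGATCGTGG")  # toy 23-nt guide
x = embed(tokens).unsqueeze(0)  # (1, 21, 32), ready for a 1D CNN
print(x.shape)
```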

Liu et al. [ 118 ] presented Apindel, a deep learning model utilizing the GloVe model, a widely used unsupervised learning algorithm that captures semantic relationships between words by analyzing their global co-occurrence statistics within a large corpus. By integrating GloVe embeddings, positional encoding, and a BiLSTM with an attention mechanism, Apindel predicts CRISPR-Cas9 editing outcomes. It outperformed most advanced models in DNA mutation prediction and provided more detailed prediction categories. In the same vein, Wan and Jiang [ 119 ] introduced TransCrispr, a model combining transformer and CNN architectures for predicting sgRNA knockout efficacy in the CRISPR-Cas9 system. The model exhibited superior prediction accuracy and generalization ability when tested on seven public datasets.
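The positional-encoding component is presumably the standard sinusoidal scheme from the original transformer paper; a NumPy sketch is shown below, with the sequence length and dimension chosen arbitrarily.

```python
import numpy as np

def positional_encoding(length, dim):
    """Sinusoidal positional encoding: sines on even indices,
    cosines on odd indices, added to the token embeddings."""
    pos = np.arange(length)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

print(positional_encoding(length=60, dim=100).shape)  # (60, 100)
```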

Moreover, Xiao et al. [ 120 ] proposed AttCRISPR, an interpretable spacetime model for predicting the on-target activity of sgRNA in the CRISPR-Cas system. The model incorporated encoding-based and embedding-based methods using an ensemble learning strategy and achieved a superior performance compared to state-of-the-art methods. Notably, the model incorporated two attention modules, one spatial and one temporal, to enhance interpretability. Similarly, Liu et al. [ 117 ] developed an interpretable machine learning model for predicting the efficiency and specificity of the CRISPR-Cas system.

Mathis et al. [ 121 ] utilized attention-based bidirectional RNNs to develop PRIDICT, an efficient model for predicting prime editing outcomes. The model demonstrated reliable predictions for small-sized genetic alterations and highlighted the robustness of PRIDICT in improving prime editing efficiencies across various cell types.

In line with off-target activity prediction, Zhang et al. [ 122 ] presented a novel model, CRISPR-IP, for effectively harnessing sequence pair information to predict off-target activities within the CRISPR-Cas9 gene editing system. Their methodology integrated a CNN, a BiLSTM, and an attention layer, demonstrating superior performance compared to existing models.

Recent studies have made significant advancements in predicting the efficacy and outcomes of CRISPR-Cas9 gene editing using deep learning models. These models have demonstrated superior accuracy and performance in predicting CRISPR-Cas9 editing outcomes, sgRNA knockout efficacy, and off-target activities. The integration of attention mechanisms in these models has improved interpretability and provided valuable insights into the mechanisms of CRISPR-Cas9 gene editing.

4.7. Gene Regulatory Network Inference

The emergence of deep learning has revolutionized the inference of gene regulatory networks (GRNs) from single-cell RNA-sequencing (scRNA-seq) data, underscoring the utility of transformative machine learning architectures such as the attention mechanism and transformers. Prominent studies, including Lin and Ou-Yang [ 124 ], Xu et al. [ 125 ], Feng et al. [ 126 ], Ullah and Ben-Hur [ 127 ], and Xie et al. [ 128 ], have utilized these architectures to devise models for GRN inference, highlighting their superior performance compared to conventional methodologies.

The study by Ullah and Ben-Hur [ 127 ] presented a novel model, SATORI, for the inference of GRNs. SATORI is a Self-ATtentiOn-based model engineered to detect regulatory element interactions. SATORI leverages the power of deep learning through an amalgamation of convolutional layers and a self-attention mechanism. The convolutional layers, assisted by activation and max-pooling, process the input genomic sequences represented through one-hot encoding. The model further incorporates an optional RNN layer with long short-term memory units for temporal information capture across the sequence.

The multi-head self-attention layer in SATORI is its most pivotal component, designed to model dependencies within the input sequence irrespective of their relative distances. This feature enables the model to effectively capture transcription factor cooperativity. The model is tuned via a random search over hyperparameters and evaluated by the area under the ROC curve. One of the most distinctive features of SATORI is its ability to identify interactions between sequence motifs, contributing to its interpretability. It uses integrated gradients to calculate attribution scores for motifs in a sequence; changes in these scores after motif mutation can suggest potential interactions. In benchmarking experiments, SATORI demonstrated superior detection rates of experimentally validated transcription factor interactions compared to existing methods, without necessitating computationally expensive post-processing.
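Integrated gradients itself is a generic attribution method; a textbook sketch follows (straight-line path from a baseline, Riemann-sum approximation). Here `model` is a placeholder for any differentiable scorer over one-hot encoded sequences, and this is not SATORI's own code.

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Approximate integrated gradients of model(x) w.r.t. x along
    the straight-line path from a baseline (default: all zeros)."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        score = model(point).sum()
        grad, = torch.autograd.grad(score, point)
        total += grad
    return (x - baseline) * total / steps

# Usage: attr = integrated_gradients(model, onehot)  # onehot: (batch, 4, length)
```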

Lin and Ou-Yang [ 124 ] proposed DeepMCL, a model leveraging multi-view contrastive learning to infer GRNs from multiple data sources or time points. DeepMCL represented each gene pair as a set of histogram images and introduced a deep Siamese convolutional neural network with contrastive loss, a loss function commonly used in unsupervised or self-supervised learning tasks that encourages similar samples to be closer in the embedding space while pushing dissimilar samples farther apart; this allows the low-dimensional embedding for each gene pair to be obtained. Moreover, an attention mechanism was employed to integrate the embeddings extracted from different data sources and neighbor gene pairs.
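The contrastive loss referred to here is, in its classic pairwise form, the following (DeepMCL's multi-view variant adds components this sketch does not reproduce):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, label, margin=1.0):
    """Pull embeddings of similar pairs (label 1) together and push
    dissimilar pairs (label 0) at least `margin` apart."""
    distance = F.pairwise_distance(z1, z2)
    positive = label * distance.pow(2)
    negative = (1 - label) * F.relu(margin - distance).pow(2)
    return (positive + negative).mean()

z1, z2 = torch.randn(16, 64), torch.randn(16, 64)
label = torch.randint(0, 2, (16,)).float()
print(contrastive_loss(z1, z2, label))
```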

Similarly, Xu et al. [ 125 ] presented STGRNS, an interpretable transformer-based method for inferring GRNs from scRNA-seq data. The method leveraged the gene expression motif technique to convert gene pairs into contiguous sub-vectors, which then served as the input for the transformer encoder. Furthermore, Feng et al. [ 126 ] introduced scGAEGAT, a multi-modal model integrating graph autoencoders and graph attention networks for single-cell RNA-seq analysis, exhibiting a promising performance in gene imputation and cell clustering prediction.

Xie et al. [ 128 ] proposed MVIFMDA, a multi-view information fusion method for predicting miRNA–disease associations. The model employed networks constructed from known miRNA–disease associations and miRNA and disease similarities, processed with a graph convolutional network, followed by an attention strategy to fuse topology representation and attribute representations.

The successful application of deep learning, particularly attention mechanisms and transformer models, in GRN inference highlights its potential to enhance the precision of gene regulatory network predictions and other genetic analyses. These models have demonstrated superior performance and interpretability, outperforming conventional methods and providing valuable insights into gene regulation and disease mechanisms.

4.8. Disease Prognosis Estimation

Deep learning models with transformer architectures and attention mechanisms have seen significant utilization in estimating disease prognosis, demonstrating their efficacy in extracting meaningful patterns from complex genomic data. Trailblazing studies in this area include those conducted by Lee [ 129 ], Choi and Lee [ 130 ], Dutta et al. [ 131 ], Xing et al. [ 132 ], and Meng et al. [ 133 ].

Lee [ 129 ] introduced the Gene Attention Ensemble NETwork (GAENET), a model designed for prognosis estimation of low-grade glioma (LGG). GAENET incorporated a gene attention mechanism tailored for gene expression data, outperforming traditional methods and identifying HILS1 as the most significant prognostic gene for LGG. Similarly, Choi and Lee [ 130 ] proposed Multi-PEN, a deep learning model that utilizes multi-omics and multi-modal schemes for LGG prognosis. The model incorporated gene attention layers for each data type, such as mRNA and miRNA, to identify prognostic genes, showing robust performance compared to existing models.

The power of self-attention was highlighted by Dutta et al. [ 131 ] through their deep multi-modal model, DeePROG, designed to forecast the prognosis of disease-affected genes from heterogeneous omics data. DeePROG outperformed baseline models in extracting valuable features from each modality and leveraging them for prognosis prediction. On the other hand, Xing et al. [ 132 ] developed MLA-GNN, a multi-level attention graph neural network for disease diagnosis and prognosis. Their model formatted omics data into co-expression graphs and constructed multi-level graph features, achieving exceptional performance on transcriptomic data from The Cancer Genome Atlas datasets (TCGA-LGG/TCGA-GBM) and proteomic data from COVID-19/non-COVID-19 patient sera.
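Constructing a co-expression graph can be sketched with a simple correlation threshold, as below; MLA-GNN's actual pipeline is more elaborate (e.g. WGCNA-style weighting), so treat this as a conceptual stand-in.

```python
import numpy as np

def coexpression_graph(expr, threshold=0.6):
    """Build a binary gene-gene adjacency matrix by thresholding
    the absolute Pearson correlation of expression profiles."""
    corr = np.corrcoef(expr.T)  # expr: samples x genes
    adj = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)  # drop self-loops
    return adj

expr = np.random.rand(200, 50)  # 200 samples, 50 genes
adj = coexpression_graph(expr)
print(adj.shape, int(adj.sum()) // 2, "edges")
```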

In a distinct but related context, Meng et al. [ 133 ] introduced a novel framework called SAVAE-Cox for survival analysis of high-dimensional transcriptome data. The model incorporated a novel attention mechanism and fully leveraged an adversarial transfer learning strategy, outperforming state-of-the-art survival analysis models on the concordance index. Feng et al. [ 134 ] applied a deep learning model with an attention mechanism to survival prediction; the classifier accurately predicted survival, with the area under the receiver operating characteristic (ROC) curve and time-dependent ROC curves reaching 0.968 and 0.974 in the training set, respectively.

Taken together, these studies collectively highlight the potential of attention mechanisms in improving disease prognosis estimation, heralding a new paradigm in analyzing genomic data for prognostic purposes. Their efficacy across a range of disease types and data modalities signifies a promising avenue for future research in precision medicine.

4.9. Gene Expression-Based Classification

The implementation of deep learning models with transformer architectures and attention mechanisms has significantly improved the classification accuracy based on gene expressions, as presented in numerous studies by Gokhale et al. [ 135 ], Beykikhoshk et al. [ 136 ], Manica et al. [ 137 ], and Lee et al. [ 138 ].

Gokhale et al. [ 135 ] put forth GeneViT, a vision transformer method for classifying cancerous gene expressions; a vision transformer is a deep learning architecture that applies the principles of self-attention and transformer models to image-like data. This approach started with a dimensionality reduction step using a stacked autoencoder, followed by an improved DeepInsight algorithm, a method that transforms non-image data into images suitable for convolutional neural network architectures, and achieved a remarkable performance edge over existing methodologies in evaluations on ten benchmark datasets.

Similarly, in the quest to improve breast cancer subtype classification, Beykikhoshk et al. [ 136 ] introduced DeepTRIAGE. This deep learning architecture adopted an attention mechanism to assign each patient interpretable, individualized biomarker scores. Remarkably, DeepTRIAGE uncovered a significant association between the heterogeneity within luminal A biomarker scores and tumor stage.

In a different application, Manica et al. [ 137 ] crafted a novel architecture for the interpretable prediction of anti-cancer compound sensitivity. This model utilized a multi-modal attention-based convolutional encoder and managed to outstrip a baseline model trained on Morgan fingerprints (a molecular fingerprinting technique used in chemoinformatics to encode the structural information of molecules), a selection of encoders based on the Simplified Molecular Input Line Entry System (SMILES), and previously reported state-of-the-art methodologies for multi-modal drug sensitivity prediction.

Lee et al. [ 138 ] developed an innovative pathway-based deep learning model with an attention mechanism and network propagation for cancer subtype classification. The model incorporated graph convolutional networks to represent each pathway and a multi-attention-based ensemble model to amalgamate hundreds of pathways. The model demonstrated high classification accuracy in experiments with five TCGA cancer datasets and revealed subtype-specific pathways and biological functions, providing profound insights into the biological mechanisms underlying different cancer subtypes.

These studies highlight the effectiveness and innovative applications of attention mechanisms in genomic data analysis, offering new insights in precision medicine and oncology.

4.10. Proteomics

The utilization of deep learning, particularly the incorporation of transformer architectures and attention mechanisms in proteomics, has led to groundbreaking developments in the prediction of protein functionality, as depicted in the studies by Hou et al. [ 139 ], Gong et al. [ 140 ], Armenteros et al. [ 141 ], and Littmann et al. [ 142 ].

Hou et al. [ 139 ] constructed iDeepSubMito, a deep neural network model designed for the prediction of protein submitochondrial localization. This model employed an inventive graph embedding layer that assimilated interactome data as prior information for prediction. Additionally, an attention layer was incorporated for the integration of various omics features while considering their interactions. The effectiveness of this model was validated by its outperformance of other computational methods during cross-validation on two datasets containing proteins from four mitochondrial compartments.

Meanwhile, Gong et al. [ 140 ] proposed an algorithm, iDRO, aimed at optimizing mRNA sequences based on given amino acid sequences of target proteins. Their algorithm involved a two-step process consisting of open reading frame (ORF) optimization and untranslated region (UTR) generation. The former step used BiLSTM-CRF for determining the codon for each amino acid, while the latter step involved RNA-Bart for outputting the corresponding UTR. The optimized sequences of exogenous genes adopted the pattern of human endogenous gene sequences, and the mRNA sequences optimized by their method exhibited higher protein expression compared to traditional methods.

Armenteros et al. [ 141 ] showcased TargetP 2.0, a state-of-the-art machine learning model that identifies N-terminal sorting signals in peptides using deep learning. Their model emphasized the second residue’s significant role in protein classification, revealing unique distribution patterns among different groups of proteins and targeting peptides.

Littmann et al. [ 142 ] introduced bindEmbed21, a method predicting protein residues that bind metal ions, nucleic acids, or small molecules. This model leveraged embeddings from the transformer-based protein language model ProtT5, outperforming MSA-based predictions while using only single sequences. Homology-based inference further improved performance, and the method found binding residues in over 42% of all human proteins not previously implicated in binding.
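Extracting per-residue embeddings of the kind bindEmbed21 builds on might look like the following; the checkpoint name and the space-separated-residue preprocessing follow common ProtT5 usage on Hugging Face and are assumptions here, since bindEmbed21's exact pipeline may differ.

```python
# pip install torch transformers sentencepiece
import torch
from transformers import T5Tokenizer, T5EncoderModel

name = "Rostlab/prot_t5_xl_half_uniref50-enc"  # ProtT5 encoder
tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(name).eval()

seq = "M K T A Y I A K Q R"  # toy protein, residues space-separated
batch = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    emb = model(**batch).last_hidden_state  # (1, n_residues + 1, 1024)
per_residue = emb[0, : len(seq.split())]  # drop the trailing </s> token
print(per_residue.shape)  # torch.Size([10, 1024])
```

These studies demonstrate the significant potential of transformer architectures and attention mechanisms in deep learning models for precise protein functionality prediction.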

4.11. Cell-Type Identification

The application of transformer architectures and attention mechanisms in deep learning has brought significant progress to cell-type identification, demonstrating superior performance across various cell types, species, and sequencing depths, as evidenced in the studies by Song et al. [ 143 ], Feng et al. [ 144 ], Buterez et al. [ 145 ], and Zhang et al. [ 146 ].

Song et al. [ 143 ] developed TransCluster, a hybrid network structure that leverages linear discriminant analysis and a modified transformer for enhancing feature learning in single-cell transcriptomic maps. This method outperformed known techniques on various cell datasets from different human tissues, demonstrating high accuracy and robustness.

Feng et al. [ 144 ] proposed a directed graph neural network model named scDGAE for single-cell RNA-seq data analysis. By employing graph autoencoders and graph attention networks, scDGAE retained the connection properties of the directed graph and broadened the receptive field of the convolution operation. This model excelled in gene imputation and cell clustering prediction on four scRNA-seq datasets with gold-standard cell labels.

Furthermore, Buterez et al. [ 145 ] introduced CellVGAE, a workflow for unsupervised scRNA-seq analysis utilizing graph attention networks. This variational graph autoencoder architecture operated directly on cell connectivity for dimensionality reduction and clustering. Outperforming both neural and non-neural techniques, CellVGAE provided interpretability by analyzing graph attention coefficients, capturing pseudotime and NF-kappa B activation dynamics.
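Graph-attention layers of this kind are available off the shelf; a minimal PyTorch Geometric sketch with a toy cell graph is shown below, including retrieval of the per-edge attention coefficients that make such models interpretable. Sizes and connectivity are illustrative only.

```python
# pip install torch torch_geometric
import torch
from torch_geometric.nn import GATConv

conv = GATConv(in_channels=32, out_channels=16, heads=4)
x = torch.randn(100, 32)                      # 100 cells, 32 features
edge_index = torch.randint(0, 100, (2, 400))  # toy cell-cell edges
out, (edges, alpha) = conv(x, edge_index, return_attention_weights=True)
print(out.shape)    # torch.Size([100, 64]); 16 dims x 4 heads
print(alpha.shape)  # one weight per edge (incl. self-loops) per head
```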

Zhang et al. [ 146 ] showcased RefHiC, an attention-based deep learning framework for annotating topological structures from Hi-C, which is a genomic technique that measures the three-dimensional spatial organization of chromatin within the nucleus. Utilizing a reference panel of Hi-C datasets, RefHiC demonstrated superior performance across different cell types, species, and sequencing depths.

4.12. Predicting Drug–Drug Interactions

Recent studies have showcased remarkable progress in predicting drug–drug interactions (DDIs) through deep learning models incorporating transformer architectures and attention mechanisms, surpassing classical and other deep learning methods while highlighting significant drug substructures. Schwarz et al. [ 147 ] introduced AttentionDDI, a Siamese self-attention multi-modal neural network that integrates various drug similarity measures derived from drug characteristics. It demonstrated competitive performance compared to state-of-the-art DDI models on multiple benchmark datasets. Similarly, Kim et al. [ 148 ] developed DeSIDE-DDI, a framework that incorporates drug-induced gene expression signatures for DDI prediction. This model excelled with an AUC of 0.889 and an Area Under the Precision–Recall curve (AUPR) of 0.915, surpassing other leading methods in unseen interaction prediction.

Furthermore, Liu and Xie [ 149 ] proposed TranSynergy, a knowledge-enabled and self-attention transformer-boosted model for predicting synergistic drug combinations. TranSynergy outperformed existing methods and revealed new pathways associated with these combinations, providing fresh insights for precision medicine and anti-cancer therapies. Wang et al. [ 150 ] also developed a deep learning model, DeepDDS, for identifying effective drug combinations for specific cancer cells. It surpassed classical machine learning methods and other deep-learning-based methods, highlighting significant chemical substructures of drugs. Together, these studies highlight the utility of transformer architectures and attention mechanisms in predicting drug–drug interactions, paving the way for further advancements in the field.

4.13. Other Topics

Transformer architectures and attention mechanisms have also found applications in a variety of other genomic research topics. For instance, Yu et al. [ 151 ] developed IDMIL-III, an imbalanced deep multi-instance learning approach that accurately predicts genome-wide isoform–isoform interactions, and Yamaguchi and Saito [ 152 ] enhanced transformer-based variant effect prediction by proposing domain architecture (DA)-aware evolutionary fine-tuning protocols, computational methods that leverage evolutionary algorithms and the structural characteristics of protein domains to optimize and refine protein sequence alignments.

On the other hand, Zhou et al. [ 153 ] combined convolutional neural networks with transformers in a deep learning model, INTERACT, to predict the effects of genetic variations on DNA methylation levels. Cao et al. [ 154 ] presented DeepASmRNA, an attention-based convolutional neural network model, showing promising results for predicting alternative splicing events.

Gupta and Shankar [ 155 ] innovatively proposed miWords, a system that treats the genome as sentences composed of words, to identify pre-miRNA regions across plant genomes, achieving an impressive accuracy of 98%. Concurrently, Zhang et al. [ 156 ] developed iLoc-miRNA, a deep learning model employing BiLSTM with multi-head self-attention for predicting the location of miRNAs in cells, showing high selectivity for extracellular miRNAs.

Choi and Chae [ 157 ] introduced moBRCA-net, a breast cancer subtype classification framework, which significantly improved performance by integrating multiple omics datasets. These studies showcase the versatility and potential of transformer architectures and attention mechanisms in diverse genomic research contexts.

5. Discussion

In consideration of the existing literature, it is evident that deep learning models employing transformer architectures and attention mechanisms have shown promising results in analyzing genome data. However, challenges persist, and opportunities for future work are manifold.

5.1. Challenges

One of the principal challenges inherent in applying deep learning models to genomic data pertains to the complex structure of these data. Specifically, gene expression data are typically represented as high-dimensional vectors because of the number of genes captured in each sample during high-throughput sequencing. This representation poses a challenge for conventional data analysis and interpretation methods. Although some studies, such as those by Lee et al. [ 70 ] and Chen et al. [ 71 ], have made strides in this respect by proposing novel model architectures or preprocessing techniques, the high-dimensional nature of genomic data remains a challenge.

Another significant challenge is the limited availability of labeled data. In many tasks such as predicting lncRNA–disease associations or circRNA–RBP interactions, the amount of experimentally confirmed positive and negative associations is often insufficient for training deep learning models [ 83 , 86 ]. This can lead to models that are biased towards the majority class and, therefore, provide poor performance on the minority class.

The inherent complexity of biological systems also poses significant challenges. For instance, gene–gene and gene–environment interactions are complex and often non-linear, making them challenging to capture with standard deep learning models [ 72 , 74 ]. Furthermore, genomic and epigenomic data are often heterogeneous, consisting of diverse data types such as sequence data, gene expression data, and histone modification data. Integrating these diverse data types in a unified model can be challenging.

5.2. Future Work

One promising direction for future work is to develop novel model architectures that can effectively handle the high-dimensional nature of genomic data. This could involve designing models that can automatically extract relevant features from the data or leveraging techniques such as dimensionality reduction or feature selection. Moreover, the incorporation of biological prior knowledge into the design of these models could help guide the feature extraction process and lead to more interpretable models.

There is also a need for methods that can effectively deal with the limited availability of labeled data in genomics. One promising approach is to leverage unsupervised or semi-supervised learning techniques, which can make use of unlabeled data to improve model performance [ 158 , 159 , 160 ]. Transfer learning, where a model trained on a large dataset is fine-tuned on a smaller, task-specific dataset, could also be a promising approach for dealing with the scarcity of labeled data [ 161 , 162 , 163 ].
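A common transfer learning recipe, sketched below under generic assumptions, is to freeze a pretrained backbone and fine-tune only a small task head on the scarce labeled data.

```python
import torch.nn as nn

def freeze_backbone(model, head_name="classifier"):
    """Make only parameters of the task head trainable; everything
    else keeps its pretrained weights."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_name)

model = nn.Sequential()
model.add_module("backbone", nn.Linear(128, 64))   # stand-in for a
model.add_module("classifier", nn.Linear(64, 2))   # pretrained model
freeze_backbone(model)
print([n for n, p in model.named_parameters() if p.requires_grad])
# ['classifier.weight', 'classifier.bias']
```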

Addressing the complexity of biological systems could involve developing models that can capture the intricate interactions and non-linear relationships that are typical in biological systems. These models would need to be able to accommodate the heterogeneity of genomic and epigenomic data. Recent work by Kang et al. [ 72 ] and Liao et al. [ 73 ] points to the potential of multi-modal deep learning models in this regard. Further research is needed to develop and refine such models for various tasks in genomics.

Also, the incorporation of domain knowledge into the models could be another promising approach. By incorporating known biological mechanisms or relationships into the models, we could guide the learning process and make the learned representations more interpretable.

Finally, the emergence of transformer-based models, such as the GPT families, provides an exciting opportunity for future work. These models have shown great promise in natural language processing, and their ability to model long-range dependencies could be highly beneficial in genomics, where distant genomic elements often interact with each other. Therefore, adapting and applying these transformer-based models to genomic data is a promising direction for future work.

6. Conclusions

In the rapidly advancing landscape of bioinformatics, a comprehensive synthesis of the most recent developments and methodologies is essential. This review aims to provide an extensive examination of the transformative use of deep learning, specifically transformer architectures and attention mechanisms, in the analysis of genome data. The swift evolution of these computational strategies has significantly enhanced our capacity to process and decipher complex genomic data, marking a new epoch in the field.

The analysis presented herein, drawn from the most recent studies from 2019 to 2023, emphasizes the astounding versatility and superior performance of these deep learning techniques in a multitude of applications. From sequence and site prediction, through gene expression and phenotype prediction, to more complex multi-omics tasks and disease prognosis estimation, deep learning techniques have proven their potential in elucidating hidden patterns and relationships within genomic sequences. Furthermore, the application of transformer architectures and attention mechanisms has not only expedited computations but also improved accuracy and interpretability, ultimately driving the field forward.

Despite the remarkable advancements and successes recorded, it is important to note that the integration of deep learning in genome data analysis is still in its infancy. There remain several challenges and limitations to be addressed, particularly in improving the interpretability of these models and adapting them for use with smaller datasets, often encountered in the domain of genomics. Moreover, with the ever-growing complexity and scale of genomic data, there is a constant demand for even more advanced and efficient computational tools.

Through this review, we hope to provide a platform for researchers to engage with the latest advancements, familiarize themselves with the state-of-the-art applications, and identify potential gaps and opportunities for future exploration. This synthesis, encompassing a wide array of research topics and applications, demonstrates the immense potential and broad applicability of deep learning techniques in bioinformatics.

The integration of deep learning methodologies, particularly transformer architectures and attention mechanisms, into the bioinformatics toolkit has greatly facilitated our understanding of the ’language of biology’. These powerful computational techniques have proven to be an invaluable asset in unraveling the mysteries encoded within genomic sequences. As this research frontier continues to expand and evolve, we anticipate that the insights provided by this review will spur continued innovation and exploration, propelling us towards new discoveries in the dynamic world of genome data analysis.

Funding Statement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00251528).

Author Contributions

Conceptualization, S.R.C. and M.L.; formal analysis, S.R.C.; investigation, S.R.C.; writing—original draft preparation, S.R.C. and M.L.; writing—review and editing, S.R.C. and M.L.; visualization, S.R.C. and M.L.; supervision, M.L. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


Transformer Models in Healthcare: A Survey and Thematic Analysis of Potentials, Shortcomings and Risks

Open access | Published: 17 February 2024 | Volume 48, article number 23 (2024)


Kerstin Denecke, Richard May & Octavio Rivera-Romero


Large Language Models (LLMs) such as General Pretrained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT), which use transformer model architectures, have significantly advanced artificial intelligence and natural language processing. Recognized for their ability to capture associative relationships between words based on shared context, these models are poised to transform healthcare by improving diagnostic accuracy, tailoring treatment plans, and predicting patient outcomes. However, there are multiple risks and potentially unintended consequences associated with their use in healthcare applications. This study, conducted with 28 participants using a qualitative approach, explores the benefits, shortcomings, and risks of using transformer models in healthcare. It analyses responses to seven open-ended questions using a simplified thematic analysis. Our research reveals seven benefits, including improved operational efficiency, optimized processes and refined clinical documentation. Despite these benefits, there are significant concerns about the introduction of bias, auditability issues and privacy risks. Challenges include the need for specialized expertise, the emergence of ethical dilemmas and the potential reduction in the human element of patient care. For the medical profession, risks include the impact on employment, changes in the patient-doctor dynamic, and the need for extensive training in both system operation and data interpretation.


Introduction

Rapid advances in artificial intelligence (AI) technologies, including large language models (LLMs) and generative AI, have created new opportunities and challenges for healthcare. An LLM is a machine learning model that encodes complex patterns of language usage derived from large amounts of input text. LLMs can use neural network architectures, typically enhanced with a transformer attention mechanism that captures associative relationships between words based on shared context. These transformer models were first introduced in 2017 by Vaswani et al. [ 1 ] and have already significantly changed the landscape of natural language processing (NLP). Originally developed for language-related applications, transformer models, e.g. Bidirectional Encoder Representations from Transformers (BERT) or Generative Pre-trained Transformer (GPT), have shown remarkable capabilities in understanding and generating human language. They have proven highly successful in NLP for tasks such as machine translation [ 2 , 3 ], document summarization [ 4 ], document classification [ 5 ], named entity recognition [ 6 ] and medical question answering [ 7 ].
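For readers unfamiliar with the mechanism, the attention operation at the core of these models can be sketched in a few lines of NumPy; this is a toy single-head illustration, not any production implementation.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each token is re-represented as a
    similarity-weighted average of all tokens' value vectors."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

Q = K = V = np.random.rand(5, 8)  # 5 tokens, 8-dimensional
output, weights = attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))  # rows of weights sum to 1
```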

In previous work, we identified eight categories of use cases of transformer models. They include documentation and clinical coding, workflow and healthcare services, knowledge management, interaction support, patient education, health management, public health monitoring, and decision support [ 8 ]. Mesko discussed hypothetical future scenarios for LLMs, including remote patient diagnosis and surgical training. He highlighted the potential benefits of multimodal LLMs, such as processing different content types, overcoming language barriers, supporting interoperability in hospitals, analyzing scientific data with sentiment and context awareness, and supporting privacy protection [ 9 ]. Li et al. introduced a transformer-based algorithm that predicts the likelihood of conditions in a patient’s future visit to a hospital based on data from the electronic health record [ 10 ]. Overall, transformer models have shown significant performance gains in medical problem summarization [ 11 ] and clinical coding [ 12 ].

In view of possible use cases and encouraging results from research, it is highly relevant, at this early stage of the era of transformer models in healthcare, to reflect on their potentials, risks and shortcomings. Such reflection is necessary for the responsible design of applications. It will help in developing sustainable and efficient solutions that make use of this technology and truly improve healthcare outcomes by minimizing the risks. The research objective of this paper is therefore to identify the potentials, shortcomings and risks associated with the use of transformer models in healthcare by conducting a qualitative study with 28 participants. Additionally, we aim to assess what is needed for applications based on such models to be considered reliable. This knowledge will help in developing solutions that will be accepted by their users. Furthermore, the results will enable us to establish a research agenda for the development of applications based on transformer models. To the best of our knowledge, this is the first study to explore the opinions of researchers in the field of health NLP on the use of transformer models in the health sector. We are aware of research papers envisioning the future landscape of LLMs in medicine [ 9 , 13 ]; however, those papers essentially summarize their authors' ideas, whereas we conduct an online survey with qualitative analysis and base our results on a broader expert basis. Other papers assessed the potentials and risks of ChatGPT as a health application in an experimental manner [ 14 , 15 ]. We focus not on this commercial product, which was not specifically developed for healthcare purposes, but on the potentials and risks of applying the technology in tailored applications.

To achieve our goal, we conducted an online survey with qualitative analysis. It was distributed among researchers working in the field of NLP in healthcare. They were recruited via email from the IMIA Participatory Health and Social Media Working Group, the authors' peer networks, or by contacting researchers who were listed as corresponding authors in papers on transformer models in healthcare. Participants were given a brief definition of transformer models to ensure that all worked from the same definition and considered not only the currently popular OpenAI ChatGPT but also the underlying technology. The questionnaire included a series of demographic questions and seven open-ended questions: (1) What are the benefits of transformer models in healthcare? (2) Which shortcomings of applying transformer models in healthcare do you see? Which risks do you see for (3) the medical profession, (4) patient care, (5) health IT, and (6) data protection with regard to the adoption of transformer-based models in health IT? (7) When would you consider digital solutions based on transformer models to be reliable?

The questionnaire was open for three weeks from 10 April to 1 May 2023. No reminders were sent. All responses to the open-ended questions were analyzed by the authors using a simplified thematic analysis [ 16 ]. After the survey was administered, two authors (KD, OR) independently read the responses, familiarized themselves with them and grouped the responses into categories. Categories were checked for consistency and simplicity (themes included all coded factors (inclusive), and no two categories could be assigned to one response (exclusive)). Finally, suitable names and definitions were created for each category. The final groups were formed in discussion between the two authors (KD, OR). Conflicts were discussed with a third author (RM). To report the results of the survey, considering size restrictions, we followed the Checklist for Reporting Results of Internet E-Surveys (CHERRIES) [ 17 ] and the Consolidated criteria for Reporting Qualitative research (COREQ) checklist for qualitative studies [ 18 ]. A clarification of responsibility was submitted to the ethics committee of the Canton of Bern, which confirmed that no ethics approval was necessary for conducting the study as described above.

In this section, we summarize the demographics of the panel and the results of the thematic analysis. Quotes supporting the identified themes are available in Appendix 1.

Delphi Participant Panel

The panel consisted of 28 researchers (25% female, n = 7). An exact response rate cannot be provided, as we allowed the recruited participants to share the link to the survey with their networks. Our estimated response rate is 26.4%, since we directly contacted 44 persons and the IMIA Working Group mailing list comprises 78 e-mail addresses. Basic demographics are summarized in Table 1. A total of 10.7% reported being experts in transformer models, 25% used their basic functions regularly, 28.6% knew how they work, and 32.1% had tested OpenAI ChatGPT but had only basic knowledge of the underlying technology. One person had no knowledge of transformer models; we excluded this person's response for reasons of validity.

Benefits of Transformer Models in Healthcare

Seven themes were identified among the participants’ responses to the question regarding the potential of applying transformer models in healthcare applications (see Fig.  1 ):

A1: Increased efficiency and optimization of healthcare.

Transformer models can improve healthcare efficiency by accelerating diagnoses and automating tasks like triage, appointment scheduling, and clinical trial matching. This automation helps reallocate human resources to critical tasks, reducing their burden and workload.

A2: Quality improvement in documentation tasks.

Transformer models can improve clinical documentation by summarizing large amounts of information and tailoring the writing style for different readers, reducing the burden on healthcare professionals and improving documentation quality.

A3: Improvement of clinical communication.

Transformer models can improve clinical communication between health professionals and with patients by reducing errors and tailoring information to the language, cultural level or age of the recipient. They could also facilitate the collection of information from patients at a distance during initial contact or follow-up.

A4: Enhanced and improved clinical procedures.

Transformer models could improve healthcare processes through evidence-based decision making, accurate diagnoses through automated data analysis and prediction (e.g. “help in identifying patterns and predicting outcomes in healthcare data”), and automated generation of treatment plans (e.g. “develop more effective treatment plans”).

A5: Provision of personalized care.

Automatic data analysis using advanced algorithms enables the implementation of personalized medicine. In this regard, some participants pointed out that treatment and diagnosis can become personalized and preventive by transformer model-based systems.

A6: Improved access to data and knowledge.

Transformer models improve data access and processing for better knowledge creation, efficiently extracting relevant information from large, unstructured healthcare data. They also enable easier human-computer interactions, such as voice user interfaces to access information and knowledge.

A7: Increased individuals’ empowerment.

Transformer models in healthcare will empower individuals, patients, carers as well as health professionals, by supporting them through information provision and enhancing their knowledge as needed.

Figure 1. Identified benefits and shortcomings of the use of transformer models in healthcare

Shortcomings of Transformer Models in Healthcare

Six themes were identified among participants’ responses to the question regarding the potential shortcomings of the use of transformer models in healthcare (see Fig.  1 ):

B1: Quality of the transformer model-based systems.

This theme comprises two subthemes: system development aspects and erroneous system results. System development issues arise from data dependency, as the quality of transformer models is affected by biases in the training data, such as race and gender bias. Participants noted the need for high-quality, annotated data for training purposes, which is limited due to high annotation costs. The second subtheme, erroneous system results, involves risks from incorrect information provided by transformer models. Challenges include verifying information, dealing with errors or hallucinations and the lack of explainability and interpretability. These issues could harm patients and reduce health professionals’ trust and acceptance of these models. Participants emphasized the importance of testing transformer models in healthcare and real-world scenarios to ensure reliability.

B2: Compliance with regulations, data privacy and security.

Transformer model-based systems must comply with privacy regulations and protect the privacy of sensitive health data, particularly from potential third-party access and misuse.

B3: Human factors.

This theme relates to the health professionals who are expected to use systems based on transformer models. Issues include the need for human expertise to judge the results and their accuracy, overreliance, carelessness and the underdevelopment of skills.

B4: Reduced integration into healthcare.

The theme concerns the reduced integration of transformer model-based systems into healthcare workflows and challenges related to their uptake and use. Participants identified the increased complexity of care caused by the proliferation of information, including that generated by transformer model-based systems, as a key challenge to adoption and use by healthcare professionals.

B5: Ethical concerns.

Biased training data could exacerbate health inequalities, and the need for technical resources and professional training, which are not uniformly available across health centers, could further contribute to inequalities.

B6: De-humanization of care.

Transformer models could affect the doctor-patient relationship by reducing interaction and increasing de-humanization. The automation of care processes could also make patients feel treated as numbers.

Figure 2: Identified risks of the use of transformer models in health

Risks Associated with the Use of Transformer Models in Healthcare

We asked the participants to reflect on the risks of the use of transformer models in healthcare from different perspectives: risks for patient care, for the medical profession, for health IT and for data protection. The results are summarized in the following.

Risks for Patient Care

We identified six categories of risks for patient care associated with the usage of transformer models in healthcare applications (see Fig.  2 ):

C1: Untrusted, inaccurate or biased information.

When used to provide clinical decision support, transformer models may lack accuracy or require verification, leading to the risk of misdiagnosis or incorrect treatment. The increasing availability of such models could lead to the use of unreliable or untested systems by health professionals, patients or carers, potentially causing harm.

C2: Misuse of transformer model-based systems.

A major concern was over-reliance on these systems by both patients and professionals, potentially undermining patients’ self-management and decision-making skills in the care process. To mitigate this, participants emphasized the need for patient education on responsible use and correct interpretation of results from transformer model-based systems.

C3: Impact on the patient-doctor relationship.

The patient-doctor relationship, normally based on trust, empathy, respect and continuity, could be compromised by overreliance on diagnoses or treatment suggestions from digital systems. Some participants noted that an excessive focus on these digital technologies by healthcare professionals could worsen interpersonal relationships with patients. Patients could perceive this overreliance negatively, feeling that digital solutions are replacing doctors and resulting in a de-humanization of healthcare. One participant commented that this deterioration in relationships could even extend to institutions, leading patients to underestimate and distrust the healthcare system.

C4: Liability in case of errors and misuse.

The issue of liability is a major concern in relation to the risk of misdiagnosis and mistreatment. In cases where systems malfunction or fail, determining responsibility remains an unresolved challenge.

C5: Bias and inequity.

Systems based on transformer models, which are often trained on biased data, could exacerbate health inequalities. Factors such as low literacy, accessibility issues and socio-economic status pose barriers to patient use of these solutions.

C6: Data privacy and security.

Participants identified privacy and security risks in patient care (e.g. data breaches or unauthorized access to data) and emphasized that personal health information, especially sensitive data, is protected by law and is essential for a trusting patient-doctor relationship. They agreed that the processing of patient data by transformer model-based systems could lead to violations of patient rights.

Risks for the Medical Profession

We identified several risks for the medical profession (see Fig.  2 ):

D1: Need for training on new competences, and loss of skills.

This category concerns overconfidence, overreliance, undervaluation, the need for specific education and training for health professionals, and the erosion of clinical skills and confidence in quality. Participants stressed the importance of training professionals to understand and correctly use and interpret the results of these systems, not to overrely or undervalue their results, and highlighted concerns about confidence in their quality and effectiveness. Health professionals need to learn when to trust the system versus their own expertise. Finally, there is concern that reliance on these systems could undermine critical thinking skills.

D2: Impact on the patient-doctor relationship.

The negative impact on the patient-doctor relationship is a key issue regarding the risks of using transformer models in medicine. Participants agreed that these systems could reduce patient-doctor communication, potentially leading to a loss of patient trust and weakening the patient-doctor relationship.

D3: Unintended consequences.

The use of transformer models in healthcare can lead to unintended consequences, such as incorrect diagnoses and inappropriate treatment plans, often due to incorrect model outputs or an overestimation of the models’ capabilities.

D4: Legal, liability and ethical concerns.

Participants identified and discussed potential legal and ethical issues in the use of transformer models in healthcare, including privacy, data security and patient autonomy. Concerns were also raised about the liability of healthcare professionals for errors or misuse of these systems.

D5: Impact on jobs.

The introduction of transformer models in healthcare could have an impact on jobs: creating new roles, changing existing roles and possibly leading to job losses in medical professions.

Risks for Health IT

In the following, the identified risks for health IT are described (see Fig.  2 ).

E1: Need for resources to develop and integrate transformer models in healthcare systems.

Participants highlighted the need for multiple resources to develop, deploy, integrate and maintain transformer models in healthcare. They found the integration of these systems into existing health IT infrastructures to be particularly challenging. Concerns included development, integration and operational costs, which could exacerbate inequalities due to financial constraints in healthcare institutions. Lack of reimbursement models and time constraints were also significant factors. The need for specialized human resources and expert development of these systems was emphasized, and the risk of their unavailability was noted. In addition, specific training was considered essential for the effective uptake and use of transformer model-based systems.

E2: Complex regulatory situation and legal issues.

Complex regulations in different countries, such as medical device regulations, the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), already pose challenges for the health IT sector, and even more regulation is needed. The adoption of transformer models in health IT raises issues around intellectual property, patents and licensing, potentially hindering collaboration, knowledge sharing and industry adoption, and increasing the risk of litigation. Despite their potential to advance medical research, diagnosis and treatment, challenges remain in the ownership and licensing of these models. In addition, determining liability and responsibility for misdiagnosis and mistreatment due to incorrect system outputs remains a pressing issue.

E3: Quality of solutions.

Participants identified quality issues related to transformer models, including the quality of information, data, models, validation and evaluation. They emphasized the importance of the quality of system results, noting that inaccurate, inappropriate or confusing information could lead to unintended consequences. The quality of systems was linked to training data, with concerns about the use of models outside their training context. Despite recognizing the need for high quality systems to prevent patient harm, participants found it challenging to evaluate and validate transformer models due to the lack of standardized evaluation frameworks. They also noted that competitive pressures to develop and market new tools could compromise system quality.

E4: Data privacy and security.

Transformer models handle large amounts of sensitive data, which contributes to associated security and cybersecurity risks.

E5: Ethical aspects.

Participants reported ethical concerns related to the use, development, and training of transformer models as important factors to consider.

Risks for Data Protection

Participants’ answers to the question on risks related to data protection resulted in three categories of topics (see Fig.  2 ):

F1: Unauthorized exposure of data.

The use of transformer models in healthcare could lead to confidentiality issues, including unauthorized data disclosure, breaches of privacy regulations, data leakage, and insecure data storage and transmission.

F2: De-identification and anonymization.

Participants raised concerns about de-identification and anonymization in transformer models, noting the risk of exposing sensitive data and the use of weak anonymization techniques that reduce the trustworthiness of these systems.

F3: Data governance.

There are risks of a lack of transparency, and a need for clear descriptions of how transformer model-based systems handle patient data. Concerns have also been raised about inadvertent disclosure of medical data to third parties during development, which poses privacy and security risks.

Reliability of Health Systems Based upon Transformer Models

The free text answers to the question “When would you consider digital solutions based on transformer models to be reliable?” revealed three groups of aspects:

G1: Supervised and transparent use.

Participants emphasized that the reliability of transformer model-based systems can increase when a human is involved. The ability to interpret and repeat results is key to reliability. The systems should explain how the model arrived at its results. Their use should be made transparent to patients.

G2: Data integrity and generalizability.

Data quality, particularly in terms of diversity and representativeness of the target population and health context, was considered critical for reliability. Participants also identified generalizability as a key factor in the real-world applicability of transformer models.

G3: System quality.

This theme covers aspects such as output, outcome, model quality, regulatory compliance, accuracy, efficiency, effectiveness, robustness, resilience, bias minimization and fairness. Key issues include compliance with security and privacy regulations, accuracy through validation and testing, and the importance of effectiveness and efficiency for reliability. Robustness and resilience of models are seen as critical, and minimizing bias and ensuring fairness are also essential for system reliability.

Principal Results

This study examined opinions of researchers in the field of NLP in healthcare on the benefits, shortcomings and risks of applying transformer models in healthcare. Benefits include increased efficiency, process optimization, improved clinical documentation, better communication, automation of routine tasks and better decision making, as well as better data handling and patient empowerment. However, there are concerns about potential bias, auditability and privacy. Challenges include the need for expertise, ethical dilemmas and potential de-humanization of care. Specific risks for the medical profession include the impact on jobs, changes in the patient-doctor relationship, and the need for training in system use and data interpretation, with an anticipated loss of skills for both health professionals and patients.

Relation to Other Work

Studies of NLP tasks using transformer models are consistent with participants’ views of potential improvements in documentation tasks. These models have shown promise in areas such as radiation oncology [19], medical problem summarization [11] and clinical coding [12], and offer potential for text summarization, efficient writing and multilingual communication [20]. This potential for a positive impact on the efficiency and optimization of healthcare tasks is supported by Thirunavukarasu et al., who concluded that “studies are needed to ensure that LLM tools actually reduce workload rather than introducing an even greater administrative burden for healthcare” [21]. Given the early stage of development of digital health solutions based on transformer models, there is little evidence from studies to show the efficiency gains achieved by such solutions. However, there are significant concerns about misinformation from LLMs, as highlighted by participants and by researchers such as Eggmann et al. [20] and De Angelis et al. [22].

Re-identification was considered a significant risk by participants. However, they did not distinguish between contexts, such as rare conditions, in which this risk may differ. Shortcomings such as model quality, privacy, security, ethical issues and human factors are also recognized in the literature [23]. Reddy et al. proposed an evaluation framework for the application of LLMs in healthcare to address these risks [24].

We found dependencies between different aspects, such as system errors and liability. If transformer models produce wrong information and cause (wrong or unnecessary) patient treatment, this not only poses risks to patient care but also raises liability concerns and would have an economic impact. We argue that the “human in the loop” approach offers a valuable layer of supervision and verification that serves as a key link to mitigate these concerns. Ahmad et al. also argue for human involvement to validate the results of LLM-based systems and prevent patient harm [25].

Legal regulations, such as the GDPR, HIPAA and the ISO/IEC 27000 series, are of major importance in ensuring the responsible use of applications in healthcare. Meskó and Topol argue for regulatory oversight to ensure that medical professionals and patients can use transformer-model-based systems without causing harm or compromising their data or privacy [26]. Their practical recommendations include creating a “new regulatory category for LLMs as those are distinctively different from AI-based medical technologies that have gone through regulation already”. However, it is also worth discussing the balance between regulation and innovation. Finding a proper balance is important (albeit highly complex) to promote the adequate development and deployment of new technologies while maintaining the trust and privacy of patients. To avoid hampering innovation, we recommend responsible design and development that includes reflection on possible risks in the early stages of solution design. Several tools supporting this have been developed recently, e.g. the risk assessment canvas for digital therapeutics [27] or the digital ethics canvas [28]. In addition, Harrer proposed a comprehensive framework for the responsible design, development and use of LLM-based systems [29]. This framework focuses on ensuring fairness, accountability, privacy, transparency and alignment with values and purposes, reflecting key aspects identified in the survey. This approach emphasizes the need for careful consideration of ethical, technical and cultural issues in the development and use of LLMs in healthcare.

Additionally, efforts are underway to address biases in transformer models, as exemplified by Mittermaier et al.’s strategy for mitigating bias in surgical AI systems [30]. These initiatives are critical to improving the accuracy and fairness of healthcare supported by transformer model-based systems [30]. The proliferation of digital health has removed certain barriers in healthcare and reduced some disparities. However, the use of these technologies has also given rise to new factors affecting health equity. Despite this being a highly relevant topic, participants did not mention any specific health disparity considerations. There is an urgent need for standardized evaluation frameworks, evaluation standards and metrics to ensure that these models meet essential requirements such as accuracy, effectiveness and reliability. This is in line with the work of Guo et al., who highlight that LLMs can potentially leak private data or produce inappropriate, harmful or misleading content [31]. Guo et al. acknowledged the importance of evaluating LLMs from multiple perspectives, including knowledge and skills, alignment, and security [31]. The risk of de-humanization can also be viewed differently: de-humanization could have a positive impact on patient care by reducing the shame that occurs in human-to-human communication, thereby better promoting and protecting important medical values [32].

Research Agenda

As indicated at the beginning, one objective of this study was to derive a research agenda for the development of applications based on transformer models in healthcare. For successful real-world application, a comprehensive approach is necessary, including:

Responsible design: Considering ethical and other risks during development to create solutions that mitigate these issues.

Utilizing real-world data: Evaluating model quality and performance using authentic, diverse healthcare data for a realistic assessment of capabilities.

Testing and integration: Rigorous testing and seamless integration into health IT systems and workflows to ensure practicality and effectiveness in clinical settings.

Education and training: Providing education and training for patients and health professionals to improve interaction with transformer-based systems [ 33 ].

Continuous risk assessment: Ongoing evaluation of potential risks and shortcomings during the design and development process.

Post-marketing monitoring procedures: Implementing robust post-marketing surveillance to ensure patient safety, quality, transparency, and ethics, addressing challenges and risks over time [34].

Limitations

The study’s participants, mainly from computer science, health informatics and medicine, were predominantly affiliated with academic institutions, mainly in Europe, with only a third coming from regions such as Australia/Oceania and North America. This skewed representation may affect the applicability of the study, especially given Europe’s established healthcare systems and strict privacy regulations. This demographic imbalance could limit the relevance of the findings in areas without similar regulatory, economic and infrastructural contexts, impacting the adoption and use of transformer models. In addition, while most participants had experience in health informatics, only about a third had specific experience with transformer models, mostly limited to testing OpenAI’s ChatGPT. This lack of extensive knowledge of transformer models could affect the reliability of their assessments. The selection of participants based on publication records and involvement in a working group introduced a selection bias. To reduce bias in the thematic analysis, it was conducted by two independent researchers.

Another limitation of this study is that participants’ responses were sometimes not comprehensive enough to extract sufficient detail. Therefore, some of the items listed above remain vague. For example, no specific aspects were mentioned regarding where professionals would need training (item B5). Data privacy and security issues were identified as potential risks of using LLMs in healthcare; some examples of these risks were mentioned, but a deeper analysis should be undertaken in further research. As this is a qualitative study with time limitations, some themes were not addressed in depth.

Conclusions

Transformer models and LLMs have the power to transform healthcare systems and processes. They offer remarkable advances in diagnosis, treatment, communication, clinical documentation and workflow management. These models contribute to personalized care, increase patient empowerment, and improve access to data and medical knowledge. However, these technologies also pose various risks and limitations, which can be broadly classified into three categories: data-related issues, system use and its impact, and system quality and regulatory concerns. From an economic perspective, there is a need to establish training programmes, and a potential shift in the employment landscape within the healthcare sector is anticipated.

A number of considerations are critical to the reliable application of these models:

Human-in-the-loop systems to ensure oversight and accountability.

Transparency in explaining the results of these models.

Ensuring high quality data.

Maintaining robust system quality, including reliability and accuracy.

Compliance with regulatory standards.

In summary, the integration of transformer models in healthcare offers significant potential for innovation and improvement. However, it requires a careful and multi-faceted strategy to ensure its safe and effective implementation. By following these considerations for reliable applications, we can harness the transformative power of these technologies while maintaining the highest standards of patient care and well-being in the dynamically evolving healthcare technology landscape.

Appendix 1: Quotes of the participants’ responses for the identified themes.

Data Availability

No datasets were generated or analysed during the current study.

References

A. Vaswani et al., ‘Attention is All you Need’, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2017. Accessed: Jun. 18, 2023. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

Q. Wang et al., ‘Learning Deep Transformer Models for Machine Translation’, 2019, doi: https://doi.org/10.48550/ARXIV.1906.01787.

W. Wang, Z. Yang, Y. Gao, and H. Ney, ‘Transformer-Based Direct Hidden Markov Model for Machine Translation’, in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop , Online: Association for Computational Linguistics, 2021, pp. 23–32. doi: https://doi.org/10.18653/v1/2021.acl-srw.3 .

G. Moro, L. Ragazzi, L. Valgimigli, G. Frisoni, C. Sartori, and G. Marfia, ‘Efficient Memory-Enhanced Transformer for Long-Document Summarization in Low-Resource Regimes’, Sensors , vol. 23, no. 7, p. 3542, Mar. 2023, doi: https://doi.org/10.3390/s23073542 .

X. Dai, I. Chalkidis, S. Darkner, and D. Elliott, ‘Revisiting Transformer-based Models for Long Document Classification’. arXiv, Oct. 25, 2022. Accessed: Feb. 03, 2024. [Online]. Available: http://arxiv.org/abs/2204.06683

A. Gillioz, J. Casas, E. Mugellini, and O. A. Khaled, ‘Overview of the Transformer-based Models for NLP Tasks’, presented at the 2020 Federated Conference on Computer Science and Information Systems, Sep. 2020, pp. 179–183. doi: https://doi.org/10.15439/2020F20 .

X. Yang et al., ‘GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records’, 2022, doi: https://doi.org/10.48550/ARXIV.2203.03540.

K. Denecke, R. May, and O. Rivera Romero, ‘How Can Transformer Models Shape Future Healthcare: A Qualitative Study’, in Studies in Health Technology and Informatics , M. Giacomini, L. Stoicu-Tivadar, G. Balestra, A. Benis, S. Bonacina, A. Bottrighi, T. M. Deserno, P. Gallos, L. Lhotska, S. Marceglia, A. C. Pazos Sierra, S. Rosati, and L. Sacchi, Eds., IOS Press, 2023. doi: https://doi.org/10.3233/SHTI230736 .

B. Meskó, ‘The Impact of Multimodal Large Language Models on Health Care’s Future’, J. Med. Internet Res., vol. 25, p. e52865, Nov. 2023, doi: https://doi.org/10.2196/52865 .

Y. Li et al., ‘BEHRT: Transformer for Electronic Health Records’, Sci. Rep., vol. 10, no. 1, p. 7155, Apr. 2020, doi: https://doi.org/10.1038/s41598-020-62922-y.

Y. Gao, T. Miller, D. Xu, D. Dligach, M. M. Churpek, and M. Afshar, ‘Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models’, Proc. COLING Int. Conf. Comput. Linguist , vol. 2022, pp. 2979–2991, Oct. 2022.

I. Coutinho and B. Martins, ‘Transformer-based models for ICD-10 coding of death certificates with Portuguese text’, J. Biomed. Inform., vol. 136, p. 104232, Dec. 2022, doi: https://doi.org/10.1016/j.jbi.2022.104232 .

J. Clusmann et al., ‘The future landscape of large language models in medicine’, Commun. Med., vol. 3, no. 1, p. 141, Oct. 2023, doi: https://doi.org/10.1038/s43856-023-00370-1.

M. Cascella, J. Montomoli, V. Bellini, and E. Bignami, ‘Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios’, J. Med. Syst., vol. 47, no. 1, p. 33, Mar. 2023, doi: https://doi.org/10.1007/s10916-023-01925-4 .

X. Wang et al., ‘ChatGPT: promise and challenges for deployment in low- and middle-income countries’, Lancet Reg. Health - West. Pac., vol. 41, p. 100905, Dec. 2023, doi: https://doi.org/10.1016/j.lanwpc.2023.100905.

V. Braun and V. Clarke, ‘Using thematic analysis in psychology’, Qual. Res. Psychol , vol. 3, no. 2, pp. 77–101, Jan. 2006, doi: https://doi.org/10.1191/1478088706qp063oa .

G. Eysenbach, ‘Improving the Quality of Web Surveys: The Checklist for Reporting Results of Internet E-Surveys (CHERRIES)’, J. Med. Internet Res., vol. 6, no. 3, p. e34, Sep. 2004, doi: https://doi.org/10.2196/jmir.6.3.e34 .

A. Tong, P. Sainsbury, and J. Craig, ‘Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups’, Int. J. Qual. Health Care , vol. 19, no. 6, pp. 349–357, Sep. 2007, doi: https://doi.org/10.1093/intqhc/mzm042 .

J. Y. Luh, R. F. Thompson, and S. Lin, ‘Clinical Documentation and Patient Care Using Artificial Intelligence in Radiation Oncology’, J. Am. Coll. Radiol , vol. 16, no. 9, pp. 1343–1346, Sep. 2019, doi: https://doi.org/10.1016/j.jacr.2019.05.044 .

F. Eggmann, R. Weiger, N. U. Zitzmann, and M. B. Blatz, ‘Implications of large language models such as ChatGPT for dental medicine’, J. Esthet. Restor. Dent , vol. 35, no. 7, pp. 1098–1102, Oct. 2023, doi: https://doi.org/10.1111/jerd.13046 .

A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting, ‘Large language models in medicine’, Nat. Med , vol. 29, no. 8, pp. 1930–1940, Aug. 2023, doi: https://doi.org/10.1038/s41591-023-02448-8 .

L. De Angelis et al., ‘ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health’, Front. Public Health, vol. 11, p. 1166120, Apr. 2023, doi: https://doi.org/10.3389/fpubh.2023.1166120.

S. Reddy, ‘Evaluating large language models for use in healthcare: A framework for translational value assessment’, Inform. Med. Unlocked, vol. 41, p. 101304, 2023, doi: https://doi.org/10.1016/j.imu.2023.101304 .

S. Reddy et al., ‘Evaluation framework to guide implementation of AI systems into healthcare settings’, BMJ Health Care Inform., vol. 28, no. 1, p. e100444, Oct. 2021, doi: https://doi.org/10.1136/bmjhci-2021-100444.

M. Ahmad, I. Yaramic, and T. D. Roy, ‘Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI’, Computer Science and Mathematics, preprint, Oct. 2023. doi: https://doi.org/10.20944/preprints202310.1662.v1 .

B. Meskó and E. J. Topol, ‘The imperative for regulatory oversight of large language models (or generative AI) in healthcare’, Npj Digit. Med., vol. 6, no. 1, p. 120, Jul. 2023, doi: https://doi.org/10.1038/s41746-023-00873-0 .

K. Denecke, R. May, E. Gabarron, and G. H. Lopez-Campos, ‘Assessing the Potential Risks of Digital Therapeutics (DTX): The DTX Risk Assessment Canvas’, J. Pers. Med., vol. 13, no. 10, p. 1523, Oct. 2023, doi: https://doi.org/10.3390/jpm13101523 .

C. Hardebolle, V. Macko, V. Ramachandran, A. Holzer, and P. Jermann, ‘Digital Ethics Canvas: A Guide For Ethical Risk Assessment And Mitigation In The Digital Domain’, 2023, doi: https://doi.org/10.21427/9WA5-ZY95 .

S. Harrer, ‘Attention is not all you need: the complicated case of ethically using large language models in healthcare and medicine’, eBioMedicine , vol. 90, p. 104512, Apr. 2023, doi: https://doi.org/10.1016/j.ebiom.2023.104512 .

M. Mittermaier, M. M. Raza, and J. C. Kvedar, ‘Bias in AI-based models for medical applications: challenges and mitigation strategies’, NPJ Digit. Med., vol. 6, no. 1, p. 113, Jun. 2023, doi: https://doi.org/10.1038/s41746-023-00858-z .

Z. Guo et al., ‘Evaluating Large Language Models: A Comprehensive Survey’, 2023, doi: https://doi.org/10.48550/ARXIV.2310.19736.

A. Palmer and D. Schwan, ‘Beneficent dehumanization: Employing artificial intelligence and carebots to mitigate shame-induced barriers to medical care’, Bioethics , vol. 36, no. 2, pp. 187–193, Feb. 2022, doi: https://doi.org/10.1111/bioe.12986 .

K. V. Garvey, K. J. Thomas Craig, R. Russell, L. L. Novak, D. Moore, and B. M. Miller, ‘Considering Clinician Competencies for the Implementation of Artificial Intelligence–Based Tools in Health Care: Findings From a Scoping Review’, JMIR Med. Inform , vol. 10, no. 11, p. e37478, Nov. 2022, doi: https://doi.org/10.2196/37478 .

P. Esmaeilzadeh, ‘Use of AI-based tools for healthcare purposes: a survey study from consumers’ perspectives’, BMC Med. Inform. Decis. Mak., vol. 20, no. 1, p. 170, Dec. 2020, doi: https://doi.org/10.1186/s12911-020-01191-1 .

No funding was received for this project.

Open access funding provided by Bern University of Applied Sciences

Author information

Authors and Affiliations

Institute Patient-centered Digital Health, Bern University of Applied Sciences, Quellgasse 21, Biel, 2502, Switzerland

Kerstin Denecke

Harz University of Applied Sciences, Friedrichstraße 57-59, 38855, Wernigerode, Germany

Richard May

Instituto de Ingeniería Informática (I3US), Universidad de Sevilla, Sevilla, Spain

Octavio Rivera-Romero

Department of Electronic Technology, Universidad de Sevilla, Avda Reina Mercedes s/n, ETSI Informática, G1.43, Sevilla, 41012, Spain

Contributions

KD, ORR, and RM designed the study; RM and KD drafted the questions, which were commented upon and revised by ORR; KD invited experts to complete the questionnaire; KD and ORR conducted the thematic analysis, with conflicts resolved by RM; KD wrote the initial paper draft, which was extended by RM and ORR; all authors agreed with publication of the manuscript. ORR prepared the two figures. KD prepared the tables.

Corresponding author

Correspondence to Kerstin Denecke .

Ethics declarations

Competing Interests

The authors declare no competing interests.

Ethics Approval

The study design was submitted to the ethics committee of the Canton of Bern, which confirmed that no ethics approval was necessary (Req-2023-00427).

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and Permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Denecke, K., May, R. & Rivera-Romero, O. Transformer Models in Healthcare: A Survey and Thematic Analysis of Potentials, Shortcomings and Risks. J Med Syst 48 , 23 (2024). https://doi.org/10.1007/s10916-024-02043-5

Received : 17 November 2023

Accepted : 10 February 2024

Published : 17 February 2024

DOI : https://doi.org/10.1007/s10916-024-02043-5

  • Large Language Model
  • Transformer Models
  • Artificial Intelligence
  • Generative Artificial Intelligence

Open access | Published: 18 May 2022

A study of transformer-based end-to-end speech recognition system for Kazakh language

Mamyrbayev Orken, Oralbekova Dina, Alimhan Keylan, Turdalykyzy Tolganay & Othman Mohamed

Scientific Reports, volume 12, Article number: 8337 (2022)

Subjects: Computer science, Information technology, Scientific data

Today, the Transformer model, which allows parallelization and has its own internal attention mechanism, is widely used in the field of speech recognition. The great advantages of this architecture are its fast training speed and the absence of sequential operation, unlike recurrent neural networks. In this work, Transformer models and an end-to-end model based on connectionist temporal classification were considered to build a system for automatic recognition of Kazakh speech. Kazakh belongs to the group of agglutinative languages and has limited data for implementing speech recognition systems. Some studies have shown that the Transformer model improves system performance for low-resource languages. Our experiments revealed that the joint use of Transformer and connectionist temporal classification models improved the performance of the Kazakh speech recognition system, and with an integrated language model it achieved a best character error rate of 3.7% on a clean dataset.

Introduction

Innovative information and digital technologies are increasingly making their way into the life of the modern person: this applies to deep learning systems for voice and image recognition as well as speech recognition and synthesis. Speech technologies in particular are widely used in communications, robotics and other areas of professional activity. Speech recognition is a way to interact with technology: it recognizes individual words or text and converts them into a sequence of words or commands. Traditional speech recognition systems are based on an acoustic model, a language model and a lexicon. The acoustic model (AM) was built from hidden Markov models (HMMs) with Gaussian mixture models (GMMs), and the language model (LM) was based on n-gram models. The components of these systems were trained separately, which made them difficult to manage and configure and reduced their efficiency. With the advent of deep learning, the performance of speech-to-text systems has improved. Artificial neural networks began to be used for acoustic modeling instead of GMMs, which led to the improved results reported in many research works 1, 2, 3. Thus, the HMM-DNN architecture has become one of the most common models for continuous speech recognition.

Currently, the end-to-end (E2E) model has become widespread. The E2E structure presents the system as a single neural network, unlike the traditional one, which has several independent components 4, 5. An E2E system maps acoustic signals directly to a sequence of labels without intermediate states and without the need for subsequent output processing, which makes it easy to implement. To increase the performance of E2E systems, the main tasks to be solved concern the definition of the model architecture, the collection of a sufficiently large speech corpus with appropriate transcriptions, and the availability of high-performance hardware. Solving these issues ensures the successful implementation not only of speech recognition systems but also of other deep learning systems. In addition, E2E systems can significantly improve recognition quality by learning from large amounts of training data.

Models based on connectionist temporal classification 6 (CTC) and models based on the attention mechanism 7 are illustrative examples of end-to-end systems. In a CTC-based model, there is no need for frame-level alignment between acoustics and transcription, since a special token, an “empty label”, determines the beginning and end of each phoneme 8. In attention-based encoder/decoder models, the encoder acts as an AM, converting input speech into a high-level representation; the attention mechanism is an alignment model that determines which encoded frames are relevant to the current output; and the decoder operates autoregressively, predicting each output token from previous predictions 9. The above E2E models are based on convolutional and modified recurrent neural networks (RNNs). Models implemented with RNNs perform computations over the symbol positions of the input and output data, generating a sequence of hidden states, each depending on the previous hidden state of the network. This sequential process prevents parallelization across training examples, which becomes a problem for longer input sequences and makes network training take much longer. In 10, a different, Transformer-based model was proposed that allows parallelization of the learning process; it also removes recurrence and uses internal self-attention to find the dependencies between the input and output data. The big advantage of this architecture is its fast training and the absence of sequential operation, as in RNNs. Previous studies 11, 12 revealed that the combined use of Transformer models and an E2E model such as CTC improved the quality of English and Chinese speech recognition systems.

It should be noted that the attention mechanism is a common method that greatly improves system quality in machine translation and speech recognition, and the Transformer model uses this mechanism to speed up training. The model has its own internal attention, which relates all positions of the input sequence to one another to find a representation of the sequence, without requiring explicit alignments. In addition, the Transformer does not have to process a sequence strictly from start to end.

Implementing such models requires a large amount of speech data for training, which is problematic for languages with limited training data, such as Kazakh, an agglutinative language. To date, systems based on the CTC model 13, 14 have been developed for recognizing Kazakh speech with different sets of training data. The use of other methods and models to improve the accuracy of Kazakh speech recognition is a promising direction and can improve the performance of the recognition system even with a small training sample.

The main goal of our study is to improve the accuracy of automatic recognition of continuous Kazakh speech by increasing the amount of training data and by using models based on the Transformer and CTC.

The structure of the work is as follows: Sect. 2 presents traditional methods of speech recognition, and Sect. 3 provides an analytical review of the field. Section 4 describes the principles of operation of the Transformer-based model and the model we propose. Sections 5 and 6 describe our experimental data, speech corpus and equipment, and analyze the results obtained. Conclusions are given in the final section.

Traditional speech recognition methods

Traditional sequence recognition focused on estimating the maximum a posteriori probability. Formally, this approach is a transformation of a sequence of acoustic speech characteristics X into a sequence of words W. The acoustic characteristics are a sequence of feature vectors of length T, \(X = \{x_t \in \mathbb{R}^D \mid t = 1, \dots, T\}\), and the sequence of words is defined as \(W = \{w_n \in V \mid n = 1, \dots, N\}\) of length N, where V is a vocabulary. The most probable word sequence \(W^*\) can be estimated by maximizing \(P(W \mid X)\) over all possible word sequences \(V^*\) 15. This process can be represented by the following expression:

\(W^* = \operatorname{argmax}_{W \in V^*} P(W \mid X)\)  (1)

Therefore, the main goal of automatic speech recognition (ASR) is to find a suitable model that accurately determines the posterior distribution \(P(W \mid X)\).

The process of automatic speech recognition consists of sequences of the following steps:

Extraction of features from the input signal.

Acoustic modeling (determines which phones were pronounced for subsequent recognition).

Language modeling (checks the correspondence of spoken words to the most likely sequences).

Decoding a sequence of words spoken by a person.

The most important parts of a speech recognition system are the feature extraction and recognition methods. Feature extraction is a process that distills a small amount of data essential for solving the problem. To extract features, Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) algorithms are commonly used 16, 17, 18, with MFCC being the most popular.

In the speech recognition task, the original signal is converted into feature vectors, on the basis of which classification will then be performed.
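For illustration, the following is a minimal sketch of MFCC feature extraction, assuming the librosa library and a hypothetical input file "utterance.wav"; the frame and coefficient settings are typical choices, not values taken from this paper.

import librosa

# Load audio at a 16 kHz sampling rate, a common choice for ASR front ends.
signal, sr = librosa.load("utterance.wav", sr=16000)

# Compute 13 MFCCs per 25 ms frame with a 10 ms hop.
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

# mfcc has shape (13, T): one 13-dimensional feature vector per frame,
# i.e. the sequence X = (x_1, ..., x_T) used throughout this section.
print(mfcc.shape)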

Acoustic model

The acoustic model (AM) uses deep neural networks and hidden Markov models. A deep neural network, a convolutional neural network (CNN), or a long short-term memory network (a variant of the recurrent neural network) is used to map the acoustic frame \(x_t\) to the corresponding phonetic state \(f_t\) at each input time t:

\(f_t = \mathrm{DNN}(x_t), \quad t = 1, \dots, T\)  (2)

Before this acoustic modeling procedure, the output targets of the neural network models, a sequence of phonetic states at the frame level \(f_{1:T}\), are generated by the HMM and GMM in special training procedures. The GMM models the acoustic elements at the frame level \(x_{1:T}\), and the HMM estimates the most probable sequence of phonetic states \(f_{1:T}\).

The acoustic model is optimized for the cross-entropy error, which is the phonetic classification error per frame.

Language model

The language model p(W) models the most probable sequences of words regardless of acoustics:

\(p(W) = \prod_{u} p(w_u \mid w_{<u})\)  (3)

where \(w_{<u}\) denotes the previously recognized words.

Currently, RNNs or LSTMs are commonly used for the language model architecture, as they can capture longer-term dependencies than traditional n-gram models, which rely on the Markov assumption and are limited to a fixed n-word window of history.
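As a toy illustration of Eq. (3) under the Markov assumption \(p(w_u \mid w_{<u}) \approx p(w_u \mid w_{u-1})\), the following sketch estimates bigram probabilities from counts; the corpus and sentence are invented.

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate p(word | prev) = c(prev, word) / c(prev).
    return bigrams[(prev, word)] / unigrams[prev]

# p(W) for W = "the cat sat", conditioned on the first word, factorizes
# into bigram terms: p(cat | the) * p(sat | cat) = 2/3 * 1/2 = 1/3.
sentence = ["the", "cat", "sat"]
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= bigram_prob(prev, word)
print(p)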

Hidden Markov models

For a long time, a system based on hidden Markov models (HMM) was the main model for continuous speech recognition. The HMM mechanism can be used not only in acoustic modeling but also in the language model. But in general, the use of the HMM model gives a greater advantage when modeling the acoustic component.

In this HMM, the acoustic feature is the observation and the phonetic state is the latent state. For an HMM with state set {1, …, J}, the HMM-based model applies Bayes’ theorem and introduces the HMM state sequence \(S = \{s_t \in \{1, \dots, J\} \mid t = 1, \dots, T\}\) for \(p(L \mid X)\):

\(p(L \mid X) \propto \sum_{S} p(X \mid S)\, p(S \mid L)\, p(L)\)  (4)

Here \(p(X \mid S)\), \(p(S \mid L)\) and \(p(L)\) in Eq. (4) correspond to the acoustic model, the pronunciation model and the language model, respectively.

The acoustic model \(p(X \mid S)\) indicates the probability of observing X given the hidden sequence S. According to the probability chain rule and the observation independence hypothesis in the HMM (observations at any time depend only on the hidden state at that time), \(p(X \mid S)\) can be decomposed into the following form:

\(p(X \mid S) = \prod_{t=1}^{T} p(x_t \mid s_t)\)  (5)

In the acoustic model, \(p(x_t \mid s_t)\) is the observation probability, which is usually represented by mixtures of Gaussian distributions. The posterior probability distribution of the hidden state, \(p(s_t \mid x_t)\), can be computed by deep neural networks.

Two approaches, HMM-GMM and HMM-DNN, can be used to calculate \(p(X \mid S)\) in Eq. (5). The HMM-GMM approach was for a long time the main method for building speech-to-text technology. With the development of deep learning, DNNs were introduced into speech recognition for acoustic modeling. The role of the DNN is to compute the posterior probability of the HMM state, which can be converted into observation probabilities, replacing the usual GMM observation probability. Consequently, the transition from HMM-GMM to the hybrid HMM-DNN model has yielded excellent recognition results, and it has become a popular ASR architecture.

Hybrid models have some important limitations. For example, ANNs with more than two hidden layers were rarely used due to computational performance limitations, and the context-dependent model described above relies on numerous effective methods developed for GMM-HMM systems.

The learning process is complex and difficult for global optimization. Components of traditional models are usually trained on different datasets and methods.

Hybrid models based on DNN-HMM

To calculate \(P(x_t \mid s_t)\) directly, the GMM was used, because this model makes it possible to model the distribution for each state, yielding probability values for input sequences. However, in practice these assumptions cannot always be modeled by a GMM. DNNs have shown significant improvements over GMMs due to their ability to learn nonlinear functions, but a DNN cannot directly provide a conditional likelihood. Instead, the frame-by-frame posterior distribution is used to turn the likelihood \(P(x_t \mid s_t)\) into a classification problem \(P(s_t \mid x_t)\) using a pseudo-likelihood trick as a joint probability approximation 15:

\(P(x_t \mid s_t) \approx \frac{P(s_t \mid x_t)}{P(s_t)}\)  (6)

The application of this probability is referred to as a “hybrid architecture”. The numerator is a DNN classifier trained with feature vectors \(x_t\) as input and target states \(s_t\) as output. The denominator \(P(s_t)\) is the prior probability of the state \(s_t\). Frame-by-frame training requires frame-by-frame alignment, with \(x_t\) as input and \(s_t\) as target. This alignment is usually obtained from a weaker HMM/GMM system or from human-made dictionaries. The quality and quantity of the alignment labels are usually the most significant limitations of the hybrid approach.
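A small numeric sketch of the pseudo-likelihood trick in Eq. (6), with invented posterior and prior values for three HMM states:

import numpy as np

# DNN posteriors P(s_t | x_t) for one frame over three HMM states.
posterior = np.array([0.7, 0.2, 0.1])
# State priors P(s_t), estimated from frame-level alignment counts.
prior = np.array([0.5, 0.3, 0.2])

# Scaled likelihood P(x_t | s_t) is proportional to P(s_t | x_t) / P(s_t);
# in practice the division is done in the log domain for numerical stability.
scaled_likelihood = posterior / prior
log_scaled = np.log(posterior) - np.log(prior)
print(scaled_likelihood)  # [1.4, 0.667, 0.5]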

End-to-end speech recognition models

E2E automatic speech recognition is a new neural network-based technology in the field of ASR that offers many advantages. E2E ASR is a single integrated approach with a much simpler training pipeline, with models that work at a low audio frame rate. This reduces training time and decoding time, and allows joint optimization with subsequent processing such as natural language understanding.

For the global calculation of \(P(W \mid X)\) using E2E speech recognition models, the input can be represented as a sequence of acoustic features \(X = (x_1, \dots, x_T)\), the sequence of target labels as \(y = (y_1, \dots, y_T)\), and the word sequence as \(W = (w_1, \dots, w_M)\).

Thus, the ANN finds the probabilities \(P(\cdot \mid x_1), \dots, P(\cdot \mid x_T)\), where the arguments of these probabilities are representations of a sequence of words, i.e. labels.

The basic principle is that modern E2E models are trained on big data. This exposes the main problem: the recognition of languages with limited training data, such as Kazakh, Kyrgyz and Turkish. For such low-resource languages, no large corpora of training data exist.

Related work/literature review

The Transformer model was first introduced in 8 in order to reduce sequential computation and the number of operations needed to relate input and output positions. Experiments were conducted on machine translation tasks, from English to German and from English to French, and the model achieved good performance compared with existing results. Moreover, the Transformer works well for other tasks with both large and limited training data and is very fruitful for all kinds of seq2seq tasks.

The use of Transformer for speech-to-text conversion also showed good results and was reflected in the following research papers:

To implement a faster and more accurate ASR system, Karita et al. 11 combined the Transformer with advances from RNN-based ASR. To build the model, connectionist temporal classification (CTC) was combined with the Transformer for joint training and decoding. This approach speeds up learning and facilitates LM integration. The proposed ASR system achieves significant improvements in various ASR tasks: for example, it lowered the WER from 11.1% to 4.5% on the Wall Street Journal and from 16.1% to 11.6% on TED-LIUM by introducing CTC and LM integration into the Transformer baseline.

Moritz et al. 19 proposed a Transformer-based model for streaming speech recognition that does not require an entire speech utterance as input. Time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism were applied to generate output as words are spoken. The architecture achieved the best published results for E2E streaming speech recognition at the time: 2.8% and 7.3% WER on the LibriSpeech “clean” and “other” test data.

The Weak-Attention Suppression (WAS) method was proposed by Shi et al. 20; it dynamically induces sparse attention probabilities. The method suppresses attention to uncritical and redundant consecutive acoustic frames and is more likely to suppress past frames than future ones. The proposed method was shown to reduce WER compared with baseline Transformers: on LibriSpeech, WAS reduced WER by 10% on the clean test set and by 5% on the other test set for streaming Transformers, setting a new state of the art among streaming models.

Dong et al. 21 presented a Speech-Transformer system using a 2D attention mechanism that jointly processes the time and frequency axes of 2D speech inputs, thereby providing more expressive representations for the Speech-Transformer. The Wall Street Journal (WSJ) corpus was used as training data. The experimental results showed that this model reduces training time while providing a competitive WER.

Gangi et al. 22 proposed an SLT-adapted Transformer, an architecture for spoken language translation that processes long input sequences with low information density to solve ASR problems. The adaptation was based on downsampling the input with convolutional neural networks and modeling the two-dimensional nature of the audio spectrogram with 2D components. Experiments show that the SLT-adapted Transformer outperforms an RNN-based baseline in both translation quality and training time, providing high performance across six language directions.

Hori et al. 23 extended the Transformer architecture with a context window and trained it in monologue and dialogue scenarios. Monologue tests on CSJ and TED-LIUM3 and dialogue tests on SWITCHBOARD and HKUST were used, and the results surpass the baseline single-utterance E2E ASR both with and without speaker i-vectors.

In the E2E system of Chang et al. 24, the RNN-based encoder-decoder model was replaced by the Transformer architecture. To use this model in the masking network of the neural beamformer in the multi-channel case, the self-attention component was modified so that it is limited to a segment rather than the entire sequence, reducing the amount of computation. In addition to the architectural improvements, external dereverberation preprocessing, weighted prediction error (WPE), was included, which allows the model to process reverberated signals. Experiments with the extended wsj1-2mix corpus show that the Transformer-based models achieve better results in echo-free conditions in both single-channel and multi-channel modes.

Transformer architecture

The Transformer model was first created for machine translation, replacing recurrent neural networks (RNNs) in natural language processing (NLP) tasks. In this model, recurrence was completely eliminated; instead, for each utterance, features are built using the internal attention mechanism (self-attention) to identify the significance of other sequence positions for that utterance. The generated features for a given utterance are therefore the result of linear transformations of the significant sequence features.

The Transformer model consists of one large block, which in turn consists of encoder and decoder blocks (Fig. 1). The encoder takes as input the feature vectors from the audio signal, \(X = (x_1, \dots, x_T)\), and outputs a sequence of intermediate representations. Based on the received representations, the decoder then produces the output sequence \(W = (w_1, \dots, w_M)\). At each step, the model uses the previously generated symbols to output the next one, because it is autoregressive. The Transformer architecture uses several interconnected layers of self-attention in the encoder and decoder blocks. Consider each block individually.

Figure 1: General scheme of the model.

Encoder and decoder networks

Conventional E2E encoder/decoder models for speech recognition tasks consist of a single encoder, a single decoder and an attention mechanism. The encoder converts the vector of acoustic features into an alternative representation, the decoder predicts a sequence of labels from the information provided by the encoder, and attention highlights the parts of the frame sequence that are significant for predicting the output. In contrast, the Transformer model can have several encoders and decoders, each containing its own internal attention mechanism.

An encoder block consists of a stack of encoders; typically six encoders are used, stacked one above the other. The number of encoders is not fixed, and one can experiment with an arbitrary number of encoders in a block. All encoders have the same structure but different weights. The encoder input receives feature vectors extracted from the audio signal, obtained using Mel-frequency cepstral coefficients or convolutional neural networks. The first encoder transforms these data using self-attention into a set of vectors and passes its outputs through a feed-forward ANN to the next encoder. The last encoder processes the vectors and transfers the encoded features to the decoder block.

A decoder block is a set of decoders, usually identical in number to the encoders. Each encoder can be divided into two sublayers: the input data first passes through a multi-head attention layer, which helps the encoder look at other words in the incoming sentence while encoding a particular word. The output of the multi-head attention layer is sent to a feed-forward neural network; exactly the same network is applied independently to each word in the sentence.

The decoder also contains these two layers, but between them lies an additional attention layer that helps the decoder focus on the significant parts of the incoming sentence, similar to the usual attention mechanism in seq2seq models. This component takes the previously generated characters or words into account and, based on them, outputs the posterior probabilities of the subsequent characters or words.

Self-attention mechanism

The Transformer model includes Scaled Dot-Product Attention 10 . The advantages of self-attention are fast computation, a shorter path between words and potential interpretability. This attention involves three vectors, queries Q, keys K and values V, together with scaling (7):

$$Attention(Q,K,V)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$

where \({d}_{k}\) is the dimension of the keys. These parameters are used to calculate attention. Multi-head attention combines several self-attention maps into a single matrix computation (8):

$$MultiHead(Q,K,V)=Concat({s}_{1},\dots ,{s}_{h}){W}^{O}$$

Here \({s}_{h}=Attention(Q{W}_{h}^{Q}, K{W}_{h}^{K}, V{W}_{h}^{V})\), h is the number of attention heads in the layer, and \({W}_{h}^{Q}, {W}_{h}^{K}, {W}_{h}^{V}, {W}^{O}\) are trained weight matrices.
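Under the definitions above, equations (7) and (8) can be sketched in a few lines of NumPy; the sequence length, model width, head count and random weights below are illustrative assumptions, not values from the paper.

```python
# Scaled dot-product attention (7) and multi-head attention (8), NumPy sketch.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled similarity of queries to keys
    return softmax(scores) @ V        # weighted average of the values

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    # s_h = Attention(Q Wq[h], K Wk[h], V Wv[h]); heads are concatenated
    heads = [attention(Q @ Wq[h], K @ Wk[h], V @ Wv[h]) for h in range(len(Wq))]
    return np.concatenate(heads, axis=-1) @ Wo

T, d, h = 5, 16, 4                    # positions, model width, heads (assumed)
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))           # one feature vector per position
Wq = rng.normal(size=(h, d, d // h))
Wk = rng.normal(size=(h, d, d // h))
Wv = rng.normal(size=(h, d, d // h))
Wo = rng.normal(size=(d, d))
print(multi_head(x, x, x, Wq, Wk, Wv, Wo).shape)   # -> (5, 16)
```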

Multi-head attention can also be viewed from the standpoint of optimization. Using this mechanism, one can avoid problems associated with unsuccessful initialization and improve training speed. In addition, after training, some of the attention heads can be removed without affecting decoding quality. The number of heads in the model regulates the attention mechanisms. This mechanism also gives the network easy access to any information in the input, regardless of the length of the sequence or the number of words in it.

The Transformer architecture also contains a Normalize element, which is needed to normalize the feature values, since after the attention mechanism these values can differ widely in scale. Layer Normalization is usually used for this purpose (Fig. 2).

Figure 2. Transformer model.

The outputs of the individual heads can also differ, so the spread of values in the final vector can be large. To prevent this, an approach was proposed 11 in which the values at each position are transformed with a two-layer perceptron. After the attention mechanism is applied, the values are projected to a larger dimension using trained weights, transformed by the nonlinear ReLU activation function, and then projected back to the original dimension, after which the next normalization occurs.
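A compact sketch of this sublayer follows: each position is projected to a larger dimension, passed through ReLU, projected back, and then normalized together with the residual input. The sizes d_model and d_ff are assumptions for illustration, not the paper's configuration.

```python
# Position-wise feed-forward sublayer with residual + Layer Normalization.
import torch
import torch.nn as nn

class FeedForwardSublayer(nn.Module):
    def __init__(self, d_model=256, d_ff=1024):  # assumed, not the paper's sizes
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)       # projection to the larger dimension
        self.down = nn.Linear(d_ff, d_model)     # back to the original dimension
        self.norm = nn.LayerNorm(d_model)        # the Normalize element

    def forward(self, x):
        return self.norm(x + self.down(torch.relu(self.up(x))))

out = FeedForwardSublayer()(torch.randn(2, 100, 256))  # (batch, frames, d_model)
```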

Proposed model

Typically, connectionist temporal classification (CTC) is used as a loss function to train recurrent neural networks to recognize input speech without pre-aligning the input and output data 11 . To achieve high performance from a CTC model, an external language model is necessary, since direct decoding does not work correctly. Moreover, the Kazakh language has a rather rich word-formation mechanism, so the use of a language model contributes to an increase in the quality of Kazakh speech recognition.

In this work, we jointly use the Transformer and CTC models with a language model (LM). Using the LM with CTC during decoding leads to rapid model convergence, which reduces decoding time and improves system performance. After receiving the output from the encoder, the CTC function finds by formula (9) the probability summed over arbitrary alignments between the encoder output and the output symbol sequence:

$$P(W|x)=\sum_{\gamma \in {R}^{-1}(W)}P(\gamma |x)$$

Here \(x\) is the output vector of the encoder, R is an operator that removes blanks and repeated symbols (so that \({R}^{-1}(W)\) is the set of alignments mapping to W), and \(\gamma\) is a sequence of predicted symbols. The sum over all alignments in this equation is computed with dynamic programming, which makes it possible to train the neural network on data without frame-level alignment.
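The sum in (9) need not be implemented by hand: for instance, PyTorch's nn.CTCLoss computes it with exactly this kind of dynamic programming. The sketch below uses random stand-ins for the encoder outputs and the reference symbol sequence.

```python
# CTC objective of Eq. (9) via PyTorch's built-in loss; all data are stand-ins.
import torch
import torch.nn as nn

T, B, C = 50, 1, 32                         # frames, batch size, symbol classes
log_probs = torch.randn(T, B, C).log_softmax(-1)  # stand-in encoder outputs
targets = torch.randint(1, C, (B, 10))      # a 10-symbol reference sequence W
ctc = nn.CTCLoss(blank=0)                   # blank symbol handled by operator R
loss = ctc(log_probs, targets,
           input_lengths=torch.tensor([T]),
           target_lengths=torch.tensor([10]))
```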

The general structure of the resulting model is shown in Fig.  3 .

Figure 3. The structure of our model.

During training, a multi-task loss was used that combines the probabilities of the two branches through their negative logarithms, as presented in 10 . Thus, the resulting model can be represented by the following expression (10):

$$L=-\lambda \log {P}_{ctc}(W|x)-(1-\lambda )\log {P}_{att}(W|x)$$

where \(\lambda\) is a configurable parameter satisfying the condition \(0\le \lambda \le 1\).
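Expression (10) amounts to a lambda-weighted sum of the two negative log-likelihoods; a minimal sketch, with placeholder tensors standing in for the CTC and attention terms:

```python
# Joint multi-task objective of Eq. (10); loss values are placeholders.
import torch

lam = 0.3                                   # tunable lambda, 0 <= lambda <= 1
ctc_nll = torch.tensor(2.1)                 # stand-in for -log P_ctc(W|x)
att_nll = torch.tensor(1.4)                 # stand-in for -log P_att(W|x)
loss = lam * ctc_nll + (1 - lam) * att_nll
```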

The following additions have been included to improve model performance:

(1) Using a character-level language model in feature extraction. Convolutional neural networks were used to extract the features. To extract high-dimensional features from the audio data, we take the outputs of the network up to its last hidden CNN layer, with softmax as the activation function. Next, a max-pooling layer was added to suppress noise signals through dimensionality reduction; it collapses the feature maps into a vector and also lowers the processing power required for subsequent computation. Adapting the training with a character-level language model, without disturbing the structure of the neural network during training, allows us to preserve maximum non-linearity for subsequent processing. Thus, our extracted features are already high-level, and there is no need to map the raw data to phonemes (a sketch of such a front-end is given after this list).

(2) Applying a language model at the level of words and phrases during decoding together with CTC.
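A hedged sketch of the CNN front-end from item (1): convolutional layers followed by max-pooling, which suppresses noise and reduces the dimensionality of the features passed on. The layer sizes and input shape are assumptions, not the authors' exact configuration.

```python
# Assumed CNN feature extractor with max-pooling for dimensionality reduction.
import torch
import torch.nn as nn

frontend = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                  # halves the time and frequency axes
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
spec = torch.randn(1, 1, 80, 400)     # (batch, channel, mel bins, frames), assumed
print(frontend(spec).shape)           # -> torch.Size([1, 64, 20, 100])
```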

To measure the quality of the Kazakh speech recognition system, two metrics were used: the character error rate (CER), since characters are the most common and simplest output units for generating text, and the word error rate (WER) 25 .
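Both metrics reduce to the Levenshtein edit distance 25 between a reference and a hypothesis, computed over words for WER and over characters for CER; a minimal sketch with illustrative example strings:

```python
# Edit-distance-based WER and CER; the example strings are illustrative.
def edit_distance(ref, hyp):
    # dynamic programming over deletions, insertions and substitutions
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

print(wer("қазақ тілі", "казак тілі"))   # -> 0.5 (one of two words wrong)
print(cer("қазақ тілі", "казак тілі"))   # -> 0.2 (two of ten characters wrong)
```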

Data availability

Not applicable.

Seide, F., Li, G. & Yu, D. Conversational speech transcription using context-dependent deep neural networks. In Interspeech (2011).

Bourlard, H., & Morgan, N. Connectionist speech recognition: A hybrid approach. p. 352 (1993) https://doi.org/10.1007/978-1-4615-3210-1 .

Smit, P., Virpioja, S. & Kurimo, M. Advances in subword-based HMM-DNN speech recognition across languages. Comput. Speech Lang. 66 , 1. https://doi.org/10.1016/j.csl.2020.101158 (2021).


Wang, D., Wang, X. & Lv, S. An overview of end-to-end automatic speech recognition. Symmetry 11 , 1018. https://doi.org/10.3390/sym11081018 (2019).

Mamyrbayev, O. & Oralbekova, D. Modern trends in the development of speech recognition systems. News of the National academy of sciences of the republic of Kazakhstan 4 (332), 42–51 (2020).


Graves, A., Fernandez, S., Gomez, F. & Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, Pittsburgh, USA (2006).

Chan, W., Jaitly, N., Le, Q. V. & Vinyals, O. Listen, attend and spell. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016).

Cui, X., & Gong, Y. Variable parameter Gaussian mixture hidden Markov modeling for speech recognition. 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)., 2003, pp. I-I. https://doi.org/10.1109/ICASSP.2003.1198704 .

Yan, Y., Qi, W., Gong, Y., Liu, D., Duan, N., Chen, J., Zhang, R. & Zhou, M. ProphetNet: Predicting future N-gram for sequence-to-sequence pre-training. arXiv:2001.04063 (2020).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010 (2017).

Karita, S., Soplin, N. E. Y., Watanabe, S., Delcroix, M., Ogawa, A., & Nakatani, T. (2019). Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019-September, 1408–1412. https://doi.org/10.21437/Interspeech.2019-1938 .

Miao, H., Cheng, G., Gao, C., Zhang, P., & Yan, Y. Transformer-Based Online CTC/Attention End-To-End Speech Recognition Architecture. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6084–6088 (2020). https://doi.org/10.1109/ICASSP40776.2020.9053165 .

Mamyrbayev, O., Oralbekova, D., Kydyrbekova, A., Turdalykyzy, T., & Bekarystankyzy, A. End-to-End Model Based on RNN-T for Kazakh Speech Recognition. In 2021 3rd International Conference on Computer Communication and the Internet (ICCCI), 2021, pp. 163–167. https://doi.org/10.1109/ICCCI51764.2021.9486811 .

Mamyrbayev, O., Alimhan, K., Oralbekova, D., Bekarystankyzy, A. & Zhumazhanov, B. Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level. Eastern-Eur. J. Enterpris. Technol. 19 (115), 84–92 (2022).

Kamath, U., Liu, J. & Whitaker, J. Deep Learning for NLP and Speech Recognition (Springer, 2019).


El-Henawy, I. M., Khedr, W. I., Elkomy, O. M. & Abdalla, A.-Z. M. I. Recognition of phonetic Arabic figures via wavelet-based Mel Frequency Cepstrum using HMMs. HBRC J. 10 (1), 49–54 (2014).

Mohan, B. J. & Ramesh Babu, N. Speech recognition using MFCC and DTW. International Conference on Advances in Electrical Engineering (ICAEE) 1 , 1–4. https://doi.org/10.1109/ICAEE.2014.6838564 (2014).

Dave, N. Feature extraction methods LPC, PLP and MFCC in speech recognition. Int. J. Adv. Res. Eng. Technol. 1 , 1 (2013).

Moritz, N., Hori, T., & Le, J. Streaming Automatic Speech Recognition with the Transformer Model. In ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6074–6078. https://doi.org/10.1109/ICASSP40776.2020.9054476 .

Shi, Y., Wang, Y., Wu, C., Fuegen, C., Zhang, F., Le, D., Yeh, C., & Seltzer, M. Weak-attention suppression for transformer-based speech recognition. ArXiv abs/2005.09137 (2020).

Dong, L., Xu, S., & Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 5884–5888 (2018).

Gangi, M. A. D., Negri, M., Cattoni, R., Dessì, R., & Turchi, M. Enhancing Transformer for End-to-end Speech-to-Text Translation. MTSummit (2019).

Hori, T., Moritz, N., Hori, C., & Roux, J. L. Transformer-based Long-context End-to-end Speech Recognition. INTERSPEECH 2020, Shanghai, China (2020).

Chang, X., Zhang, W., Qian, Y., Le Roux, J. & Watanabe, S. End-to-end multi-speaker speech recognition with Transformer. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020).

Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Doklady 10 , 707–710 (1966).


Mamyrbayev, O. et al. Development of security systems using DNN and i & x-vector classifiers. East.-Eur. J. Enterpris. Technol. 49 (112), 32–45 (2021).

Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv (2014). http://arxiv.org/abs/1412.6980 (date of request: 18.04.2021).

LeCun, Y., Bottou, L., Orr, G. B. & Müller, K.-R. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–50 (1998).


Acknowledgements

This research has been funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP08855743).


Author information

Authors and affiliations

Institute of Information and Computational Technologies CS MES RK, Almaty, Kazakhstan

Mamyrbayev Orken, Oralbekova Dina, Alimhan Keylan & Turdalykyzy Tolganay

Satbayev University, Almaty, Kazakhstan

Oralbekova Dina

L.N. Gumilyov Eurasian National University, Nur-Sultan, Kazakhstan

Alimhan Keylan

Universiti Putra Malaysia, Kuala Lumpur, Malaysia

Othman Mohamed


Contributions

O.M. built a model, applied transfer learning to the implemented recognition model and participated in the preparation of the manuscript; K.A. and M.O. carried out the analysis of the literature on the topic under study; D.O. built an end-to-end model based on the Transformer, participated in the research and prepared the manuscript; T.T. prepared the data for training; D.O. helped to develop the program. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Oralbekova Dina.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Orken, M., Dina, O., Keylan, A. et al. A study of transformer-based end-to-end speech recognition system for Kazakh language. Sci Rep 12 , 8337 (2022). https://doi.org/10.1038/s41598-022-12260-y


Received: 28 December 2021

Accepted: 05 May 2022

Published: 18 May 2022

DOI: https://doi.org/10.1038/s41598-022-12260-y
