Reformer, Longformer, and ELECTRA: Key Updates To Transformer Architecture In 2020

The introduction of transfer learning and pretrained language models in natural language processing (NLP) pushed forward the limits of language understanding and generation. Applying transfer learning and Transformer models to downstream NLP tasks has become the main trend of recent research advances.

The leading pre-trained language models demonstrate remarkable performance on many NLP tasks, making them a welcome tool for a number of applications, including sentiment analysis, chatbots, text summarization, and so on. However, good performance usually comes at the cost of enormous computational resources that are not accessible to most researchers and business practitioners.

To address this issue, different research groups are working on improving the compute- and parameter-efficiency of pre-trained language models without sacrificing their accuracy. Among the novel approaches introduced this year, at least three have been recognized by the AI community as particularly promising. To help you stay aware of the latest NLP research advances, we have summarized the corresponding research papers in an easy-to-read bullet-point format.

If you’d like to skip around, here are the papers we featured:

  1. Reformer: The Efficient Transformer
  2. Longformer: The Long-Document Transformer
  3. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

State-of-the-art Transformers in 2020

1. Reformer: The Efficient Transformer, by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

Original Abstract 

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.

Our Summary 

The leading Transformer models have become so big that they can be realistically trained only in large research laboratories. To address this problem, the Google Research team introduces several techniques that improve the efficiency of Transformers. In particular, they suggest (1) using reversible layers to allow storing the activations only once instead of for each layer, and (2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-product attention. Experiments on several text tasks demonstrate that the introduced Reformer model matches the performance of the full Transformer but runs much faster and with much better memory efficiency.
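
To make the hashing idea more concrete, here is a minimal, illustrative sketch (not the authors' implementation) of angular locality-sensitive hashing: shared query/key vectors are hashed with random projections so that attention only needs to be computed among positions that fall into the same bucket, rather than over all O(L²) pairs. The function name lsh_buckets and the toy shapes below are assumptions made for this example.

```python
# Illustrative sketch of LSH bucketing with random projections (hypothetical names).
# Vectors pointing in similar directions tend to land in the same bucket, so
# attention can be restricted to positions that share a bucket.
import torch

def lsh_buckets(x: torch.Tensor, n_buckets: int, seed: int = 0) -> torch.Tensor:
    """Assign each of the L vectors in x (shape [L, d]) to one of n_buckets."""
    torch.manual_seed(seed)
    projections = torch.randn(x.shape[-1], n_buckets // 2)
    rotated = x @ projections                          # [L, n_buckets // 2]
    rotated = torch.cat([rotated, -rotated], dim=-1)   # [L, n_buckets]
    return torch.argmax(rotated, dim=-1)               # bucket id per position

# Toy usage: 8 positions, 16-dimensional shared query/key vectors, 4 buckets.
qk = torch.randn(8, 16)
buckets = lsh_buckets(qk, n_buckets=4)
for b in buckets.unique():
    positions = (buckets == b).nonzero(as_tuple=True)[0].tolist()
    # Attention would only be computed among the positions in this bucket.
    print(f"bucket {int(b)}: positions {positions}")
```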

Figure: Locality-Sensitive Hashing Attention, showing the hash-bucketing, sorting, and chunking steps, and the resulting causal attentions, together with the corresponding attention matrices (a–d)

What’s the core idea of this paper?

  • The leading Transformer models require huge computational resources because of the very high number of parameters and several other factors:
    • The activations of every layer need to be stored for back-propagation.
    • The intermediate feed-forward layers account for a large fraction of memory use since their hidden dimension is often much larger than that of the attention activations.
    • The complexity of attention on a sequence of length L is O(L²).
  • To address these problems, the research team introduces the Reformer model with the following improvements:
    • using reversible layers to store only a single copy of activations (see the sketch after this list);
    • splitting activations inside the feed-forward layers and processing them in chunks;
    • approximating attention computation based on locality-sensitive hashing.
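
As a rough illustration of the reversible-layers point above (a sketch of the idea, not the Reformer code), the block below shows how a layer's inputs can be recomputed exactly from its outputs, which is what makes it unnecessary to store per-layer activations for back-propagation. The class name ReversibleBlock and the two small feed-forward sub-networks are hypothetical.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Toy reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1).
    Because the update can be inverted, x1 and x2 can be recomputed from
    (y1, y2) during the backward pass instead of being stored."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Sanity check: the inputs are recovered exactly from the outputs.
block = ReversibleBlock(dim=16)
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
with torch.no_grad():
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))
```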

What’s the key achievement?

  • By analyzing the introduced techniques one by one, the authors show that model accuracy is not sacrificed by:
    • switching to locality-sensitive hashing attention;
    • using reversible layers.
  • Reformer performs on par with the full Transformer model while demonstrating much higher speed and memory efficiency:
    • For example, on the newstest2014 English-to-German machine translation task, the Reformer base model gets a BLEU score of 27.6, compared to the 27.3 reported by Vaswani et al. (2017).

What does the AI community think?

  • The paper was selected for oral presentation at ICLR 2020, the leading conference in deep learning.

What are possible business applications?

  • The suggested efficiency improvements enable more widespread Transformer application, especially for the tasks that depend on large-context data, such as:
    • text generation;
    • visual content generation;
    • music generation;
    • time-series forecasting.

Where can you get implementation code?

  • The official code implementation from Google is publicly available on GitHub.
  • The PyTorch implementation of Reformer is also available on GitHub.

2. Longformer: The Long-Document Transformer, by Iz Beltagy, Matthew E. Peters, Arman Cohan

Original Abstract 

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.

Our Summary 

Self-attention is one of the key factors behind the success of Transformer architecture. However, it also makes transformer-based models hard to apply to long documents. The existing techniques usually divide the long input into a number of chunks and then use complex architectures to combine information across these chunks. The research team from the Allen Institute for Artificial Intelligence introduces a more elegant solution to this problem. The suggested Longformer model employs an attention pattern that combines local windowed attention with task-motivated global attention. This attention mechanism scales linearly with the sequence length and enables processing of documents with thousands of tokens. The experiments demonstrate that Longformer achieves state-of-the-art results on character-level language modeling tasks, and when pre-trained, consistently outperforms RoBERTa on long-document tasks.
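
To illustrate this attention pattern (a simplified sketch, not the paper's custom CUDA implementation), the snippet below builds a boolean mask in which each token attends to a local window around itself, while a few designated global positions attend to, and are attended by, every token. The function name longformer_mask, the window size, and the choice of global positions are assumptions made for this example.

```python
import torch

def longformer_mask(seq_len: int, window: int, global_positions) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask: True where attention is allowed."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # Local sliding-window attention: each token sees +/- window neighbours.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    # Task-motivated global attention: chosen positions see and are seen by all.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

mask = longformer_mask(seq_len=16, window=2, global_positions=[0])  # e.g. a [CLS]-like token
# The number of allowed attention pairs grows roughly linearly with seq_len,
# instead of quadratically as with full self-attention.
print(mask.sum().item(), "allowed pairs out of", 16 * 16)
```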

Figure: Full self-attention pattern vs. Longformer’s configuration of attention patterns

What’s the core idea of this paper?

  • The computational requirements of self-attention grow quadratically with sequence length, making long sequences hard to process on current hardware.
  • To address this issue, the researchers present Longformer, a modified version of Transformer architecture that:
    • allows memory usage to scale linearly, and not quadratically, with the sequence length;
    • includes an attention mechanism that combines:
      • a windowed local-context self-attention to build contextual representations;
      • an end task motivated global attention to encode inductive bias about the task and build full sequence representation.
  • Since the sliding window attention pattern requires a form of banded matrix multiplication that is not supported in existing deep learning libraries like PyTorch and TensorFlow, the authors also introduce a custom CUDA kernel to implement these attention operations.

What’s the key achievement?

  • The Longformer model achieves a new state of the art on character-level language modeling tasks:
    • BPC of 1.10 on text8;
    • BPC of 1.00 on enwik8.
  • After pre-training and fine-tuning on six tasks, including classification, question answering, and coreference resolution, the Longformer-base consistently outperforms the RoBERTa-base with:
    • accuracy of 75.0 vs. 72.4 on WikiHop;
    • F1 score of 75.2 vs. 74.2 on TriviaQA;
    • joint F1 score of 64.4 vs. 63.5 on HotpotQA;
    • average F1 score of 78.6 vs. 78.4 on the OntoNotes coreference resolution task;
    • accuracy of 95.7 vs. 95.3 on the IMDB classification task;
    • F1 score of 94.0 vs. 87.4 on the Hyperpartisan classification task.
  • The performance gains are especially remarkable for the tasks that require a long context (i.e., WikiHop and Hyperpartisan).

What are future research areas?

  • Exploring other attention patterns that are more efficient due to dynamic adaptation to the input.
  • Applying Longformer to other relevant long document tasks such as summarization.

What are possible business applications?

  • The Longformer architecture can be very advantageous for the downstream NLP tasks that often require processing of long documents:
    • document classification;
    • question answering;
    • coreference resolution;
    • summarization;
    • semantic search.

Where can you get implementation code?

  • The code implementation of Longformer is open-sourced on GitHub.
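
For a quick start, here is a brief usage sketch that assumes the Hugging Face transformers port of Longformer and the publicly released allenai/longformer-base-4096 checkpoint; treat it as an orientation rather than the repository's official example.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "Long documents of thousands of tokens can be encoded in a single pass. " * 50
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Mark the first token for global attention (useful for classification-style tasks);
# all other tokens use the sliding-window local attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)  # [1, seq_len, hidden_size]
```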

3. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Original Abstract 

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30× more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

Our Summary 

The pre-training task for popular language models like BERT and XLNet involves masking a small subset of unlabeled input and then training the network to recover this original input. Even though it works quite well, this approach is not particularly data-efficient, as it learns from only a small fraction of tokens (typically ~15%). As an alternative, the researchers from Stanford University and Google Brain propose a new pre-training task called replaced token detection. Instead of masking, they suggest replacing some tokens with plausible alternatives generated by a small language model. The model is then pre-trained as a discriminator that predicts whether each token is an original or a replacement. As a result, the model learns from all input tokens instead of the small masked fraction, making it much more computationally efficient. The experiments confirm that the introduced approach leads to significantly faster training and higher accuracy on downstream NLP tasks.
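
The toy example below sketches the replaced-token-detection objective described above. It is purely illustrative: all module and variable names are hypothetical, and the real ELECTRA generator is a small masked language model trained jointly with the discriminator, not the randomly initialized head used here.

```python
import torch
import torch.nn as nn

vocab_size, hidden, seq_len = 100, 32, 8

# Hypothetical stand-ins: a tiny "generator" head and a per-token discriminator head.
embed = nn.Embedding(vocab_size, hidden)
generator_head = nn.Linear(hidden, vocab_size)   # proposes replacement tokens
discriminator = nn.Linear(hidden, 1)             # predicts "was this token replaced?"

original = torch.randint(0, vocab_size, (1, seq_len))
masked_positions = torch.zeros(1, seq_len, dtype=torch.bool)
masked_positions[0, [2, 5]] = True               # only a small fraction is corrupted

# The generator samples plausible replacements for the masked positions.
with torch.no_grad():
    logits = generator_head(embed(original))                     # [1, seq_len, vocab]
    sampled = torch.distributions.Categorical(logits=logits).sample()
corrupted = torch.where(masked_positions, sampled, original)

# The discriminator is trained on *every* token: 1 = replaced, 0 = original.
labels = (corrupted != original).float()
scores = discriminator(embed(corrupted)).squeeze(-1)             # [1, seq_len]
loss = nn.functional.binary_cross_entropy_with_logits(scores, labels)
print(loss.item())
```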

What’s the core idea of this paper?

  • Pre-training methods that are based on masked language modeling are computationally inefficient as they use only a small fraction of tokens for learning.
  • Researchers propose a new pre-training task called replaced token detection, where:
    • some tokens are replaced by samples from a small generator network;
    • a model is pre-trained as a discriminator to distinguish between original and replaced tokens.
  • The introduced approach, called ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately):
    • enables the model to learn from all input tokens instead of the small masked-out subset;
    • is not adversarial, despite its similarity to a GAN, as the generator producing replacement tokens is trained with maximum likelihood rather than to fool the discriminator.

What’s the key achievement?

  • Demonstrating that the discriminative task of distinguishing between real data and challenging negative samples is more efficient than existing generative methods for language representation learning.
  • Introducing a model that substantially outperforms state-of-the-art approaches while requiring less pre-training compute:
    • ELECTRA-Small gets a GLUE score of 79.9 and outperforms a comparably small BERT model with a score of 75.1 and a much larger GPT model with a score of 78.8.
    • An ELECTRA model that performs comparably to XLNet and RoBERTa uses only 25% of their pre-training compute.
    • ELECTRA-Large outscores the alternative state-of-the-art models on the GLUE and SQuAD benchmarks while still requiring less pre-training compute.

What does the AI community think?

  • The paper was selected for presentation at ICLR 2020, the leading conference in deep learning.

What are possible business applications?

  • Because of its computational efficiency, the ELECTRA approach can make the application of pre-trained text encoders more accessible to business practitioners.

Where can you get implementation code?

  • The original TensorFlow implementation and pre-trained weights are released on GitHub.
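
For experimentation, here is a hedged usage sketch that assumes the Hugging Face transformers port of ELECTRA and the google/electra-small-discriminator checkpoint; the official repository is TensorFlow-based, so treat this as an orientation rather than the canonical example.

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# A sentence containing an implausible word that the discriminator may flag as replaced.
text = "The chef cooked a delicious plane for dinner"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # [1, seq_len], one score per token

probs = torch.sigmoid(logits)[0]             # probability that each token was replaced
for token, p in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), probs):
    print(f"{token:>12s}  {p.item():.2f}")
```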

Source: https://www.topbots.com/key-updates-to-transformer-architecture-2020/
