Large Language Models: RoBERTa — A Robustly Optimized BERT Approach

Learn about key techniques used for BERT optimisation

Vyacheslav Efimov

Published in Towards Data Science

5 min read · Sep 24, 2023

Introduction

The appearance of the BERT model led to significant progress in NLP. Deriving its architecture from the Transformer, BERT is pretrained with language modeling and next sentence prediction and achieves state-of-the-art results on various downstream tasks: question answering, NER tagging, etc.

Despite the excellent performance of BERT, researchers continued experimenting with its configuration in the hope of achieving even better metrics. They succeeded and presented a new model called RoBERTa (Robustly Optimized BERT Approach).

Throughout this article, we will refer to the official RoBERTa paper, which contains in-depth information about the model. In simple words, RoBERTa consists of several independent improvements over the original BERT model, while all other principles, including the architecture, stay the same. All of these improvements are covered and explained in this article.

1. Dynamic masking

As we remember from BERT’s architecture, during pretraining BERT performs language modeling by trying to predict a certain percentage of masked tokens. The problem with the original implementation is that masking is applied once, during data preprocessing, so a given text sequence is seen with the same masked tokens in several epochs.

More precisely, the training dataset is duplicated 10 times, so each sequence is masked in only 10 different ways. Given that BERT runs 40 training epochs, each sequence is passed to BERT with the same masking four times. The researchers found that dynamic masking, where a new masking pattern is generated every time a sequence is passed to the model, performs comparably or slightly better. Overall, this reduces duplication in the training data and exposes the model to a wider variety of masking patterns.
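The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the code used by the authors: the helper names (`mask_tokens`, `static_masking`, `dynamic_masking`) are hypothetical, the 15% masking rate is taken from the BERT paper, and every selected token is simply replaced by `[MASK]`.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of `tokens` with about `mask_prob` of the positions masked."""
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

def static_masking(tokens, epochs=40, duplicates=10, seed=0):
    """Masks are fixed once during preprocessing and reused across epochs."""
    rng = random.Random(seed)
    precomputed = [mask_tokens(tokens, rng) for _ in range(duplicates)]
    # With 40 epochs and 10 duplicates, each precomputed mask is seen 4 times.
    return [precomputed[epoch % duplicates] for epoch in range(epochs)]

def dynamic_masking(tokens, epochs=40, seed=0):
    """A new mask is sampled every time the sequence is passed to the model."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(epochs)]

tokens = [f"tok{i}" for i in range(20)]
static = static_masking(tokens)
dynamic = dynamic_masking(tokens)
print(len({tuple(s) for s in static}))   # at most 10 distinct masking patterns
print(len({tuple(s) for s in dynamic}))  # up to 40 distinct masking patterns
```

With static masking the model can never see more than 10 patterns per sequence, no matter how long training lasts, while dynamic masking keeps producing new ones.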

2. Next sentence prediction

The authors of the paper investigated the best way to handle the next sentence prediction task and to construct input sequences. They reported several valuable insights:

Removing the next sentence prediction loss results in slightly better performance.
Passing single natural sentences into BERT input hurts the performance, compared to passing sequences consisting of several sentences. One of the most likely hypotheses explaining this phenomenon is that it is difficult for a model to learn long-range dependencies when relying only on single sentences.
It is more beneficial to construct input sequences by sampling contiguous sentences from a single document rather than from multiple documents. Sequences are normally built from contiguous full sentences so that the total length is at most 512 tokens. The question is what to do at the end of a document: the researchers compared stopping at the document boundary with continuing to sample the first sentences of the next document (adding a separator token between documents). The results showed that the first option is slightly better.

Ultimately, for the final RoBERTa implementation, the authors adopted the first two findings and set aside the third. Despite the improvement observed with the third, the researchers did not proceed with it because it would have complicated the comparison with previous implementations. Stopping at a document boundary means that an input sequence may contain fewer than 512 tokens. To keep a similar number of tokens across batches, the batch size has to be increased in such cases. This leads to a variable batch size and more complex comparisons, which the researchers wanted to avoid.
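The two ways of building input sequences can be sketched as follows. This is a simplified illustration rather than the authors’ preprocessing code: `pack_sequences` and its `cross_documents` flag are hypothetical names, documents are assumed to be already split into tokenized sentences, and a separator may end up at the very end of a sequence if the next sentence does not fit.

```python
SEP = "[SEP]"

def pack_sequences(documents, max_tokens=512, cross_documents=True):
    """Pack contiguous sentences into sequences of at most `max_tokens` tokens.

    documents: list of documents, each a list of sentences (lists of tokens).
    cross_documents=True  -> keep filling a sequence from the next document,
                             inserting a separator token between documents.
    cross_documents=False -> stop at the document boundary, so the last
                             sequence of a document may be shorter.
    """
    sequences, current = [], []
    for doc in documents:
        if current:
            if cross_documents and len(current) < max_tokens:
                current.append(SEP)
            else:
                sequences.append(current)
                current = []
        for sentence in doc:
            sentence = sentence[:max_tokens]  # truncate overly long sentences
            if current and len(current) + len(sentence) > max_tokens:
                sequences.append(current)
                current = []
            current.extend(sentence)
    if current:
        sequences.append(current)
    return sequences

docs = [[["a"] * 3, ["b"] * 3], [["c"] * 3]]
print(pack_sequences(docs, max_tokens=10, cross_documents=True))
print(pack_sequences(docs, max_tokens=10, cross_documents=False))
```

In the second call the sequences are shorter on average, which is exactly why a fixed number of tokens per batch would require a variable batch size.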

3. Increasing batch size

Recent advancements in NLP showed that increasing the batch size, together with an appropriate increase of the learning rate and a decrease of the number of training steps, usually tends to improve the model’s performance.

As a reminder, the BERT base model was trained with a batch size of 256 sequences for a million steps. The authors tried training BERT with batch sizes of 2K and 8K, and the latter value was chosen for training RoBERTa. The corresponding number of training steps and the learning rate became 31K and 1e-3, respectively.
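A quick calculation shows why the number of steps shrinks as the batch grows: the total number of sequences processed stays roughly the same. The snippet below is a back-of-the-envelope check and assumes that “8K” means 8,192 sequences; the function name is hypothetical.

```python
def sequences_processed(batch_size, steps):
    """Total number of training sequences seen over the whole run."""
    return batch_size * steps

bert_setup = sequences_processed(256, 1_000_000)      # 256 sequences, 1M steps
large_batch_setup = sequences_processed(8 * 1024, 31_000)  # 8K sequences, 31K steps

print(bert_setup)                      # 256000000
print(large_batch_setup)               # 253952000
print(large_batch_setup / bert_setup)  # ratio close to 1
```

Both settings therefore correspond to a comparable computational budget, which makes their results comparable.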

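It is also important to keep in mind that large batches are easier to parallelize with distributed data-parallel training. Even without large-scale parallel hardware, they can be emulated through a technique called “gradient accumulation”, where gradients from several mini-batches are accumulated before each optimization step.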
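Gradient accumulation can be illustrated with a toy example. The sketch below uses a one-parameter model y = w * x with a mean squared error loss, which is an assumption made purely for illustration (nothing here comes from the RoBERTa code base). It shows that accumulating weighted gradients over small micro-batches gives the same gradient as one large batch.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_batch_size):
    """Gradient over the full batch, computed one micro-batch at a time."""
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch.
        total += grad_mse(w, micro) * len(micro) / len(batch)
    return total

data = [(x, 3.0 * x) for x in range(1, 9)]  # 8 examples of y = 3x
full = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
print(abs(full - accumulated) < 1e-9)  # True
```

In a real training loop the optimizer step would be applied only after all micro-batches of one large batch have been processed.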

4. Byte text encoding

In NLP there exist three main types of text tokenization:

Character-level tokenization
Subword-level tokenization
Word-level tokenization

The original BERT uses subword-level tokenization with a vocabulary size of 30K, which is learned after input preprocessing and relies on several heuristics. RoBERTa uses bytes instead of Unicode characters as the base units for subwords and expands the vocabulary size to 50K without any additional preprocessing or tokenization of the input. This results in approximately 15M and 20M additional parameters for the BERT base and BERT large models respectively. The byte-level encoding introduced in RoBERTa demonstrates slightly worse results on some tasks than the original one.

Nevertheless, the byte-based vocabulary of RoBERTa makes it possible to encode any word or subword without using the unknown token, in contrast to BERT. This gives a considerable advantage to RoBERTa, as the model can better handle complex texts containing rare words.
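The reported parameter increase can be checked with simple arithmetic, assuming that the extra parameters come from the additional rows of the token embedding matrix. The hidden sizes of 768 (base) and 1024 (large) are taken from the BERT paper, the vocabulary sizes are rounded to 30K and 50K, and the function name is hypothetical.

```python
def extra_embedding_params(old_vocab, new_vocab, hidden_size):
    """Additional embedding parameters when the vocabulary grows."""
    return (new_vocab - old_vocab) * hidden_size

base_extra = extra_embedding_params(30_000, 50_000, hidden_size=768)
large_extra = extra_embedding_params(30_000, 50_000, hidden_size=1024)

print(base_extra)   # 15360000, about 15M
print(large_extra)  # 20480000, about 20M
```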
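The reason byte-based units never need an unknown token can be shown with a toy comparison. The sketch below is not the RoBERTa tokenizer: it only contrasts a hypothetical character-level vocabulary with raw UTF-8 bytes, which are the base symbols that byte-level subwords are built from.

```python
UNK = "[UNK]"

def char_level_encode(text, vocab):
    """Map each character to itself, or to [UNK] if it is not in the vocabulary."""
    return [ch if ch in vocab else UNK for ch in text]

def byte_level_encode(text):
    """Map text to UTF-8 byte values; every value is in the range 0..255."""
    return list(text.encode("utf-8"))

def byte_level_decode(byte_ids):
    """Recover the original text from its UTF-8 byte values."""
    return bytes(byte_ids).decode("utf-8")

toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
text = "naïve café 😊"

print(char_level_encode(text, toy_vocab))  # contains [UNK] for ï, é and the emoji
byte_ids = byte_level_encode(text)
print(byte_level_decode(byte_ids) == text)  # True
```

Since only 256 base symbols exist, any input string can be represented, and the learned subwords merely make that representation shorter.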

Pretraining

RoBERTa applies all four aspects described above with the same architecture parameters as BERT large. The total number of parameters of RoBERTa is 355M.

RoBERTa is pretrained on a combination of five massive datasets resulting in a total of 160 GB of text data. In comparison, BERT large is pretrained only on 13 GB of data. Finally, the authors increase the number of training steps from 100K to 500K.

As a result, RoBERTa outperforms BERT large and XLNet large on the most popular benchmarks.

RoBERTa versions

Analogously to BERT, the researchers developed two versions of RoBERTa. Most of the hyperparameters in the base and large versions are the same. The figure below demonstrates the principal differences:

The fine-tuning process in RoBERTa is similar to BERT’s.

Conclusion

In this article, we have examined an improved version of BERT which modifies the original training procedure by introducing the following aspects:

dynamic masking
omitting the next sentence prediction objective
training on longer sequences
increasing vocabulary size
training for longer with larger batches over more data

The resulting RoBERTa model outperforms its predecessors on top benchmarks. Despite a more complex training configuration, RoBERTa adds only about 15M (base) to 20M (large) parameters, maintaining an inference speed comparable to BERT’s.

Resources

RoBERTa: A Robustly Optimized BERT Pretraining Approach

All images unless otherwise noted are by the author
