
NVIDIA registers the world's quickest BERT training time and largest transformer-based model

NVIDIA Corporation, the behemoth in the world of graphics processing units (GPUs), announced today that it had clocked the world's fastest training time for BERT-Large at 53 minutes and had also trained GPT-2 8B, the world's largest transformer-based Natural Language Processing (NLP) model, boasting a whopping 8.3 billion parameters.

To achieve the unprecedented results, the California-based tech company employed its DGX SuperPOD, a supercomputer that houses 96 NVIDIA DGX-2H servers containing 1,536 NVIDIA Tesla V100 SXM3 GPUs. Jargon aside, for the casual reader: NVIDIA used an extraordinarily powerful machine capable of churning through computationally demanding tasks very efficiently.

The NVIDIA DGX SuperPOD. Image via NVIDIA Developer Blogs

BERT is a state-of-the-art NLP network that is well suited to language understanding tasks like sentiment analysis, sentence classification, question answering, and translation. A key advantage of this network is that it does not need labeled data for pre-training. For example, BERT is typically pre-trained on a dataset comprising approximately 3.3 billion words. NVIDIA trained the BERT-Large network in just 53 minutes. The official blog wrote:

The NVIDIA DGX SuperPOD with 92 DGX-2H nodes set a new record by training BERT-Large in just 53 minutes. This record was set using 1,472 V100 SXM3-32GB GPUs and 10 Mellanox Infiniband adapters per node, running PyTorch with Automatic Mixed Precision to accelerate throughput.
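The Automatic Mixed Precision mentioned in the quote refers to running most of the training math in half precision (FP16) while using loss scaling to keep small gradients from vanishing. NVIDIA's record run relied on its Apex extension; the minimal sketch below uses PyTorch's built-in torch.cuda.amp API (available since PyTorch 1.6) to illustrate the same idea. The tiny model and random data are placeholders for illustration only, not NVIDIA's BERT setup.

```python
# Minimal sketch of mixed-precision training in PyTorch.
# The model and data here are toy stand-ins, not BERT.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# GradScaler applies loss scaling so small FP16 gradients don't flush to zero.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(10):
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 2, (32,), device=device)

    optimizer.zero_grad()
    # autocast runs eligible ops in half precision on the GPU.
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = loss_fn(model(inputs), targets)

    scaler.scale(loss).backward()   # backward pass on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then takes an optimizer step
    scaler.update()                 # adjusts the scale factor for the next iteration
```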

Another category of transformer-based NLP networks is used for generative language modeling. These models are designed to predict and subsequently generate text based on those predictions. Think of an algorithm that writes an entire email based on the first paragraph. In an effort to create larger transformer-based models of this category for NLP, NVIDIA's Project Megatron scaled the 1.5 billion parameter GPT-2 model to one that is 24 times the size of BERT and 5.6 times the size of its predecessor. The resulting model, called GPT-2 8B, comprises 8.3 billion parameters.
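To make the "predict, then generate" loop concrete, here is an illustrative sketch that generates text from a prompt using the publicly available (small) GPT-2 checkpoint via the Hugging Face transformers library. The library and checkpoint are assumptions for illustration; the article's GPT-2 8B model is far larger and is not downloadable this way.

```python
# Illustrative text generation with the small public GPT-2 checkpoint.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Hi team, just a quick update on the project:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# The model repeatedly predicts the next token and appends it,
# which is exactly the predict-and-generate loop described above.
output_ids = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```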

NVIDIA trained GPT-2 8B in native PyTorch, with the official blog describing the setup as follows:

The model was trained using native PyTorch with 8-way model parallelism and 64-way data parallelism on 512 GPUs. GPT-2 8B is the largest Transformer-based language model ever trained, at 24x the size of BERT and 5.6x the size of GPT-2.
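The "8-way model parallelism" in the quote means that a single copy of the model is split across eight GPUs, while "64-way data parallelism" means 64 such copies each process different batches. The toy sketch below shows only the arithmetic behind Megatron-style model parallelism: a linear layer's weight matrix is split column-wise into shards that would live on different GPUs, each shard computes part of the output, and the pieces are concatenated. A real implementation also needs torch.distributed communication; this single-process example is not NVIDIA's code.

```python
# Toy illustration of column-wise (Megatron-style) model parallelism
# for a single linear layer, run in one process for simplicity.
import torch

torch.manual_seed(0)
batch, d_in, d_out, shards = 4, 16, 32, 2

x = torch.randn(batch, d_in)
full_weight = torch.randn(d_in, d_out)

# Reference: the unsharded layer.
reference = x @ full_weight

# "Model parallel": each shard holds a slice of the weight's columns.
weight_shards = torch.chunk(full_weight, shards, dim=1)
partial_outputs = [x @ w for w in weight_shards]   # one matmul per (virtual) GPU
combined = torch.cat(partial_outputs, dim=1)       # an all-gather in a real setup

print(torch.allclose(reference, combined, atol=1e-5))  # True
```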

These models, and the supercomputers used to train them, have gained significant traction in NLP because they can handle massive datasets and make accurate predictions. Optimizations to accelerate the training of BERT and other transformer-based models are available for free on NVIDIA GPU Cloud. You can explore further details in the original blog post.
