
Nvidia announces TensorRT 8, slashes BERT inference times down to a millisecond


TensorRT is Nvidia's deep learning inference SDK, which the company says enables applications to run up to 40x faster than CPU-only platforms during inference. Built on CUDA's parallel programming model, TensorRT lets you optimize trained neural network models, calibrate them for lower precision with little loss of accuracy, and deploy them for research and commercial use cases.
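For those unfamiliar with the workflow: developers typically export a trained model to a format such as ONNX and compile it with TensorRT's builder into an optimized engine. The following is a minimal sketch of that flow using TensorRT 8's Python API; the file names are placeholders, and details such as workspace size will vary by model and TensorRT version.

```python
import tensorrt as trt

ONNX_PATH = "model.onnx"      # placeholder: your exported model
ENGINE_PATH = "model.engine"  # placeholder: where to store the built engine

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph into a TensorRT network definition.
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB scratch space for layer tactics
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)  # allow reduced precision where it helps

# Compile the network into a serialized, deployable engine.
engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(engine)
```

The serialized engine can then be deserialized at deployment time with a `trt.Runtime` and executed with whatever batching the application needs.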

Today, Nvidia launched the eighth generation of TensorRT. Dubbed TensorRT 8, the latest iteration of the SDK brings a host of updates and advances that will allow developers and businesses to optimize their deep learning workflows and deploy their models and products on the web.

In deployment and commercial use, the inference times of deep learning models can become bottlenecks, especially for large transformer models like BERT and GPT-3. To mitigate this, developers often shrink a model's parameter count, but the downsized model typically loses accuracy and produces lower-quality output.

Using TensorRT 8, in an industry first, Nvidia clocked an inference time of 1.2 milliseconds on BERT-Large, one of the most commonly used language models today. That is 2.5x faster than the previous generation of TensorRT running on Nvidia's V100 GPU. TensorRT 8's record-setting inference time should let businesses deploy larger versions of such language models without worrying as much about compute budgets and latency.

At the heart of this rapid inference speed lie two key advancements. First, TensorRT 8 uses a performance technique known as sparsity, supported on Nvidia's Ampere architecture, which speeds up neural network inference by reducing computational operations. It concentrates computation on the non-zero entries in the network's layers, essentially pruning away the entries that do not affect the flow of a tensor through the network. This reduces the number of operations needed in a forward pass, allowing for quick inference times during deployment.
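On Ampere, the hardware-accelerated pattern is 2:4 structured sparsity: at most two non-zero values in every group of four consecutive weights, which lets the sparse tensor cores skip half the multiply-accumulates. The toy NumPy sketch below illustrates that pruning pattern; it is a conceptual illustration, not Nvidia's actual pruning tooling.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in each group of four weights.

    Mimics the 2:4 structured-sparsity pattern that Ampere's sparse tensor
    cores accelerate. Assumes the weight count is a multiple of four.
    """
    w = weights.reshape(-1, 4).copy()
    # Column indices of the two smallest |w| entries in each group of four.
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_4(w))  # every aligned group of four now contains two zeros
```

In practice the model is pruned this way and then briefly fine-tuned, so the surviving weights compensate for the removed ones.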

The second technique, dubbed Quantization Aware Training (QAT), allows developers to run inference on trained models in INT8 precision without losing accuracy. Compared to the FP32 and FP16 precisions typically used for training and deployment, INT8 provides faster computation by reducing the precision of the numbers, which in turn reduces the compute and storage overhead on tensor cores.
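Conceptually, QAT works by inserting "fake quantization" into the training graph: values are rounded to the INT8 grid and immediately converted back to float, so the network learns weights that tolerate the rounding it will encounter at inference time. The sketch below illustrates that round-trip for one tensor; the symmetric, per-tensor scaling used here is an assumption for illustration, not TensorRT's exact scheme.

```python
import numpy as np

def fake_quant_int8(x: np.ndarray) -> np.ndarray:
    """Simulate INT8 quantization: quantize to [-127, 127], then de-quantize.

    In QAT this runs inside the forward pass during training, so the model
    learns under the same rounding error it will see at INT8 inference.
    """
    scale = max(np.abs(x).max() / 127.0, 1e-12)  # guard against all-zero input
    q = np.clip(np.round(x / scale), -127, 127)  # snap to the integer grid
    return (q * scale).astype(x.dtype)           # back to float for later layers

x = np.random.randn(4).astype(np.float32)
print(x)
print(fake_quant_int8(x))  # close to x, but limited to 255 representable levels
```

Because the quantization scales are determined during training, the exported model already carries the parameters TensorRT needs to run it in INT8.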

[Image: TensorRT 8 features and advancements]

Sparsity, QAT, and other model-specific optimizations baked into TensorRT 8 cumulatively deliver 2x the performance of its predecessor, TensorRT 7. And while using INT8 to speed up inference is not a new concept, QAT delivers twice the INT8 calculation accuracy of the last generation.
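In TensorRT 8's builder API, both optimizations are opt-in flags on the builder configuration. A minimal sketch, continuing the engine-building example above and assuming the model has already been 2:4-pruned and trained with QAT:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Let TensorRT select sparse tensor-core kernels for 2:4-pruned weights.
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)

# Run eligible layers in INT8. A QAT model carries its own quantization
# scales, so no separate post-training calibration pass is needed.
if builder.platform_has_fast_int8:
    config.set_flag(trt.BuilderFlag.INT8)
```

Both flags are hints rather than guarantees: TensorRT only picks a sparse or INT8 kernel for a layer when it actually measures faster than the dense or higher-precision alternative.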

[Image: TensorRT 8 features and use cases]

In other exciting news for developers, Hugging Face, the company behind the renowned and ubiquitous open-source library for transformer models, will officially support TensorRT 8 and expects to reach the 1 ms inference times sometime later this year.

“AI models are growing exponentially more complex, and worldwide demand is surging for real-time applications that use AI. That makes it imperative for enterprises to deploy state-of-the-art inferencing solutions,” said Greg Estes, Vice President of Developer Programs at Nvidia. “The latest version of TensorRT introduces new capabilities that enable companies to deliver conversational AI applications to their customers with a level of quality and responsiveness that was never before possible.”
