BLEURT is a metric for natural language generators boasting unprecedented accuracy

Researchers at Google have developed BLUERT, an automatic metric that gauges the performance of natural language generation models and delivers SOTA performance on two academic benchmarks.

Ather Fawaz Neowin @AtherFawaz · May 27, 2020 03:50 EDT

Natural Language Generation (NLG) models have seen tremendous growth in the last few years. Their ability to summarize and complete text has improved significantly. So has their ability to engage in conversations. One example of this would be OpenAI's model that the company decided against releasing because it had the potential to spew fake news.

Given the rapid growth of NLG models, quantifying their performance has become an important research question. Generally, there are two ways to measure the performance of these models: human evaluation and automatic metrics like the bilingual evaluation understudy (BLEU). Both have their merits and demerits—the former is labor-intensive while the latter often pales in comparison to the former's accuracy.

Considering this, researchers at Google have developed BLEURT, a novel automatic metric for natural language models that delivers ratings that are robust and reach an unprecedented level of quality close to human metrics.

At the heart of BLEURT is machine learning. And for any machine learning model, perhaps the most important commodity is the data it trains on. However, the training data for an NLG performance metric is limited. Indeed, on the WMT Metrics Task dataset, which is currently the largest collection of human ratings, contains approximately 260,000 human ratings apropos the news domain only. If it were to be used as the sole training dataset, the WMT Metric Task dataset would lead to a loss of generality and robustness of the trained model.

To help deal with this issue, the researchers employed transfer learning. To begin with, the team used the contextual word representations of BERT since it has already been successfully incorporated into NLG metrics like YiSi and BERTscore. Following this, the researchers introduced a novel pre-training scheme to increase BLEURT's robustness and accuracy. Likewise, it also helped the model cope with quality drift.

BLEURT relies on “warming-up” the model using millions of synthetic sentence pairs before fine-tuning on human ratings. We generated training data by applying random perturbations to sentences from Wikipedia. Instead of collecting human ratings, we use a collection of metrics and models from the literature (including BLEU), which allows the number of training examples to be scaled up at very low cost.

The researchers pre-trained BLEURT twice. In the first phase, their objective was language modeling while in the second phase, their objective was evaluating the NLG models. After this, the team fine-tuned the model on the WMT Metrics dataset.

Once trained, BLEURT was pitted against competing approaches and was shown to be superior to current metrics. It also correlates well with human ratings. Concretely, BLEURT was shown to be approximately 48% more accurate than BLEU on the WMT Metrics Shared Task of 2019. Other results are summarized below.

BLEURT is an attempt to "capture NLG quality beyond surface overlap", the researchers wrote. The metric yields SOTA performance on two academic benchmarks. For the future, the team is investigating its potential in improving Google products and will be investigating multilinguality and multimodality in subsequent researches.

BLUERT runs in Python 3, relies on TensorFlow, and is available on GitHub. If you are interested in finding out more, you may study the paper published on arXiv.

Tags

Trending Stories

GEEKOM Air12 2026 Edition: a refresh on what could be your next Cloud PC

Microsoft Weekly: new Surface, Windows 11 26H2, and more

EA Sports UFC 6: Brutal, satisfying, and surprisingly accessible to newcomers

AMD RX 9070 GRE benchmarked vs 9070 XT, 7800 XT and Nvidia RTX 5070, 4070

DWARF mini: the world's smallest smart telescope for night and day sky captures

007 First Light: Satisfying spy adventure that James Bond needed

Killing uBlock Origin bypasses, Euro Office faces fire, and will AI replace you?

DuRoBo Krono: E-Ink reader with great ideas that need a bit of improvement

AMD RX 9070 GRE benchmarked in AI, Blender, vs 9070 XT, 7800 XT, Nvidia RTX 5070, 4070

AI is causing headache for Linux and laying the groundwork for legal issues

I never knew Windows had a "hidden" collection of tools

Apple MacBook Neo is a blessing for Windows users, here's why

Login