LaBSE, a new language-agnostic embedding model, supports 109 languages with SOTA accuracy

The fields of natural language processing (NLP) and natural language generation (NLG) have benefited greatly from the inception of the transformer architecture. Transformer models like BERT and its derivatives have been applied to a range of domains including sentiment analysis and classification.

In recent years, significant effort has gone into making these models even more robust, particularly by extending masked language model (MLM) pre-training and combining it with translation language modeling (TLM) to make the models language-agnostic. While this combination of MLM and TLM has proved helpful for fine-tuning on downstream tasks, it has not so far directly produced multilingual sentence embeddings, which are critical for translation tasks.

With this in mind, researchers at Google have now debuted a multilingual BERT embedding model called "Language-agnostic BERT Sentence Embedding", or LaBSE for short, which produces language-agnostic cross-lingual sentence embeddings for 109 languages with a single model. In short, LaBSE combines the venerable MLM and TLM pre-training, on a 12-layer transformer with a vocabulary of 500,000 tokens, with a translation ranking task using bi-directional dual encoders.

*LaBSE's dual-encoder architecture. Image via Google AI*
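
To make the translation ranking idea concrete, here is a minimal sketch, not the researchers' implementation. It assumes each sentence has already been encoded into a vector, that the other sentences in a batch serve as negatives, and that the score is a plain dot product of normalized vectors. The function names are hypothetical, and any margin or scaling a real training setup might use is left out.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def ranking_loss(queries, candidates):
    """Mean softmax cross-entropy where candidates[i] is the true match of queries[i].

    Every other candidate in the batch acts as a negative example.
    """
    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, c) for c in candidates]
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(queries)

def bidirectional_ranking_loss(src, tgt):
    """Rank targets given sources and sources given targets, then sum both losses."""
    src = [l2_normalize(v) for v in src]
    tgt = [l2_normalize(v) for v in tgt]
    return ranking_loss(src, tgt) + ranking_loss(tgt, src)

# Toy batch: src[i] and tgt[i] stand in for a sentence and its translation.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[1.0, 0.1], [0.1, 1.0]]
aligned = bidirectional_ranking_loss(src, tgt)
misaligned = bidirectional_ranking_loss(src, list(reversed(tgt)))
```

On this toy batch the correctly paired embeddings give a lower loss than the swapped ones, which is the signal a dual encoder is trained to minimize in both directions.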

To train the model, the researchers used 17 billion monolingual sentences and 6 billion bilingual sentence pairs. Once trained, LaBSE was evaluated on the Tatoeba corpus, where the model was tasked with finding the nearest-neighbor translation of a given sentence using cosine distance.
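
The nearest-neighbor evaluation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual benchmark code: the embeddings are small hand-written vectors rather than LaBSE outputs, `targets[i]` is assumed to be the correct translation of `sources[i]`, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    # Assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor(query, candidates):
    """Index of the candidate with the highest cosine similarity to the query.

    Highest cosine similarity is the same as lowest cosine distance.
    """
    return max(range(len(candidates)),
               key=lambda j: cosine_similarity(query, candidates[j]))

def retrieval_accuracy(sources, targets):
    """Fraction of sources whose nearest target is the one at the same index."""
    hits = sum(1 for i, s in enumerate(sources)
               if nearest_neighbor(s, targets) == i)
    return hits / len(sources)

# Toy embeddings standing in for sentences and their translations.
sources = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
targets = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]
accuracy = retrieval_accuracy(sources, targets)
```

Each toy source vector is closest to the target at the same index, so the accuracy on this data is 1.0. With real embeddings, the same measure yields the kind of average accuracy figures reported for the benchmark.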

The results showed that the model is effective even on low-resource languages for which no data was available during training. In addition, LaBSE established a new state of the art (SOTA) on multiple parallel-text (bitext) retrieval tasks. Specifically, as the number of languages increased, existing models such as m~USE and LASER showed a sharper decline in average accuracy than LaBSE.

... reduction in accuracy from the LaBSE model with increasing numbers of languages is much less significant, outperforming LASER significantly, particularly when the full distribution of 112 languages is included (83.7% accuracy vs. 65.5%).

The potential applications of LaBSE include mining parallel text from the web. The researchers applied it to CommonCrawl to find potential translations from a pool of 7.7 billion English sentences pre-processed and encoded by LaBSE. With these embeddings in place, the resulting translation model reached BLEU scores of 35.7 and 27.2, which "is only a few points away from current state-of-art-models trained on high-quality parallel data," Google wrote.
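
Mining parallel text in this way comes down to matching each sentence against a pool and keeping only the confident matches. The sketch below illustrates that idea with toy vectors. The `mine_pairs` helper and the 0.9 similarity threshold are assumptions made for illustration, not details from Google's pipeline, and a pool of billions of sentences would need an approximate nearest-neighbor index rather than this brute-force loop.

```python
import math

def cosine_similarity(u, v):
    # Assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_pairs(source_embs, pool_embs, threshold=0.9):
    """Return (source_index, pool_index, score) for confident matches only.

    The threshold is a hypothetical value; in practice it would be tuned.
    """
    mined = []
    for i, s in enumerate(source_embs):
        scored = [(cosine_similarity(s, p), j) for j, p in enumerate(pool_embs)]
        best_score, best_j = max(scored)
        if best_score >= threshold:
            mined.append((i, best_j, best_score))
    return mined

# Toy data: the first source has a close match in the pool, the second does not.
source_embs = [[1.0, 0.0], [0.7, 0.7]]
pool_embs = [[0.9, 0.1], [0.0, 1.0]]
pairs = mine_pairs(source_embs, pool_embs)
```

Here only the first source sentence is paired with a pool sentence, because the second one has no candidate above the threshold. The pairs that survive such a filter are the kind of mined data a translation model can then be trained on.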

The pre-trained model is now available on TensorFlow Hub. It can be used out of the box or fine-tuned on a dataset of your own. For further details, see the original research paper.