Microsoft unveils "fifth most powerful" supercomputer in the world

Last year, Microsoft entered a partnership with OpenAI, investing $1 billion into the research firm. Through this collaboration, the Redmond giant planned to develop Azure AI capabilities in large-scale systems.

At its Build 2020 developer conference today, Microsoft has unveiled the fifth most powerful publicly recorded supercomputer in the world. Built exclusively with and for OpenAI as a product of the aforementioned collaboration, the supercomputer hosted in Azure helps specifically support the training of large-scale AI models.

Microsoft CTO Kevin Scott expressed delight at reaching this milestone, noting:

"The exciting thing about these models is the breadth of things they’re going to enable. This is about being able to do a hundred exciting things in natural language processing at once and a hundred exciting things in computer vision, and when you start to see combinations of these perceptual domains, you’re going to have new applications that are hard to even imagine right now."

The key difference between the newer types of learning models to many of the others developed by the AI research community are that these excel at handling a variety of tasks at the same time, such as different components involving language, grammar, context, and more. Microsoft Turing models and supercomputing resources are planned to be made available through Azure AI services and GitHub to assist developers in leveraging their power.

The supercomputer developed by Microsoft and OpenAI to enable the training of such large scale models hosts over 285,000 CPU cores, 10,000 GPUs, and 400Gb/s network connectivity for each GPU server. It also features a "robust modern cloud infrastructure" along with access to Azure services, rapid deployment, sustainable datacenters, and more. Microsoft has compared it to other systems on the TOP500 supercomputers rankings list in order to back its "top five" statement.

The Microsoft Turing model for natural language generation utilizes 17 billion parameters, allowing the process of "self-supervised" learning to be carried out in a much more nuanced manner. Similarly, these next-generation AI models also offer another advantage in that they only need to be trained once with a huge amount of data and supercomputing resources. For different tasks, these can simply be fine-tuned using much smaller datasets.

On the advancement to this "AI at Scale" initiative, Scott noted:

"By developing this leading-edge infrastructure for training large AI models, we're making all of Azure better. We're building better computers, better distributed systems, better networks, better datacenters. All of this makes the performance and cost and flexibility of the entire Azure cloud better."

Other related announcements include the introduction of a new open source deep learning library for PyTorch, named DeepSpeed, and distributed training support for ONNX Runtime. The former reduces computing power required to train large distributed models, while the latter adds support for model training and up to 17 times performance improvements to the current version of ONNX Runtime.