Microsoft shows how it combines Azure with NVIDIA chips to make AI supercomputers

Microsoft is promoting its efforts to build supercomputers on its Azure cloud computing platform to help OpenAI with its ChatGPT chatbot. At the same time, it also announced a new AI virtual machine that uses upgraded GPUs from NVIDIA.

Microsoft's new ND H100 v5 VM uses NVIDIA's H100 GPUs, an upgrade from the previous A100 GPUs. Companies that need to add AI features to their products can access this virtual machine series, which offers the following (a brief sketch for checking regional availability follows the list):

  • 8x NVIDIA H100 Tensor Core GPUs interconnected via next-gen NVSwitch and NVLink 4.0
  • 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU with 3.2 Tb/s per VM in a non-blocking fat-tree network
  • NVSwitch and NVLink 4.0 with 3.6 TB/s bisectional bandwidth between 8 local GPUs within each VM
  • 4th Gen Intel Xeon Scalable processors
  • PCIe Gen5 host-to-GPU interconnect with 64 GB/s bandwidth per GPU
  • 16 channels of 4800 MHz DDR5 DIMMs
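
For developers curious whether those sizes are offered in their Azure region, a rough check along the lines below should work with the Azure SDK for Python; the azure-identity and azure-mgmt-compute packages, the subscription ID and the region name are illustrative assumptions, not details from Microsoft's announcement.

# Hedged sketch: list the VM sizes Azure offers in one region and filter
# for H100-class names such as the ND H100 v5 series. Subscription ID and
# region are placeholders; credentials come from the local Azure login.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"  # placeholder
REGION = "eastus"                           # placeholder region

credential = DefaultAzureCredential()
compute = ComputeManagementClient(credential, SUBSCRIPTION_ID)

# virtual_machine_sizes.list() enumerates every size sold in the region;
# filtering on "H100" surfaces the ND H100 v5 family where it is available.
for size in compute.virtual_machine_sizes.list(location=REGION):
    if "H100" in size.name:
        print(size.name, size.number_of_cores, "vCPUs",
              size.memory_in_mb // 1024, "GiB RAM")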

This is in addition to Microsoft's previously announced ChatGPT in Azure OpenAI Service, which lets third parties access the chatbot tech through Azure.
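
For a rough sense of what that access looks like to a developer, a request to a ChatGPT deployment in Azure OpenAI Service could resemble the sketch below, using the Azure client in the openai Python package; the endpoint, key, API version and deployment name are placeholders, not values from Microsoft's announcement.

# Hedged sketch: send a chat request to a ChatGPT (gpt-35-turbo) deployment
# hosted in Azure OpenAI Service. All identifiers below are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-azure-openai-key>",
    api_version="2023-05-15",  # assumed API version; check your resource
)

response = client.chat.completions.create(
    model="<your-chatgpt-deployment-name>",  # the deployment name, not the model family
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what the Azure ND H100 v5 VM offers."},
    ],
)
print(response.choices[0].message.content)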

In a separate blog post, Microsoft describes how it first started working with OpenAI to build the supercomputers needed for ChatGPT's large language model (and for Microsoft's own Bing Chat). That meant linking thousands of GPUs together in an all-new way. The post offers an explanation from Nidhi Chappell, Microsoft's head of product for Azure high-performance computing and AI:

To train a large language model, she explained, the computation workload is partitioned across thousands of GPUs in a cluster. At certain phases in this computation – called allreduce – the GPUs exchange information on the work they’ve done. An InfiniBand network accelerates this phase, which must finish before the GPUs can start the next chunk of computation.
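
To make that allreduce phase a little more concrete, the gradient exchange is typically expressed as a single collective call, sketched below with PyTorch's distributed package over NCCL as an illustrative stand-in; the launcher, tensor shapes and averaging step are assumptions, not details from Microsoft's post.

import os
import torch
import torch.distributed as dist

# Hedged sketch: the allreduce step in data-parallel training, where each GPU
# contributes its locally computed gradients and receives the sum from all the
# others. Assumes a launcher such as torchrun has set RANK, WORLD_SIZE and
# LOCAL_RANK; NCCL then moves data over NVLink inside a node and InfiniBand
# between nodes.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in for the gradient shard this GPU produced during its backward pass.
grads = torch.randn(1024, device=f"cuda:{local_rank}")

# Every rank blocks here until the summed gradients arrive, which is why the
# interconnect matters: the next chunk of computation cannot start until this
# collective finishes.
dist.all_reduce(grads, op=dist.ReduceOp.SUM)
grads /= dist.get_world_size()  # average, as data-parallel training usually does

dist.destroy_process_group()

Launched with something like torchrun --nproc_per_node=8 train.py, each of the eight GPUs in one ND H100 v5 VM would run a copy of this process.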

This hardware is paired with software that helps optimize the use of both the NVIDIA GPUs and the network that keeps them all working together. Microsoft says it is continuing to add GPUs and expand its network, while also trying to keep the systems running 24/7 via cooling systems, backup generators and uninterruptible power supply systems.
