Microsoft Says It’s Built a Top-Five Supercomputer in the Cloud for AI

Future machine learning models will be enormous. The OpenAI system shows that hyperscale cloud platforms can take them on.

Mary Branscombe

May 21, 2020

5 Min Read
OpenAI Co-Founder and CEO Sam Altman speaks onstage during TechCrunch Disrupt San Francisco 2019.
OpenAI Co-Founder and CEO Sam Altman speaks onstage during TechCrunch Disrupt San Francisco 2019.Steve Jennings/Getty Images for TechCrunch

It doesn’t have a clever name or an official Top500 ranking, but the supercomputer Microsoft built on Azure as a “dream system” for the non-profit AI research group OpenAI shows that cloud is a serious contender for the most demanding compute workloads, as AI models become extremely large and complex.

It took Microsoft – a major investor in the non-profit co-founded among others by Elon Musk – only six months to design and deploy the supercomputer. That’s a much faster design cycle than for most supercomputers, and it was possible because it leveraged the Azure infrastructure (and supply chain), combining “the best of a dedicated appliance and a hyperscale cloud infrastructure,” Microsoft CTO, Kevin Scott, said.

Speaking this week during the company’s Build conference, held virtually, Scott described the supercomputer as “a single system with more than 285,000 CPU cores, 10,000 GPUs, and 400 gigabits per second of network connectivity for each GPU server."

The OpenAI system hasn’t been formally submitted for assessment to Top500, the organization that ranks the world’s fastest supercomputers. But Scott (whose career includes internships at both Cray and the US National Center for Supercomputing Applications) suggested in a press briefing that it would make the top five. “We can’t make definitive claims, but we believe that it's probably right around the fifth most powerful machine in the world,” he said.

Related:What’s the Best Computing Infrastructure for AI?

What Could Be Under the Hood?

Microsoft hasn’t shared any more specifics about the system’s design, but Scott said it emphasized fast interconnection between CPUs and GPUs for distributed training. It’s built with Nvidia GPUs, interconnected with the GPU maker’s NVLink technology through an NVSwitch within each node, and through InfiniBand across the nodes.

Microsoft handed it over to OpenAI last year, according to Scott.

A version of its building blocks is already available on Azure to the general public as NDv2 instances, which have eight interconnected V100 Tensor Core GPUs and 32GB of memory, designed to scale up to “supercomputer-sized clusters” with up to 800 GPUs (using a dedicated 100Gbps Ethernet storage network with NVMe SSD NFS storage). In fact, Microsoft and Nvidia announced the offering as a “GPU-accelerated supercomputer” in November. But the instances comprising the OpenAI system appear to have a different configuration.  

bld20 microsoft luis vargas supercomputer.png

Microsoft's Luis Vargas speaking during the Build 2020 virtual conference.

The NDv2 instances have Mellanox 100Gbps InfiniBand, and, as Moor Insights & Strategy senior analyst Karl Freund pointed out, the 40-core Xeon Platinum CPUs powering the NDv2 instances don’t add up to enough cores. Instead of the pricier 56-core Xeons, he suggested, Microsoft likely used the much cheaper 64-core AMD Epyc 2 CPUs codenamed “Rome.”

Related:Why Microsoft Is Betting on FPGAs for Machine Learning at the Edge

“This marks a significant accomplishment for Microsoft and OpenAI and should help advance research into the development of ever larger AI models,” Freund told DCK. “If in fact the system uses Nvidia V100 and AMD EPYC CPUs, and I am convinced it does, this would also make for a very cost-effective high-performance platform and could vault Microsoft Azure AI into a leadership position.”

OpenAI’s long-term goal is Artificial General Intelligence rather than the specific tasks that AI can achieve today. The supercomputer certainly won’t deliver that, but it will speed up the large-scale models emerging as state of the art for language translation, machine comprehension, and other areas of Natural Language Processing, or NLP. “This is the machine we're using to train these big self-supervised deep neural networks that we're using as platforms for language technologies and vision and speech technologies,” Scott explained in the press briefing.

In a Build keynote, OpenAI CEO Sam Altman showed off a system trained by Microsoft’s newly open sourced Turing natural language generation model, which can write code based on verbal descriptions.

“You're able to describe to the system code that you would like to be written, and then it is able to synthesize the code for you from the description,” Scott said.

Enormous Compute Scale for Enormous Models

The Turing model (also used in Bing and Office) has 17 billion parameters. Future models will have “hundreds of billions and eventually trillions of parameters,” Luis Vargas, Microsoft technical advisor to the CTO, predicted in the keynote. The ability to work with such large models will be limited not by how much data you have to train a model with but by the infrastructure you have access to – like the OpenAI supercomputer.

“This looks very much like the real deal, from a massive-AI perspective,” Nick McQuire, senior VP at CCS Insight, told DCK.

“What we are seeing is that the advances in machine learning, and particularly NLP, are rapid, and models are getting larger, more complex, and pulling on a higher level of infrastructure,” he said. “When you combine techniques like transfer learning with converging AI disciplines like speech and vision – as the AGI research field that Open AI focuses on does – the need for the type of computing power that Microsoft is talking about becomes significant.”

bld20 microsoft turing.png

bld20 microsoft turing

Azure has all that, plus reliability and green credentials. “These supercomputers have to be carbon-neutral, given the power consumption we are talking about,” McQuire said. Microsoft CEO “Satya Nadella’s carbon pledges very much tie into this. Additionally, they need to be proven to work reliably at scale, which is why Microsoft’s Azure footprint – and above all SaaS applications that consume AI resources of this magnitude – are critical.”

Those points will matter to customers when the market is ready for AI applications that operate at such scale. But getting to that point could take some time.

Few organizations beyond OpenAI (and Microsoft itself) have an immediate need for AI supercomputers of this scale. McQuire suggested that Microsoft could be sending signals about its capabilities relative to its rivals to the AI developer community. “I think there is a perception that Google is leading the arms race in AI based on its raw capabilities and talent, with DeepMind for instance. But this area is now being driven by Microsoft and OpenAI.”

Subscribe to the Data Center Knowledge Newsletter
Get analysis and expert insight on the latest in data center business and technology delivered to your inbox daily.

You May Also Like