Meta Is Building the World’s Fastest AI Supercomputer

The machine will be capable of training AI models on real-world data sourced from the company’s platforms.

January 24, 2022


The m-word

Meta has been involved in AI research for more than a decade. In 2013 it established the Facebook AI Research lab (FAIR), which went on to develop tools for chatbot design, methods for making AI systems forget unnecessary information, and ‘synthetic skin’ that gives robots a sense of touch.

The lab’s most important contribution to the field is undoubtedly PyTorch, an open source deep learning framework that emerged as something of a standard and is now widely used by developers and data scientists across a variety of platforms.

Meta launched its first dedicated AI supercomputer in 2017, built with 22,000 Nvidia V100 GPUs.

That machine is considerably outclassed by its successor, the AI Research SuperCluster (RSC): Meta claims the RSC already delivers three times more performance in large-scale NLP workflows while using less than half of its final hardware footprint.

The first phase of the project consists of 760 Nvidia DGX A100 server systems with a total of 6,080 GPUs, connected using Nvidia’s Quantum 200 Gb/s InfiniBand fabric.

The storage tier is equipped with 185PB of all-flash storage from Pure Storage and 46PB of cache storage spread across Penguin Computing Altus servers. Training data is delivered through FAIR’s own purpose-built storage service, the AI Research Store (AIRStore).

Once the RSC is complete, the same InfiniBand fabric will connect 16,000 GPUs, making this the largest DGX A100 deployment to date. It will be served by a caching and storage system with 16 TB/s of bandwidth, and is expected to deliver nearly 5 exaflops of mixed precision compute.
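The hardware figures above can be sanity-checked with simple arithmetic. The sketch below is illustrative only and is not taken from Meta's post: it assumes eight GPUs per DGX A100 system and Nvidia's quoted peak of 312 teraflops of dense FP16/BF16 Tensor Core throughput per A100, and the function names are made up for this example.

```python
# Back-of-the-envelope check of the RSC figures quoted in the article.
# Assumptions (not stated in the article): 8 GPUs per DGX A100 system and
# a vendor peak of 312 TFLOPS dense FP16/BF16 Tensor Core throughput per A100.
GPUS_PER_DGX_A100 = 8
A100_PEAK_TFLOPS = 312


def phase_one_gpus(systems: int) -> int:
    """Total GPUs in a deployment of DGX A100 systems."""
    return systems * GPUS_PER_DGX_A100


def peak_exaflops(gpus: int, tflops_per_gpu: int = A100_PEAK_TFLOPS) -> float:
    """Aggregate theoretical peak in exaflops (1 exaflop = 10**18 FLOPS)."""
    return gpus * tflops_per_gpu * 10**12 / 10**18


first_phase = phase_one_gpus(760)
print(first_phase)             # GPUs in the first phase
print(first_phase / 16_000)    # share of the final 16,000-GPU footprint
print(peak_exaflops(16_000))   # theoretical peak of the completed system
```

Under these assumptions the first phase comes to 6,080 GPUs, or 38% of the final build, which is consistent with the "less than half" remark, and the finished system peaks at roughly 4.99 exaflops, in line with the "nearly 5 exaflops" figure. Sustained training throughput would be lower than this theoretical peak.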

“We wanted this infrastructure to be able to train models with more than a trillion parameters on data sets as large as an exabyte — which, to provide a sense of scale, is the equivalent of 36,000 years of high-quality video,” Facebook’s technical program manager Kevin Lee and software engineer Shubho Sengupta said in a post on the company’s blog.
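The video comparison implies a particular bitrate, which can be worked out directly. The following is a rough sketch and not a calculation from Meta's post: it assumes a decimal exabyte (10^18 bytes) and a 365.25-day year, and the helper name is hypothetical.

```python
# Rough estimate of the video bitrate implied by "an exabyte is 36,000 years
# of high-quality video". Assumptions: decimal exabyte, 365.25-day year.
EXABYTE_BYTES = 10**18
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def implied_bitrate_mbps(total_bytes: float, years: float) -> float:
    """Average bitrate in megabits per second for the given data volume."""
    seconds = years * SECONDS_PER_YEAR
    return total_bytes * 8 / seconds / 1e6


print(round(implied_bitrate_mbps(EXABYTE_BYTES, 36_000), 2))
```

With these assumptions the result is about 7 Mbit/s, which is within the range commonly used for high-definition streaming, so the comparison is plausible.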

Unlike its previous supercomputer, which used only open source and publicly available data sets, Meta’s new machine will use real-world training data obtained directly from the users of the company’s platforms.

For this reason, Meta says the RSC has been designed from the ground up with privacy and security in mind: the supercomputer is isolated from the Internet, with no direct inbound or outbound connections, and traffic can flow only from Meta’s production data centers. User data is anonymized, and the entire data path from the storage systems to the GPUs is encrypted.

“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” Lee and Sengupta said.

“Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.”

The authors said that the RSC will also be used to help better identify “harmful content” – Meta’s recent advances in this area include the introduction of few-shot learning (FSL) to more easily detect posts that attempt to breach its policy in new and unexpected ways.

Supply chain blues

The ongoing chip supply shortage has affected countless infrastructure projects, and the RSC was no exception.

“RSC began as a completely remote project that the team took from a simple shared document to a functioning cluster in about a year and a half,” Lee and Sengupta said.

“COVID-19 and industry-wide wafer supply constraints also brought supply chain issues that made it difficult to get everything from chips to components like optics and GPUs, and even construction materials — all of which had to be transported in accordance with new safety protocols.

“To build this cluster efficiently, we had to design it from scratch, creating many entirely new Meta-specific conventions and rethinking previous ones along the way. We had to write new rules around our data center designs — including their cooling, power, rack layout, cabling, and networking (including a completely new control plane), among other important considerations.”

About the Author(s)

Max Smolaks

Senior Editor, Informa

Max Smolaks is senior editor at Data Center Knowledge, a leading online publication dedicated to the data center industry. A passionate technology journalist, Max has been writing about IT for a decade, covering startups, hardware, and regulation across B2B titles including Silicon, DatacenterDynamics, The Register, and AI Business.

https://www.linkedin.com/in/max-smolaks/
