Few people on the planet know more about building computers for Artificial Intelligence than Rob Ober. As the top technology exec at Nvidia’s Accelerated Computing Group, he’s the chief platform architect behind Tesla, the most powerful GPU on the market for Machine Learning, which is the most widespread type of AI today.
GPUs, or Graphics Processing Units, take their name from their original purpose, but their applications today stretch far beyond that. Supercomputer designers have found them ideal for offloading huge chunks of workloads from CPUs in the systems they build; they’ve also proven to be super-efficient processors for a Machine Learning approach called Deep Learning. That’s the type of AI Google uses to serve targeted ads and Amazon Alexa taps into for instantaneous answers to voice queries.
Creating algorithms that enable computers to learn by observation and iteration is undoubtedly complex. Just as complex is designing the computer systems that execute those algorithms and the data center infrastructure that powers and cools those systems. Ober has seen this firsthand working with Nvidia’s hyper-scale customers on their data center systems for deep learning.
“We’ve been working with a lot of the hyper-scales – really all of the hyper-scales – in the large data centers,” he said in an interview with Data Center Knowledge. “It’s a really hard engineering problem to build a system for GPUs for deep learning training. It’s really, really hard. Even the big guys like Facebook and Microsoft struggled.”
Big Basin, Facebook's latest AI server. Each of the eight heat sinks hides a GPU. (Photo: Facebook)
It Takes a Lot of Power to Train an AI
Training is one type of computing workload involved in deep learning (or rather a category of workloads, since the field is evolving, and there are several different approaches to training). Its purpose is to teach a deep neural network -- a network of computing nodes aiming to mimic the way neurons interact in the human brain -- a new capability from existing data. For example, a neural net can learn to recognize dogs in photos by repeatedly “looking” at various images that have dogs in them, where dogs are tagged as dogs.
The other category of workloads is inference, which is where a neural net applies its knowledge to new data (e.g. recognizes a dog in an image it hasn’t seen before).
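The split between the two workloads can be illustrated with a toy example. The sketch below is not a deep neural network: it is a single artificial "neuron" (logistic regression) written in plain Python, and the two-number "image" features and the function names are made up for illustration. Training loops over labeled examples many times and adjusts the parameters; inference is a single pass over data the model has not seen.

```python
import math

def predict(weights, bias, features):
    """Inference: apply already-learned parameters to one example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid: score between 0 and 1

def train(examples, labels, epochs=500, lr=0.5):
    """Training: repeatedly 'look' at labeled examples and nudge the parameters."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(examples, labels):
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

# Hypothetical features standing in for tagged photos; 1 = dog, 0 = not a dog.
examples = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = [1, 1, 0, 0]

weights, bias = train(examples, labels)               # the expensive, iterative part
unseen_dog = predict(weights, bias, [0.85, 0.9])      # inference on new data
unseen_other = predict(weights, bias, [0.15, 0.1])
print(f"dog-like input: {unseen_dog:.2f}, other input: {unseen_other:.2f}")
```

Even in this toy, training makes thousands of passes through `predict` while inference makes one, which hints at why training hardware is the harder data center problem.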
Nvidia makes GPUs for both categories, but training is the part that’s especially difficult in the data center, because hardware for training requires extremely dense clusters of GPUs, or interconnected servers with up to eight GPUs per server. One such cabinet can easily require 30kW or more -- power density most data centers outside of the supercomputer realm aren’t designed to support. Even at that low end of the range, about 20 such cabinets need as much power as the Dallas Cowboys jumbotron at AT&T Stadium, the world’s largest 1080p video display, which contains 30 million lightbulbs.
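As a rough back-of-the-envelope check on those figures, the arithmetic can be written out. The 30kW and 20-cabinet numbers come from the paragraph above; the function name is an illustrative choice, and the result is only the low-end estimate the article implies, not a measured value.

```python
CABINET_KW = 30      # low-end per-cabinet figure quoted above
CABINET_COUNT = 20   # "about 20 such cabinets"

def cluster_power_kw(cabinets, kw_per_cabinet):
    """Total power draw of a group of identical cabinets, in kilowatts."""
    return cabinets * kw_per_cabinet

total_kw = cluster_power_kw(CABINET_COUNT, CABINET_KW)
print(f"{CABINET_COUNT} cabinets x {CABINET_KW} kW = {total_kw} kW ({total_kw / 1000} MW)")
# prints: 20 cabinets x 30 kW = 600 kW (0.6 MW)
```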
“We put real stresses on a lot of data center infrastructure,” Ober said about Nvidia’s GPUs. “With deep learning training you typically want to make as dense a compute pool as possible, and that becomes incredibly power-dense, and that’s a real challenge.” Another problem is controlling the voltage in these clusters. GPU computing, by its nature, produces lots of power transients (sudden spikes in voltage), “and those are difficult to deal with.”
Interconnecting the nodes is another big challenge. “Depending on where your training data comes from it can be an incredible load on the data center network,” Ober said. “You can be creating a real intense hot spot.” Power density and networking are probably the two biggest design challenges in data center systems for deep learning, according to him.
Tesla P100, Nvidia's most powerful GPU (Image: Nvidia)
Cooling the Artificial Brain
Hyper-scale data center operators – the likes of Facebook and Microsoft – mostly address the power density challenge by spreading their deep learning clusters over many racks, although some “dabble” in liquid cooling or liquid-assist, Ober said. Liquid cooling is when chilled water is delivered directly to the chips on the motherboard (a common approach to cooling supercomputers), while liquid-assist cooling is when chilled water is brought to a heat exchanger attached to an IT cabinet to cool air that is then pushed through the servers.
Not everybody who needs to support high-density deep learning hardware has the luxury of hundreds of thousands of square feet of data center space, and those who don’t, such as the few data center providers that have chosen to specialize in high density, have gone the liquid-assist route. Recently, these providers have seen a spike in demand for their services, driven to a large extent by the growing interest in machine learning.
Both startups and large companies are looking for ways to leverage the technology that is widely predicted to drive the next big wave of innovation, but most of them don’t have the infrastructure necessary to support this development work. “Right now the GPU-enabled workloads are the ones where we’re seeing the largest amount of growth, and it’s definitely the enterprise sector,” Chris Orlando, co-founder of high-density data center provider ScaleMatrix, said in an interview. “The enterprise data center is not equipped for this.”
That spike in growth started only recently. Orlando said his company has seen a hockey stick-shaped growth trajectory with the knee somewhere around the middle of last year. Other applications driving the spike have been computing for life sciences and genomics (one of the biggest customers at ScaleMatrix’s flagship data center outside of San Diego, a hub for that type of research, is the genomics powerhouse J. Craig Venter Institute), geospatial research, and big data analytics. In Houston, its second data center location, most of the demand comes from the oil and gas industry, whose exploration work requires some high-octane computing power.
Another major ScaleMatrix customer in San Diego is Cirrascale, a hardware maker and cloud provider that specializes in infrastructure for Deep Learning.
Inside ScaleMatrix's data center in San Diego (Photo: ScaleMatrix)
Each ScaleMatrix cabinet can support up to 52kW by bringing chilled water from a central plant to cool air in the fully enclosed cabinet. The custom-designed system’s chilled-water loop is on top of the cabinet, where hot exhaust air from the servers rises to get cooled and pushed back over the motherboards. Seeing growing enterprise demand for high-density computing, the company recently started selling this technology to companies interested in deploying it in-house.
Colovore, a data center provider in Silicon Valley, also specializes in high-density colocation. It is using the more typical rear-door heat exchanger to provide up to 20kW per rack in the current first phase, and 35kW in the upcoming second phase. At least one of its customers is interested in pushing beyond 35kW, so the company is exploring the possibility of a supercomputer-like system that brings chilled water directly to the motherboards.
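To see why per-cabinet power limits matter, consider how many cabinets a fixed load would occupy at each of the densities quoted above. This is a simplified sketch: the 600 kW load is an illustrative figure (20 cabinets at the 30kW low-end number cited earlier), and it assumes the load can be divided freely across cabinets, which real deployments may not allow.

```python
import math

LOAD_KW = 600  # illustrative cluster load, not a figure from either provider

# Per-cabinet limits quoted in the article
limits_kw = {
    "Colovore phase one": 20,
    "Colovore phase two": 35,
    "ScaleMatrix": 52,
}

def cabinets_needed(load_kw, limit_kw):
    """Smallest whole number of cabinets that can carry the load."""
    return math.ceil(load_kw / limit_kw)

for name, limit in limits_kw.items():
    print(f"{name}: {limit} kW per cabinet -> {cabinets_needed(LOAD_KW, limit)} cabinets")
```

Under these assumptions the same load needs 30 cabinets at 20kW, 18 at 35kW, and 12 at 52kW, which is the floor-space trade-off that separates spreading a cluster out from cooling it more aggressively.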
Today a “large percentage” of Colovore’s data center capacity is supporting GPU clusters for machine learning, Sean Holzknecht, the company’s co-founder and president, said in an interview. Like ScaleMatrix, Colovore is in a good location for what it does. Silicon Valley is a hotbed for companies that are pushing the envelope in machine learning, self-driving cars, and bioinformatics, and there’s no shortage of demand for the boutique provider's high-density data center space.
A look beneath the floor tiles at Colovore displays the infrastructure to support water cooled doors. (Photo: Colovore)
Demand for AI Hardware Surging
Demand for the kind of infrastructure Colovore and ScaleMatrix provide is likely to continue growing. Machine learning is only in the early innings, and few companies outside of the large cloud platforms, the likes of Google, Facebook, Microsoft, and Alibaba, are using the technology in production. Much of the activity in the field today consists of development, but that work still requires a lot of GPU horsepower.
Nvidia says demand for AI hardware is surging, a lot of it driven by enterprise cloud giants like Amazon Web Services, Google Cloud Platform, and Microsoft Azure, which offer both machine learning-enhanced cloud services and raw GPU power for rent. There’s hunger for the most powerful cloud GPU instances available. “The cloud vendors who currently have GPU instances are seeing unbelievable consumption and traction,” Nvidia’s Ober said. “It really is telling that people are drifting to the largest instances they can get.”