What the Arrival of AI Hardware Means for Your Data Center
The Data Center Podcast: Charlie Boyle, general manager of the Nvidia DGX unit, on AI hardware and data centers.
Four years ago, when Charlie Boyle was about a year into his job running the unit of Nvidia that makes and sells full AI hardware solutions, many IT and data center managers were intimidated by this new class of hardware.
Nvidia DGX systems – essentially AI supercomputers – are large, powerful, and gold colored. The people who would have to support them in their data centers were worried about the hardware’s power density and intimidated by the presence of InfiniBand (an interconnect technology from the supercomputer world). Generally, they thought the systems would take some learning to get a handle on, and who in IT has time for that?
So, the conversations Boyle had with customers often started with him explaining that these systems were not only nothing to fear but were designed to require no more IT staff hours than their typical enterprise servers, and possibly fewer.
“To an IT administrator, it’s just a bigger Linux box,” he said. “They shouldn’t be scared of it. It’s not esoteric.”
We recently interviewed Charlie Boyle, an Nvidia VP and general manager of the company’s DGX Systems unit, for The Data Center Podcast. We talked about what the arrival of AI hardware means for an IT organization and its data center staff, the role liquid cooling will inevitably play in data centers of the future, Nvidia’s own experience operating a massive AI hardware cluster across multiple data centers, AI computing infrastructure in the cloud vs. on-prem vs. in colo, and more.
Listen to the entire conversation here or wherever you listen to podcasts, and be sure to subscribe:
Nvidia’s internal DGX cluster, referred to as “Saturn 5,” has now grown to about 3,000 systems. And it’s managed by “very few IT staff,” Boyle told us. The IT organization is “smaller than you would expect for an organization that’s running thousands of high-end servers. Because at the end of the day, they’re all the same. It’s an IT administrator’s dream.”
AI Hardware in the Data Center
While for IT staff a DGX system is “just a bigger Linux box,” this infrastructure doesn’t necessarily mean business-as-usual for data center managers.
“Empty is the new green,” is how Nvidia spins the fact that a typical enterprise data center will have a lot of empty rack space around each DGX system deployed, because there likely won’t be enough power in a single rack to support more than one of these GPU-stuffed boxes.
“Empty” and “green” in this scenario means that while you’re left with a lot of empty space, the space that isn’t empty is occupied by a box that has enough compute to replace thousands of Intel x86 pizza-box servers, Boyle said.
A few years ago, he and his colleagues liked to say that a single DGX replaced a hundred or a few hundred x86 boxes. They no longer say that. “We don’t even make those comparisons anymore, because the number is into the thousands at this point,” he said.
Yes, a single DGX may take a lot more power than a few pizza-box servers occupying the equivalent amount of space, he said, but “with good planning, it doesn’t really matter… as long as you can fit at least one of these systems in a rack.”
The more space-efficient ways to deploy this hardware involve higher rack densities, including, ultimately, racks using some flavor of liquid cooling. The latter is something everyone designing a data center today should at least plan for being able to add in the future, Boyle advised.
Listen (and subscribe!) to The Data Center Podcast with Nvidia’s Charlie Boyle below, on Apple Podcasts, Spotify, Google Podcasts, or wherever you listen to podcasts.
PODCAST TRANSCRIPT: Charlie Boyle, VP and GM Nvidia DGX Systems
Yevgeniy Sverdlik, Data Center Knowledge:
Hey, everybody. Welcome to the Data Center Podcast. This is Yevgeniy, editor in chief of Data Center Knowledge. Today we're lucky to have with us Charlie Boyle. He is a VP at Nvidia. There, he runs the DGX systems unit. That unit makes and sells Nvidia supercomputers for AI, lots of really cool stuff we're going to talk about today. Charlie, thank you so much for talking to us today.
Charlie Boyle, Nvidia:
Thank you and glad to be here.
Yevgeniy Sverdlik, Data Center Knowledge:
Let's give our listeners a bit of your background. You came to Nvidia five years ago, from some LinkedIn investigation that I've done, after six years at Oracle and you ended up at Oracle because Oracle acquired Sun Microsystems, your previous employer. Is that correct?
Charlie Boyle, Nvidia:
Yeah. Yeah. I've been in data center infrastructure servers, data center system management, for a lot of years at this point. I don't want to date myself too much but before Sun, I was running service provider data centers and doing engineering R&D for them and then started doing products at Sun, which transitioned over to Oracle and then, obviously, with Nvidia, we had a great opportunity to build AI systems from the ground up as a complete solution and that's really what I like to do is deliver a full solution to a user that's going to solve their problems and in a way that's different than what they've got today.
Yevgeniy Sverdlik, Data Center Knowledge:
Yeah. Let's rewind a little bit. How did you end up at Sun?
Charlie Boyle, Nvidia:
Back in my service provider days, as we were running what you would call managed services today for web hosting, ran data centers for thousands of servers worldwide and my data center platform was kind of split between Sun and Solaris boxes and at the time, this will date me a little bit, Windows NT boxes. I knew a lot of folks at Sun because we hosted thousands of Sun servers and had some of the biggest brands on the internet at the time running Sun and Solaris servers.
Charlie Boyle, Nvidia:
I had moved out to California for a startup to do something interesting in the voice data center space. Then as I was at a joint Sun event, they were promoting a new platform, a data center management platform, if you will, back in the day and this would have been 2002, that sounded really interesting and revolutionary and I knew enough people at Sun, I phoned up a few people and said, "Hey, that thing that this engineering VP just announced, I don't know him but I'm interested in it. Do any of my friends know this guy?" Some of my good friends at Sun were like, "Oh, yeah, I know him well. I'm his marketing counterpart, I'm his product counterpart."
Charlie Boyle, Nvidia:
We made introductions and joined Sun to work on converged data center platform software, at the time, and had a great career there doing that, doing system management, virtualization, and then also eventually running the Solaris product at Sun that at the time was widely used across the entire high-end Unix space.
Yevgeniy Sverdlik, Data Center Knowledge:
That was, the product that we're talking about, that made you curious about joining Sun was the converged data center?
Charlie Boyle, Nvidia:
Yeah. At the time, it's a product that probably nobody currently has ever heard of. It was called N1 was the initiative. We had a couple acquisitions. We learned a lot. Back in the day, it was a grand vision of a single product managing everything in the data center from servers, network, storage and an easy control plane. I think it was a little early in its time back in the early 2000s. That's a really tough problem to crack.
Charlie Boyle, Nvidia:
We learned a lot along the way from our customers and eventually spun out a number of smaller scope products that really helped customers and then leading to a bunch of data center management setups, including virtualized systems and, at the time, how do you use the built-in virtualization inside of the Solaris operating system better across the data center?
Yevgeniy Sverdlik, Data Center Knowledge:
Similar in spirit to what the actual converged systems that came after that became hot and then morphed into hyper converged [crosstalk 00:04:28].
Charlie Boyle, Nvidia:
Yeah.
Yevgeniy Sverdlik, Data Center Knowledge:
Perhaps a bit early for ...
Charlie Boyle, Nvidia:
Yeah. That was a bit early. A lot of that foundation became the converged systems that we were working on at Oracle, things like Exadata, Exalogic, the private cloud system that they had. All of those were hyper-converged systems but just in a much larger form [inaudible 00:04:49]. Those were rack level designs but to solve a very specific customer problem.
Yevgeniy Sverdlik, Data Center Knowledge:
You're not the first person that I interview and then I start researching them and it turns out they came up through Sun, ended up at Oracle, and so to ask this question, how did things change at work once Oracle took over Sun? How did things change for you?
Charlie Boyle, Nvidia:
Of any acquisition, there were positives and negatives. I really appreciated Oracle's sales culture, they want to win, they were very aggressive. Lots has been written of the Sun going down in the final years. Oracle really had a passion for getting things out to their customers and really monetizing things well.
Yevgeniy Sverdlik, Data Center Knowledge:
They have the business end of things down.
Charlie Boyle, Nvidia:
As an Oracle customer, lots of times people say they monetize things a little too well but at the same time, it was a very strong culture but they had a very strong engineering presence as well. Lots of bits and pieces of things that used to be Sun got assimilated into Oracle. I actually moved ... When I joined, I didn't join the hardware division at Oracle because I was responsible for Solaris virtualization and system management. I joined the software team, which was a core part of Oracle so, to me, it was just like, "Okay, I'm just a part of core Oracle now" and it was a positive experience in a lot of ways and, of course, there's a lot about culture and other things. It fit fine for me. Some people didn't like it as much but, for me, it was fine.
Yevgeniy Sverdlik, Data Center Knowledge:
You led development of converged infrastructure products at Oracle. You're now running the AI infrastructure business at Nvidia. How, maybe in general terms, this may sound like a weird question to you but do your best, how is the world of AI infrastructure, that business, that ecosystem, different from the world of converged infrastructure, maybe the more traditional hardware world?
Charlie Boyle, Nvidia:
At the base, they share a lot of the same fundamental things. I mean, the reason that people like converged infrastructure is from whatever partner you're buying it from, whether it's converged storage or converged compute solution, the value to the end user is it's as easy as possible to get the most benefit out of it. You don't spend a lot of time researching hardware, setting things up, updating things.
Charlie Boyle, Nvidia:
The vendor has done all that work for you and when we started DGX, and I can't say that I started DGX, the product was under development before I joined Nvidia, I joined probably three weeks before the initial product launch, the vision and part of what attracted me to Nvidia was the vision was very similar, which was we didn't want to build a server. We wanted to build the best AI system for users at the time and back five years ago, really nobody knew what dedicated AI infrastructure was.
Charlie Boyle, Nvidia:
You had a lot of people experimenting, you had a lot of folks in research and one of the things that we noticed internally at Nvidia is even our own experts in the space, everyone has something different underneath their desk, they were trying different things and Jensen's vision was very straightforward is there's one thing that is the absolute best to do AI work and I want to build that and I want to show the market that so that everyone can learn from it and we can expand the overall AI ecosystem.
Yevgeniy Sverdlik, Data Center Knowledge:
How did you end up at Nvidia? How did that transition happen?
Charlie Boyle, Nvidia:
It was through mutual friends and colleagues, some of which were already at Nvidia, and they said, "Hey, we're doing something new here." Nvidia had always been known as a fabulous technology company but more of a chip company at the time. We sold data center infrastructure but it was chips and boards to OEMs and the gaming side, of course, through our AIC partners but as we wanted to bring new technology to the market in a rate that was potentially faster than traditional system builders were comfortable doing those things, we said, "If we're going to tell the market that AI is real and that you should invest in it now, we need to show them how to do that. We need to teach them how to build a world class AI system" and at the time, Nvidia was just announcing its new Pascal generation of GPUs, which had a lot of never before seen technology. It was the first time you saw NVLink, a private interconnect to link all these different GPUs together.
Charlie Boyle, Nvidia:
There was also a new form factor. Of course, everyone is familiar with the PCI form factor. You put those cards in the system. In order to get the power of NVLink to connect all the cards together it was actually a different physical form factor. It's what we call SXM. It's a rectangular module. You've seen all the pictures of it. But that form factor needed a new system to be built around it. You couldn't plug it into a standard PCI server so we had to teach the world how to build this new form of server that combined the fastest GPUs with this new NVLink technology and that's what became the DGX-1 product.
Yevgeniy Sverdlik, Data Center Knowledge:
Was that what drew you to Nvidia, an interesting product you wanted to get involved in?
Charlie Boyle, Nvidia:
It's a combination of a lot of things. You know, interesting product, interesting space, also a great team. As anyone would tell you after you've been in a career for a while, the team is as important as the work that you're doing and Nvidia just had a great engineering culture, a great innovative culture. I've looked for that in all the companies that I've worked with. The team was great there. I talked to them about what they were doing, why they were doing it. It just felt like a natural fit.
Charlie Boyle, Nvidia:
As I said, I had some friends that were working there, some of them had been working there for a few months before I got there, some had been there for a few years and all of them said this is a great place to work if you're used to X company back in this point in time, they're very much like other places that I had worked. It really had that spirit of just everyone working together for a common goal to get something done and to do something that hadn't been done before.
Charlie Boyle, Nvidia:
That's kind of one of Jensen's mandates to all of us is if we're going to invest time in doing something, it has to be both hard, and it has to be something that only we can uniquely do and it also has to be fun to work on. You meet those three tests for a product and it's a great opportunity.
Charlie Boyle, Nvidia:
I started as an army of one but quickly hired a lot of great folks that I knew through the industry and that I had worked for before and fast forward five years, we're on our third-generation product and have a great team behind it.
Yevgeniy Sverdlik, Data Center Knowledge:
There's a lot of attention being paid to AI. Obviously, it's a world-changing technology, there's a lot of optimism and a lot of concerns about its implications and the basic question is how do we make sure it's not built early on in ways that will cause irreparable harm to society. How do we make sure it's not putting minorities at a disadvantage or how do we prevent it from being abused by military? How do we prevent it being abused by deep fakes, things like that?
Yevgeniy Sverdlik, Data Center Knowledge:
As someone who lives in the world of computing infrastructure that enables this technology, do you feel a sense of responsibility for ensuring we don't get it wrong or do you not worry about those things because your focus isn't on the actual software that trains the AI models and makes those potentially consequential decisions?
Charlie Boyle, Nvidia:
Well, I mean, as a consumer of AI technology, we all interact with AI technology on the consumer level on a daily basis. I hear those concerns, but the thing that I find great hope in is all of the developers, the ecosystem partners, everyone that helps make AI a reality, they're all thinking about these things too, trying to take out bias from systems, trying to make sure that AI can really help you.
Charlie Boyle, Nvidia:
The whole goal and you've heard Jensen say it a lot of times but I think it really comes to heart is AI isn't out there to replace people. It's to make people better. It's to help people along so that your life can be easier, your life can be better, you can get better access to information.
Charlie Boyle, Nvidia:
Back when we were all traveling and you're sitting on a phone call on hold with an airline because your flights got canceled, well, if the AI is better, it should just rebook your flights because it knows your travel pattern and everything and I shouldn't have to sit on the phone with an airline for an hour.
Charlie Boyle, Nvidia:
There's a lot of good things in it. There's a lot of people out there that will point out it has potential to do bad things. The same things have been said when the computer industry just started, "Oh, you're going to put all these people out of work." All these things. Look at the massive economic gain that has come out of just everyone having access to a computer system. I have high hopes for AI and I think the right people are thinking about the right areas to help make sure there's bounding boxes around things, to make sure that there's human in the loop when you're facing critical and difficult decisions.
Yevgeniy Sverdlik, Data Center Knowledge:
The obligatory chip shortage question, how has the chip shortage been affecting the DGX group, the group you're in charge of?
Charlie Boyle, Nvidia:
It impacts everyone. If I had a hat on, I would take my hat off right now to our Nvidia operations team. They are I think ... Not I think. I know they're the best that I've ever worked with. While there are shortages, obviously, we build our own [inaudible 00:14:59] so I'm not short of those for the very large scale systems but it's little things. It's like a resistor, it's a little transistor somewhere, it's a power module but I think coming into Nvidia and understanding how well our operations staff plans, plans years in advance on stuff, they've really been able to protect us on the things that we can be protected by.
Charlie Boyle, Nvidia:
I can say for DGX supply, while there's a lot of hard work going on behind the scenes to make it all smooth and perfect, we haven't been impacted at the output end of that. That doesn't mean that we haven't had to do a lot more work but the team has been excellent at their craft to make sure that we always have second source, we always have alternative components. We've got months and months of planning behind every build of systems that happened in our factories for these things because at the end of the day, we're not delivering a part to a customer with DGX. We're delivering a whole solution, a whole system that's got software, it's got storage and it's got networking.
Charlie Boyle, Nvidia:
We can't be short one screw in the system. One screw means I can't ship the system to a customer so the team just does an excellent job forecasting, planning for that. It's really great to see that whole process and be part of it to make sure that we can deliver to our customers what they need.
Yevgeniy Sverdlik, Data Center Knowledge:
You're saying it hasn't caused any delays in shipments but there's been a lot of worrying and headaches and hard work behind the scenes to make sure ...
Charlie Boyle, Nvidia:
Yeah. I mean, our goal is to shield our customers from that. They expect a complete product from us. We plan well with them. That's one of the things I really like about working on this type of product is even though all of our systems are sold through partners and channels and those things, we have a lot of direct visibility with our customers. They know what our lead times are.
Charlie Boyle, Nvidia:
I'm very happy to report all throughout the chip shortage and coronavirus crisis, we've still standardized on our lead time and haven't extended that so our customers know when AI is successful for them and, say, they've started out with just a handful of DGX systems, when they come back to me and say, "Hey, Charlie. I need 40, I need 100, and I need them next quarter", I can still deliver that stuff in standard lead time.
Yevgeniy Sverdlik, Data Center Knowledge:
Okay. Let's talk about data centers. You like to say that your data center AI infrastructure product design benefits from Nvidia engineers using Nvidia hardware internally. How big is that internal AI computing infrastructure at Nvidia now? How many data centers?
Charlie Boyle, Nvidia:
Oh, how many data centers?
Yevgeniy Sverdlik, Data Center Knowledge:
How do you quantify how much power [inaudible 00:17:56]?
Charlie Boyle, Nvidia:
I can quote you the number of DGXs. We're probably close to 3000 DGXs deployed internally across the range of DGX-1s, DGX-2s and the current system, the DGX A100. I don't really count the DGX Station numbers in there because we have those sitting under people's desks and cubes so they're not in the data center.
Charlie Boyle, Nvidia:
In the data center, for the data center infrastructure that we refer to as Saturn 5, which is all of the Nvidia clusters put together, that's around 3000 systems. It's in multiple data centers. Off the top of my head, I don't know exactly how many. They're mostly all around Santa Clara. I think there's some development that's out of state, probably in Nevada, at this point, but we're probably more than five, less than 10 data centers. We try to keep a lot of our systems together because you get economies of scale but you also get ... As you put more and more of these AI systems together, in close proximity, you have a lot more flexibility for your users.
Charlie Boyle, Nvidia:
One of the things that Jensen talked about in the keynote was these massive new models, whether it's recommender or speech. Some of these things that would take hundreds of machines to train ... It would still take them two weeks to train even with a couple hundred DGXs. You need all those systems physically close to each other because you couldn't put those, say, 500 systems in 100 different data centers. It wouldn't be efficient because all of those systems talk to each other over a local high speed InfiniBand network. You want to have centers of gravity around your AI system deployments.
Yevgeniy Sverdlik, Data Center Knowledge:
You're splitting workloads across multiple systems, right?
Charlie Boyle, Nvidia:
Right. I mean, those same 3000 or so systems, all of those things are accessed by everyone inside of Nvidia. One of the interesting things is just by being an Nvidia employee, you automatically have access to up to 64 GPUs in the data center. Now some groups, like our self-driving car and language model teams, those teams have hundreds, if not thousands, of GPUs but that's why Jensen made the investment many years ago.
Charlie Boyle, Nvidia:
We started with 125 DGX-1s four plus years ago and as soon as we turned on that centralized infrastructure, it was instantly full and so then we kept adding and adding and adding to it because we found the value of having a centralized infrastructure means that people don't have to worry about when am I going to get access? How long is it going to take me to do something? If the systems are available, they can get their work done and they can expand and contract as they need it. We've put a lot of work together to make it easy for our internal users and that same technology, that tooling, that information flow, actually makes it into the product so all of our customers outside of Nvidia can use it.
Yevgeniy Sverdlik, Data Center Knowledge:
These clusters sit in co-location facilities?
Charlie Boyle, Nvidia:
Yes. I can't say Nvidia owns no data center space because, of course, we run our GFN network all over the world for gaming but for our DGX systems, outside of the ones that are in labs in Nvidia buildings, all the large clusters are in co-location because that's not ... While we have strict design specifications and we push our co-location partners to the art of the possible and push the limits on power and cooling, it's not our core business to build data centers. We let the experts build that and we give them all of our requirements to push not only for what they're building today but what they're building in years to come.
Yevgeniy Sverdlik, Data Center Knowledge:
Is Nvidia the world's largest user of DGX?
Charlie Boyle, Nvidia:
No.
Yevgeniy Sverdlik, Data Center Knowledge:
No?
Charlie Boyle, Nvidia:
It's not. I have customers that have larger deployments than we do.
Yevgeniy Sverdlik, Data Center Knowledge:
Okay. Can you share how many or who they are or anything about them?
Charlie Boyle, Nvidia:
I can't. Unfortunately, that's the power of when you have very large customers. You can search through some public stuff on some large customers that we've announced over the years but there are a handful of customers that do have more internally than we do.
Yevgeniy Sverdlik, Data Center Knowledge:
This internal cluster, the first internal cluster you mentioned there were 125 DGX computers, about five years ago, you deployed it. How did that project come about? Why did you guys decide to do that?
Charlie Boyle, Nvidia:
That was really based on an early Jensen meeting about how do we get our own internal users something better? Because as he looked across the company, and I can take no credit for this, I just helped implement it, there were all these requests for a work station here, a server there, from all different parts of the company that was trying to do something with AI and we knew we wanted to centralize things but all of us that had been in the IT industry for a long time, going from when people owned systems, "This is my system. It sits underneath my desk" or it's in this one rack space, "This is mine" to go from that to something that is a centralized shared resource, every IT organization struggles with that. How do you do that? How do you get your users to move? How do you motivate them to move so that you can decommission the ineffective barely used systems that they do have?
Charlie Boyle, Nvidia:
We just made a very simple proclamation backed up by Jensen's investment in it is we're going to give you ... To all of our users in the company, we're going to give you something that is so much better than what you have access to today in your lab or under your desk, that there's no reason you'd want to keep doing it the old way.
Yevgeniy Sverdlik, Data Center Knowledge:
No brainer.
Charlie Boyle, Nvidia:
Like I said, once we turned it on, I think it was turned on on something like a Thursday or a Friday and people started to know about it. By Monday morning when people came in, it was already 100% utilized. Then we needed to apply a bunch of software and other access controls to it because early on, the guy that got in Monday morning and launched 1000 jobs took up most of the cluster. Then we said, "Okay, it's great it's getting used. Now we need to make sure that it's used fairly across the company" and we started writing software, writing quotas, writing queuing, but it's super successful.
Charlie Boyle, Nvidia:
Anyone inside the company, you want to try an experiment, you've got a great idea, can try something, no approval. Like I said, up to 64 GPUs, that's up to eight DGXs but for the big stuff that we do, if we say, "Hey, we could really make a difference on a medical language model and we need 1000 GPUs to try it on for two weeks", that's something we can do. Having that type of infrastructure really unlocks the imagination of the data scientists we have, the researchers we have, to push the limits of what's possible and that's what you see us ... Not only did we have some hardware announcements at GTC but most of Jensen's announcements were actually software-based and that's because we can do that work and we have the infrastructure.
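The quota idea Boyle describes here (a default per-person GPU limit, larger allocations for some teams, and a finite shared pool) can be illustrated with a minimal sketch. This is not Nvidia's scheduler; the class name GpuQuota, its methods, and the admission rule are hypothetical, and only the 64-GPU default and the 8-GPUs-per-DGX ratio come from the interview.

```python
import math

DEFAULT_GPU_CAP = 64  # per-employee limit mentioned in the interview
GPUS_PER_DGX = 8      # implied by "up to 64 GPUs, that's up to eight DGXs"


class GpuQuota:
    """Hypothetical tracker of per-user GPU usage against a cap."""

    def __init__(self, total_gpus, caps=None):
        self.free = total_gpus
        self.caps = caps or {}  # optional larger caps for specific users or teams
        self.in_use = {}

    def request(self, user, gpus):
        """Admit a job only if it fits both the user's cap and the free pool."""
        cap = self.caps.get(user, DEFAULT_GPU_CAP)
        used = self.in_use.get(user, 0)
        if gpus <= 0 or used + gpus > cap or gpus > self.free:
            return False
        self.in_use[user] = used + gpus
        self.free -= gpus
        return True

    def release(self, user, gpus):
        """Return GPUs to the shared pool when a job finishes."""
        gpus = min(gpus, self.in_use.get(user, 0))
        self.in_use[user] = self.in_use.get(user, 0) - gpus
        self.free += gpus


def dgx_systems_needed(gpus):
    """Number of 8-GPU systems required to host a given GPU count."""
    return math.ceil(gpus / GPUS_PER_DGX)


if __name__ == "__main__":
    quota = GpuQuota(total_gpus=128, caps={"language-team": 1000})
    print(quota.request("alice", 64))  # fits the default cap
    print(quota.request("alice", 8))   # would exceed alice's cap
    print(dgx_systems_needed(1000))    # systems behind a 1000-GPU request
```

A real cluster scheduler would also queue rejected jobs and weigh priorities; the sketch only shows the cap check that keeps one early user from taking the whole cluster.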
Charlie Boyle, Nvidia:
What I've experienced over the past number of years is a lot of our customer base doing the same thing. They're going from disparate infrastructure, one group gets to buy a DGX here, another group gets two there, to saying, "You know what? I just want to buy" what we call a SuperPOD that starts with 20 nodes and up, "And I'm just going to use it as a centralized resource for my company and when I need more, I just add to it. I don't need to go give Bob two more or Lisa two more. Just, 'Okay, if the company needs four more, just put them in our cluster.'" and that's the great thing about the designs that we've come up with. It just scales linearly. You need more systems? We've got a road map that goes from two systems to thousands of systems all in a standardized data center deployment.
Yevgeniy Sverdlik, Data Center Knowledge:
You're now at several thousand DGXs being used internally. What are some of the biggest lessons you guys have learned about running and using AI hardware at scale in data centers over the last five years?
Charlie Boyle, Nvidia:
A lot of it has been around software development to make it easy to run very large scale AI jobs. If you're running on a single GPU, even a single system, there's really no software work to do. You run your [inaudible 00:26:31], you run your TensorFlow, whatever framework you're going to use, and you're done. There's nothing that needs to happen.
Charlie Boyle, Nvidia:
When you start doing what we call multi node training, scaling that training across lots of systems, 200, 400 systems, there's things that you need to learn in the way you launch those jobs, monitor those jobs, what happens if you're doing a two week training run and one system of your 100 system cluster goes down?
Charlie Boyle, Nvidia:
A lot of our enhancements to software that we released to customers through our NGC repository, we've taken all of our own internal lessons learned on how to optimize those things and just how to operate those.
Charlie Boyle, Nvidia:
A lot of what we deliver is knowledge, not even just software. We help teach our customers, okay, if you're going to try to run thousands of systems and you need to run a job that's going to run for an hour on those thousand systems, that's a fairly ... While it's a big system, it's a fairly simple thing. You don't need to think about a lot of things. For automotive training, they may have jobs that run over hundreds of systems that last for three weeks.
Charlie Boyle, Nvidia:
Now you don't want one node in that system on your third week to destroy your entire training run so there's things like checkpointing, there's saving various things, saving various states. As you're running very long-term jobs, you know just like standard statistics, averages, after so many days, weeks, or months, you're going to have a system hiccup when you're running hundreds or thousands of systems.
Charlie Boyle, Nvidia:
We do a lot to help provide the software, the settings, the tools, so that customers can feel safe running very long running jobs over hundreds of systems without worrying that a single failure in the cluster is going to take out your three weeks of work.
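The checkpointing Boyle refers to can be sketched in a few lines. This is a toy illustration in plain Python, not Nvidia's tooling: the "training step" is a stand-in that adds a number to a running total, the file format is JSON, and the function names and the fail_at parameter are hypothetical. The point it shows is that a job which saves state periodically can restart from the last save instead of from step one.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def train(path, num_steps, checkpoint_every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint.

    fail_at simulates a node failure by raising before that step runs.
    """
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, num_steps + 1):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        state["total"] += step  # stand-in for one real training step
        state["step"] = step
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "run.json")
        try:
            train(ckpt, num_steps=10, checkpoint_every=3, fail_at=8)
        except RuntimeError as err:
            print(err)
        # The second launch picks up at step 7, after the step-6 checkpoint.
        print(train(ckpt, num_steps=10, checkpoint_every=3))
```

In the example run, the failure at step 8 costs only step 7, which had not yet been checkpointed; the work through step 6 is kept.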
Yevgeniy Sverdlik, Data Center Knowledge:
That's an interesting point because kind of the traditional thinking in the HPC world, which I know is different from the AI world but it's very closely related, is that infrastructure isn't as mission critical as, say, a bank's data center and so you don't need to build as much redundancy on the data center side of things but what you're saying is, "Well, if you have a long running training job, that's very mission critical and nothing can fail there" or that whole run is screwed.
Charlie Boyle, Nvidia:
Yeah. The high availability is different, because a lot of people come from an enterprise background. Obviously, I was at Oracle. Your average database transaction is sub-second, so if you have a system failure and you've got to fail over and retry a couple of transactions that were sub-second, no big deal. You didn't lose anything.
Charlie Boyle, Nvidia:
If you're looking for a singular answer at the end of multiple weeks, you need to build some redundancy into the software layers so that things can retry and reset. Those are the knobs that we help teach our customers to [inaudible 00:29:34], to say, "What's your risk exposure? How often do you want to basically recollect all your data and say I'm at a checkpoint here?" Some people checkpoint once a day because it's fine for them if they lose work. If they rerun a day's worth of work on a node, it's not that big of a deal. Some customers checkpoint every hour. It's really up to their risk profile and how much they think it's going to impact the overall run.
Charlie Boyle, Nvidia:
Of course, if you're running on hundreds of systems and you lose one hour's worth of data on one system, your overall training time isn't going to go up that much, but if you're running hundreds of systems and you lose an entire 24 hours, that's a bigger impact. That's why we have that close relationship with our customers. As a DGX customer, I tell everyone, you're part of the family now, you're part of my family, and you can call us up and ask us data center questions like: What's the optimal way to cool my data center? How do I do [inaudible 00:30:32]? What do you recommend? All the way through to: What's the right way to run a PyTorch job on 200 DGXs for a massive NLP language model?
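The trade-off between checkpoint frequency and risk can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a model of any real cluster: it assumes failures are independent, that a failure costs at most one checkpoint interval of rework, and it uses a made-up failure rate of 0.1% per system per day purely for illustration.

```python
def p_any_failure(systems, days, p_fail_per_system_day):
    """Probability of at least one failure, assuming independent failures."""
    return 1 - (1 - p_fail_per_system_day) ** (systems * days)


def worst_case_rework_fraction(checkpoint_interval_hours, run_hours):
    """Share of the run redone if a failure hits just before the next checkpoint."""
    return checkpoint_interval_hours / run_hours


RUN_DAYS = 21              # a three-week training run
RUN_HOURS = RUN_DAYS * 24  # 504 hours

# Hypothetical failure rate, chosen only to make the arithmetic concrete.
risk = p_any_failure(systems=100, days=RUN_DAYS, p_fail_per_system_day=0.001)
print(f"chance of at least one failure during the run: {risk:.0%}")

for interval in (1, 24):
    pct = 100 * worst_case_rework_fraction(interval, RUN_HOURS)
    print(f"checkpoint every {interval:>2} h: up to {pct:.1f}% of the run redone per failure")
```

Under these assumptions a failure somewhere in a 100-system, three-week run is more likely than not (about 88%), while the cost of each failure is about 0.2% of the run with hourly checkpoints versus about 4.8% with daily ones.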
Charlie Boyle, Nvidia:
Half of our ... It's not an exact number, but I would venture to say half of the people in our technical SA community have advanced degrees, up to PhDs. I was shocked, coming from Oracle and Sun, where we had excellent technical field people, to talk to the average technical field person at Nvidia and find they've got a PhD in linear algebra or computational microscopy or other things. It's just a fabulous team with just an immense amount of knowledge.
Yevgeniy Sverdlik, Data Center Knowledge:
Hypothetically, I'm an IT or data center manager at a corporation, some unit in my company wants to start training and deploying AI models. They haven't figured out what infrastructure that they're going to use, maybe they're going to use Cloud or something else. Help me understand what's the spectrum of scenarios of how this affects me as a data center manager, as an IT manager.
Charlie Boyle, Nvidia:
Yeah. There is a broad spectrum of how people do AI. Lots of people start in the Cloud and Nvidia has its top end GPUs in every single Cloud and if you're just trying, experimenting things, you don't know what's going to work yet and it's easy enough for you to get your data into the Cloud, that's definitely a great way to get started.
Charlie Boyle, Nvidia:
It's a lower cost entry point in a lot of ways but what we generally find from customers is as they get serious about AI training, doing that bit of work, that they want something that's closer to them, closer to their data. Now that brings up the data gravity question, if a customer comes to me and says, "Hey, Charlie, I've been an online business since we started the business and all of my data is in Cloud", well, you should probably do your AI in the Cloud at that point.
Charlie Boyle, Nvidia:
For the majority of our customers, they have a fair amount of data that they want to use that's central to their enterprise. It's in a 10 year data store they have, it's their CRM record for the past 20 years, it's customer behavior from the last five and that's generally somewhere on prem. When I say on prem, that doesn't mean they own their own data centers. Lots of them are in co-los. It's in a facility that they control the data on.
Charlie Boyle, Nvidia:
We always advise people move the AI compute to your data, move it as close as possible because you're going to spend so much time pulling the data into the AI infrastructure to do training, to do inference, that it doesn't make sense to keep moving it back and forth between the Cloud and on prem or vice versa.
Charlie Boyle, Nvidia:
From an IT administrator's perspective, one of the two scenarios that come up often as I talk to higher-level executives in companies is that their users are starting to work in the Cloud and train in the Cloud, and lots of times they'll come back and say, "Wow. I got my quarterly bill and it was pretty high. I didn't expect that," because AI has an unlimited appetite for compute. So as an IT administrator, to get ahead of it, if your users are starting out in the Cloud, educate them on Cloud usage. Make sure that when they're done using something, they turn the instance off. If your job is going to take eight hours and it's going to end at 5:06, don't wait until the morning to turn it off, because you're burning valuable time.
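To put a number on the "don't wait until the morning" advice, here is a small sketch of the idle-time arithmetic. Everything specific in it is an assumption for illustration: it reads "5:06" as 5:06 p.m., assumes the instance is shut down at 9:00 the next morning, and uses a made-up rate of $30 per hour rather than any real cloud price.

```python
from datetime import datetime


def idle_cost(job_end, shutdown, hourly_rate):
    """Hours an instance sat idle after the job finished, and what that cost."""
    idle_hours = (shutdown - job_end).total_seconds() / 3600
    return idle_hours, idle_hours * hourly_rate


# Hypothetical dates and rate, for illustration only.
job_end = datetime(2021, 3, 1, 17, 6)   # job finishes at 5:06 p.m.
shutdown = datetime(2021, 3, 2, 9, 0)   # instance turned off the next morning
hours, cost = idle_cost(job_end, shutdown, hourly_rate=30.0)
print(f"idle for {hours:.1f} hours, costing ${cost:.2f}")
```

In this scenario the instance sits idle for 15.9 hours, nearly twice as long as the eight-hour job itself ran.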
Charlie Boyle, Nvidia:
On the on-premises side, one of the things that we've really tried to do with the DGX platform is make it easy for IT. At the end of the day, people look at the systems and they say, "Wow. This is a big system. It's six rack units. It's five to six kilowatts. This is something I'm not used to." At the end of the day, it's a Linux box. Our positioning, and the way people use our systems, is about delivering a complete solution, but to an IT administrator, it's just a bigger Linux box. We use standard Linux distributions. We do all the software QA on all of the additional things that we put on top of the Linux layer, and we fully support that for the IT department, but they shouldn't be scared of it. Whether they buy the DGX or they buy an OEM server platform using Nvidia GPUs, they can operate it the same way. [crosstalk 00:35:07].
Yevgeniy Sverdlik, Data Center Knowledge:
If I'm the data center manager and, say, this is being deployed in my on prem data center, should I be scared of it? Maybe there's a power density [crosstalk 00:35:17].
Charlie Boyle, Nvidia:
You absolutely shouldn't be scared of it. That's just planning. One of the things that we've talked about and you may have seen some of our material out there is empty is the new green. When people say, "Oh, one of these DGX boxes or one of these OEM servers is five kilowatts and I only have 10 kilowatts available in my rack or I have 12 kilowatts or seven", they say, "I've got all this empty space."
Charlie Boyle, Nvidia:
As a long-term data center guy myself, you look at empty space and you're like, "I'm not using my space well," but the amount of acceleration and computation that you get out of these systems replaces so many of your standard [inaudible 00:36:01] pizza boxes. If you think back four or five years, we were making comparisons like one DGX-1 replaces a few hundred Intel pizza box servers. We don't even make those comparisons anymore because the number is into the thousands at this point.
Charlie Boyle, Nvidia:
If you're doing AI training and even AI inference, you're using GPUs, the systems are going to be more powerful but it's not that it's something completely different. It's just I've consolidated multiple racks worth of infrastructure into one server. Okay, yeah. That one server takes up more power but with good planning, that doesn't really matter to people as long as you can fit at least one of these systems in a rack.
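The planning question raised above, how many of these systems fit in a rack with a given power budget, comes down to simple division. The sketch below uses the figures mentioned in the conversation (roughly 5 kW and six rack units per system, and racks with 7, 10, or 12 kW available); the 42U rack height and the function name `systems_per_rack` are assumptions added for illustration.

```python
def systems_per_rack(rack_kw, system_kw):
    """Whole systems that fit within a rack's power budget."""
    return int(rack_kw // system_kw)


SYSTEM_KW = 5   # per-system power figure from the conversation
SYSTEM_U = 6    # rack units per system, from the conversation
RACK_U = 42     # assumed rack height; not stated in the conversation

for rack_kw in (7, 10, 12):
    n = systems_per_rack(rack_kw, SYSTEM_KW)
    used_u = n * SYSTEM_U
    print(f"{rack_kw:>2} kW rack: {n} system(s), {used_u}U of {RACK_U}U occupied")
```

Power runs out long before space does: even the 12 kW rack holds two systems in 12U, leaving most of the rack empty, which is the situation the "empty is the new green" remark refers to.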
Charlie Boyle, Nvidia:
Like I said, that's why we have such great OEM partners as well. We only build one DGX system. It's turned up to 11. It's the best. Our OEM partners do build smaller systems in cases where people need different density, different power levels, but regardless of whether you buy mine or whether you buy an OEM server, as an IT person you absolutely shouldn't be scared of this. If you're scared of it then you haven't looked into it enough. Call us. Like I said, even if you don't buy mine, we'll still give you the advice on what the right way to do it is.
Yevgeniy Sverdlik, Data Center Knowledge:
Do you run into IT people who are "scared" of it?
Charlie Boyle, Nvidia:
Four years ago, yes. I'm seeing a lot less of that now. I think people understand it well enough. Does an IT person need to know NVLink gen three? No. They're never going to program it, they're never going to touch it. Their users aren't going to program and touch it. It's software that we do.
Charlie Boyle, Nvidia:
Four or five years ago, when we first introduced this box and it had eight InfiniBands on it and it was gold and it weighed 100 pounds and took three kilowatts, people were like, "I've never seen something like this before." But as AI has become more mainstream and they've looked into it, most IT people are like, "Yeah. It's running standard Linux." That's one of the things we've done in the product over time. When we first launched the product, we had our own OS, which was [inaudible 00:38:16] a bunch of Nvidia-specific stuff, and that was because we just wanted a simple experience for all of our end users. Then we got a lot of feedback from IT organizations. They said, "We get it, it's the best performance, it's everything, Nvidia has done all the work to put it together, but we're really a Red Hat shop and we're a one Linux shop. How are you going to help us?"
Charlie Boyle, Nvidia:
In year two, we introduced Red Hat support and that's just as good, obviously. You'd have to have a Red Hat license for that but that's something we've made easy for users. We continue to put out new software, new scripts to make setup and configuration easy so after you get past the power issue, a lot of people say, "You've got InfiniBand. I don't run InfiniBand in my data center." Well, InfiniBand is just the compute fabric to connect the DGX together. You don't have to manage it at all. It's a physical layer connection. It shouldn't be scary for you. It's just part of the solution.
Charlie Boyle, Nvidia:
Once people look at that, and like I said, four years ago, three years ago there was definitely more concern with people. I spent a lot more time explaining this to IT folks and after you explain it to them one on one, they're like, "Oh, yeah. I get it. It's not that difficult." A lot of it is because we've published papers, published information on how we do it and we do it with very few IT staff. We don't have a massive IT organization running these 3000 systems. For the amount of compute power those things put out to the amount of IT staff [crosstalk 00:39:46].
Yevgeniy Sverdlik, Data Center Knowledge:
How big is that organization?
Charlie Boyle, Nvidia:
I don't know off the top of my head but it's smaller than you would expect for an organization running thousands of high end servers. At the end of the day, they're all the same. That's what really helps our customers with our design is that every single DGX that I've ever shipped out of the factory, I have hundreds, if not thousands, of those exact same systems at the exact same software level running internally.
Charlie Boyle, Nvidia:
As an IT administrator, the number one thing that caused me pain in the past, and I'm sure causes my brethren pain today, is when you've got lots of servers that are different, fundamentally different, like I can't use the same SBIOS, I can't use the same BMC settings on these because some application wants something different. Every single DGX we ship is exactly the same and can be updated the same way. To operate them, if you take a very big picture view, it's an IT administrator's dream. It's a completely homogeneous system.
Charlie Boyle, Nvidia:
Now, users and applications on top of that, we all deal with those things, but from a basic IT perspective, it's actually fairly simple to update. We put a ton of engineering into making the system easy to update. We publish a single package that updates absolutely everything on the system, all the firmware, the SBIOS, all the settings that you need, so you're not sitting there like, "Oh, I've got to run this patch, this other patch. Do I need this patch?" It's just one update container we give you a few times a year and it just goes [crosstalk 00:41:23].
Yevgeniy Sverdlik, Data Center Knowledge:
Let's talk about the future a little bit. Companies have been able to get away, as you've mentioned, with hosting AI hardware in the data centers or in co-los they have without adding fancy things like liquid cooling, by basically just spreading the hardware across the data center floor, if they have enough data center floor. Do you see a point in the future where that will no longer be an option and the hardware will use so much power that it will have to be liquid cooled no matter how much data center space you have available to you?
Charlie Boyle, Nvidia:
For the foreseeable future, I believe we'll always have an air option. There will be some designs that will be more optimal for liquid cooling but that's part of my job, the team's job is to give customers the right type of designs for where the market is.
Charlie Boyle, Nvidia:
Right now, we don't sell any liquid-cooled designs. In the future, I definitely see that, but it doesn't mean you have to go back to the old mainframe days. I can tell you, in the very first data center that I was retrofitting to do hosting, back in the mid '90s, when we pulled up the floor tiles there were a ton of liquid pipes because it used to be an old mainframe data center. Then the world moved away from that because that was a very complicated infrastructure, but liquid cooling has come a long way, and so I think in the future people will have a lot of viable options, whether you just want to do air-only, mixed air and liquid, or facility liquid.
Charlie Boyle, Nvidia:
I see things coming forward where you can have a very high powered server go into a local liquid loop heat exchange or somewhere else in your data center without needing to retrofit your entire data center for facility water. I would advise most customers, if you were building your own facility moving forward and you're not going to go co-lo, you should at least have a plan for some liquid ... Whether you're using our stuff or not, I think that is going to be a way to get greater efficiency in your cooling but it is a fairly big uplift if it's the first time you're doing it or if you're trying to retrofit something, it's a big uplift. If you're planning it in from day one, it's not actually that expensive to do it.
Yevgeniy Sverdlik, Data Center Knowledge:
Words of wisdom from Charlie Boyle, guys. If you're building a data center today, make sure it can do liquid cooling.
Charlie Boyle, Nvidia:
Yeah. That doesn't necessarily mean you've got to run all that infrastructure right now but, like I said, we've got a great set of co-location partners around the world. We've been working with them for years now. I think almost all of them will have some liquid cooling capabilities in the future but in a lot of cases, it's in a plan meaning when I need it, I put the chillers here, I put the pipes there, I don't need to rip up everything.
Charlie Boyle, Nvidia:
Even as an end user if you were building your next large corporate data center, have a plan that you can accommodate space for that type of equipment. It doesn't mean that you've got to spend the capital infrastructure to do it today but you'd feel bad in three years and your data center is half full and you've realized you need liquid cooling and you've got to rip up a bunch of stuff. It's probably better to start that planning.
Charlie Boyle, Nvidia:
There are lots of experts across the industry to talk to if you're thinking about putting in liquid cooling or partial liquid cooling. A lot of customers probably won't need to have liquid cooling everywhere in their data center, because the world is not going to be there for a very long time, but you'll need pockets of it in places. Figure out a plan in your own mind of, "Okay, I'm going to dedicate this corner of my data center, because that's closest to the wall, to do liquid-cooled infrastructure." It's common sense things like that.
Yevgeniy Sverdlik, Data Center Knowledge:
That's one of the reasons, if you like, that co-lo is really the natural choice for all of this stuff. They give you the quality infrastructure. You don't have to manage it in most cases. They ensure you have expansion headroom. They have lots of locations, and all of that is built into their business model. Charlie, do you think co-location facilities will be the primary way enterprises will house their AI infrastructure in the future for all those reasons?
Charlie Boyle, Nvidia:
You know, I would think so in a lot of ways. Just like our own journey at Nvidia, we're doing very well as a company. We have the capital, we can build our own data centers but we go to co-lo. Why do we do that? Because it's their core business to do these things.
Charlie Boyle, Nvidia:
Now we do push them and we give them requirements and really high powered targets because we want to push limits but they are the experts in this and we work hand in hand with our partners so that they understand where we are today and targets where we may be in a few years so [inaudible 00:46:12] plan for those things.
Charlie Boyle, Nvidia:
I think a lot of folks have ... If you've got a corporate data center today, you should always think about what do I need physically close to me? Like I said, a number of our DGX systems are actually right around our headquarters inside of Santa Clara but then we have other stuff that's further away and the stuff that's further away is stuff we just know we never have to physically touch that often.
Charlie Boyle, Nvidia:
As a corporate user, you should always look at what are the things that IT needs to touch a bunch? Keep that close. What are the things that I need great performance and resilience and all the great things that co-los bring me that I need to be able to drive to and what are the things that I should cost optimize to say this is tried and true technology, I don't mind if it's a state away or two states away or three states away because I don't need to touch it that often and the co-location provider has fabulous services. If I'm ever in a state that I do need someone to physically touch the box, that's standard in co-lo today. You don't have to ask for those services extra. That's usually built into your contracts.
Yevgeniy Sverdlik, Data Center Knowledge:
Okay. That's all I have. Charlie, this has been great. Thank you so much.
Charlie Boyle, Nvidia:
Thanks so much for having me.