What Enterprise Data Center Managers Can Learn from Web Giants

This month, we focus on the open source data center. From innovation at every physical layer of the data center coming out of Facebook’s Open Compute Project to the revolution in the way developers treat IT infrastructure that’s being driven by application containers, open source is changing the data center throughout the entire stack. This March, we zero in on some of those changes to get a better understanding of the pervasive open source data center.

Here’s part two of our interview with Amir Michael, who spent most of the last decade designing servers for some of the world’s biggest data centers, first at Google and then at Facebook. He was one of the founders of the Open Compute Project, the Facebook-led open source hardware and data center design community.

Today, Michael is a co-founder and CEO of Coolan, a startup that aims to help data center operators make more informed decisions about buying hardware and make their data centers more efficient and resilient using Big Data analytics.

Read the first part of our interview with Amir Michael here.

Data Center Knowledge: How did the idea to start Coolan come about?

Amir Michael: My team built large volumes of servers while at Facebook, hundreds of thousands of them. As we built them, we put them in the data center and then turned around and started working on the next generation of design and didn’t really look back to see how decisions we made during the design actually panned out operationally.

We made a decision to buy premium memory and paid more for that because we thought it wouldn’t fail. We made certain design decisions that we thought would make the system more or less reliable at a cost trade-off, but never actually went back and measured that.

And we’re always making decisions around what kinds of components or system to buy and trying to decide if we pay more for an enterprise type of component, or maybe we can do with a consumer type of component. New technology, especially new technology entering the data center, doesn’t have good information around reliability. You don’t have a track record around that.

When I was at Facebook, I started to look back and say, “Hey, so what were the operational costs of all these decisions we made?” And we didn’t have a lot of data. I started talking to peers in the industry and said, “Let’s compare notes. What does your failure rate look like compared to mine?” And there wasn’t a lot of information there, and a lot of the people in this industry aren’t actually measuring that.

The idea for Coolan is to create a platform that makes it very easy for people to share data about their operation, about failure rates, about quality of components, about errors that they’re generating, about the environments that their servers are running in, both utilization and also the physical environment around them, and make that as easy as possible to do so people can have this rich data set that we collect for them and analyze.

Once you have this large data set, not only are we measuring and benchmarking someone’s infrastructure, we can now allow them to compare themselves to their peers. Your failure rate is lower, and here’s why it is: because you’re running at optimal temperature, your firmware is the latest version, and it’s more stable. Now that we have this type of comparison, we add a whole new layer of transparency into the industry, where people are making decisions based on actual data, informed decisions, not trying to guess what component is right for them.

Once you have that, you’ll quickly understand which vendors are right for you, which ones are not right for you, and you’re making much more informed decisions about this large amount of capital you’re about to deploy.

It adds a whole new layer of transparency to the industry, which I desperately wanted when I was at Facebook. I wanted to know if I should go to vendor X or Y. I didn’t have information, and when you ask [vendors] about quality of the product, you didn’t get a good answer. They gave you some mathematical formula they used to calculate [Mean Time Between Failures], but it didn’t actually correlate to what was in the field.
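The gap Michael describes between a vendor's calculated MTBF and what happens in the field can be illustrated with a small sketch. It assumes a constant failure rate (the usual exponential model) to turn a quoted MTBF into an expected annualized failure rate, and compares that with a rate measured from a fleet. The function names and all figures below are hypothetical and do not come from Coolan or any vendor.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def afr_from_mtbf(mtbf_hours):
    """Expected annualized failure rate implied by a quoted MTBF.

    Assumes a constant failure rate (exponential lifetime model),
    which is a simplification.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def observed_afr(failures, device_hours):
    """Failures per device-year actually seen in the fleet."""
    device_years = device_hours / HOURS_PER_YEAR
    return failures / device_years


# Hypothetical numbers: a quoted MTBF of 1,000,000 hours versus
# 30 failures across 1,000 devices that each ran for one year.
quoted = afr_from_mtbf(1_000_000)
measured = observed_afr(30, 1000 * HOURS_PER_YEAR)
print(f"implied by MTBF: {quoted:.2%}, measured in the field: {measured:.2%}")
```

With these made-up inputs the measured rate is several times the rate the data sheet implies, which is the kind of mismatch that only shows up when operators record failures themselves.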

DCK: The concept of reliability in the data center industry generally revolves around hardware, be it electrical and mechanical infrastructure or IT equipment. In the web-scale world, there is more focus on writing software that can withstand hardware failure. Is physical infrastructure redundancy on its way to obsolescence?

AM: My theory is that the most expensive way to build reliability is through hardware. If you’re going to use things like redundant UPSs, redundant power supplies, redundant fans, anything where you’re adding extra physical components, that’s a very expensive proposition, and actually in some ways reduces reliability.

RAID cards are a great example. Do you want to back up your storage? Do you want to be able to sustain drive failure? So, you’re going to add a new component in line. Well, guess what. That component fails too. RAID cards fail fairly often too. And what does that mean? In some cases you’ve actually reduced the reliability of your system, because now, if your RAID card fails, you’re not losing one drive, you’re losing all drives behind that.
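The RAID card point is a series-reliability argument: the card sits in line, so the array is only reachable if the card works. A rough sketch, assuming independent failures, a two-drive mirror, and made-up annual failure probabilities, shows when the extra component makes things worse than a single bare drive. It deliberately ignores rebuild windows, correlated failures, and replacement times.

```python
def p_loss_single_drive(p_drive):
    """Chance of losing the data on one bare drive in a year."""
    return p_drive


def p_loss_mirror_behind_card(p_drive, p_card):
    """Chance of losing access to a two-drive mirror behind a RAID card.

    The data stays reachable only if the card survives AND at least one
    of the two mirrored drives survives (failures assumed independent).
    """
    p_both_drives_fail = p_drive * p_drive
    return 1.0 - (1.0 - p_card) * (1.0 - p_both_drives_fail)


# Hypothetical annual failure probabilities, for illustration only.
p_drive = 0.02
for p_card in (0.001, 0.03):
    bare = p_loss_single_drive(p_drive)
    mirrored = p_loss_mirror_behind_card(p_drive, p_card)
    verdict = "better" if mirrored < bare else "worse"
    print(f"card p={p_card}: mirror {mirrored:.4f} vs bare {bare:.4f} ({verdict})")
```

Under these assumptions the mirror helps only while the card is more reliable than the drives it protects; once the card's failure probability exceeds the drive's, the "redundant" setup is the less reliable one.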

But I understand why people do that. Old applications weren’t designed for scale-out deployments. They weren’t designed to sustain system failure, so oftentimes you have critical systems where if they go down, you lose an entire application.

The solution wasn’t to modify the software; it was to modify the hardware behind it, which was the more expensive route. Today, any modern software architecture assumes system failure, because you have to. Because no matter how reliable the system you’re trying to build is, it’s going to fail. It’ll just happen less often, but it will.

And I’ve seen everything fail. Even the most reliable systems fail. So that’s where the thoughts need to be more. How do you build software that’s more resilient, that can withstand system failure? Beyond that, how do you withstand rack-level failure? How do you withstand an entire facility failure, beyond just traditional DR?

Once you get to that, now you can strip out a lot of those redundancies. You’re building a system that’s much more economical, much more efficient, and you’ve done that through software changes. That’s the right way to think about it.

DCK: Hyperscale data center operators tend to be lumped into one group in the press, and the implication is that they design and operate their infrastructure in a certain way that’s pretty much the same across all of them. This isn’t true; there are radical differences between the approaches of Facebook and eBay, for example, with Facebook using comparatively low-power-density designs and eBay going for the maximum density it can get. Is there a set of best practices that’s common among all hyperscale operators?

AM: If you look at hyperscale, there’s a base of best practices that everyone should be doing: containment, very efficient power distribution, efficient power supplies. Those are well-known in the hyperscale space.

There’s a lot of discussion around what is the right density. Across different large infrastructure operators, their environments still vary. You still have some that are in colocations, and maybe they’re leasing the whole building, but they’re still leasing from other operators. And you have people that are going in the middle of nowhere and building data centers for extremely low cost. They’ve minimized the cost per watt of their data center facility.

Those different requirements will cause them to build servers differently. You have one group that builds racks that are fairly low on power densities. And then you have the other extreme, which is people putting a lot of density into the racks, saying racks are expensive, data centers are expensive (and maybe for them, they are). Let’s get as much IT gear in them and utilize them as much as possible.

[High density] has a flipside to it. You have constraints. When you build a server that is very dense, has a lot of components packed into it, it becomes a challenge to cool it. It’s like sipping air through a very thin straw. You’re going to take a lot of energy to do that, whereas, when you have something that’s not as dense, it’s much easier to push air through that.

There’s obviously some sweet spot there, depending on the cost model. Do you want a lot of density so you can amortize your data center across more machines but pay the price as far as cooling goes? Or do you want to have a cheap facility where you can allow yourself to build things that are quite frankly easier to design, not as dense, and much more efficient at cooling? Or you can use things like 1.5U, 2U-tall heat sinks that are extremely efficient and easy to cool and require very little fan power.

Some of the operators, like eBay, fall on the dense side, which creates a lot of challenges, and I don’t know the entire story there, but if I were to compare that to the cost model at Facebook, that type of density wasn’t as attractive to us because of the extra overhead you pay for building something that dense.

You have important parts of the server that actually do work for you: the CPU, the DRAM, the storage. That’s where you want all your energy to go. You want 90 percent of the energy burned by those devices, because those are actually the ones that do work for you. But if you build something that’s dense, you’re now shifting a lot of the energy into the cooling system. That’s not as efficient a system.
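The 90 percent target can be expressed as a simple ratio of power drawn by the components that do work to total server power. The sketch below uses invented wattages for a sparse and a dense design; only the 90 percent figure comes from the interview, and the function name is hypothetical.

```python
def useful_energy_share(cpu_w, dram_w, storage_w, overhead_w):
    """Fraction of server power going to CPU, DRAM, and storage.

    overhead_w covers everything else, such as fans and conversion losses.
    """
    useful = cpu_w + dram_w + storage_w
    return useful / (useful + overhead_w)


# Hypothetical wattages: same compute components, different cooling overhead.
sparse = useful_energy_share(cpu_w=200, dram_w=40, storage_w=30, overhead_w=30)
dense = useful_energy_share(cpu_w=200, dram_w=40, storage_w=30, overhead_w=90)
print(f"sparse design: {sparse:.0%} useful, dense design: {dense:.0%} useful")
```

In this made-up case the sparse design meets the 90 percent mark while the dense one falls to 75 percent, because more of its power budget goes to moving air.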
