When is the Best Time to Retire a Server?

There’s no magic number for the length of the hardware refresh cycle that works for everyone, but the set of variables that together determine the ideal time to replace a server is fairly uniform across the board. Identifying those variables and analyzing how they relate to one another is a task Amir Michael and his team at Coolan, a data center hardware operations startup, recently took on.

Everything that has to do with managing and designing data center infrastructure cost-efficiently has occupied Michael for many years now. After five years as a hardware engineer at Google, he spent four years working on hardware and data center engineering teams at Facebook. While at Facebook he co-founded the Open Compute Project, the open source hardware and data center design initiative.

He and two colleagues founded Coolan in 2013 with the idea of using their years of experience with web-scale data center infrastructure to help other types of data center operators run their infrastructure more efficiently and cost-effectively.

In a recent blog post, Michael outlined the basics for calculating the best time to retire a server. Sometimes because of tight budgets and sometimes because it’s hard to predict demand for IT capacity, companies wait too long to replace aging hardware and pay penalties in hidden costs as a result.

One of Michael’s customers told him their company had some servers that were more than eight years old. “We’re just keeping it around because it’s there, and it’s an easy thing to do,” goes the typical explanation, Michael said in an interview. “It’s hard to think about all the different factors that go into making this decision.”

There is always a point in time at which holding on to a server becomes more costly than replacing it with a new one. Finding out exactly when that point comes requires a calculation that takes into account all capital and operational expenditures associated with owning and operating that server over time.

According to Michael, the basic factors that should go into the calculation are:

Cost of servers
Data center CapEx
    Cost of cluster infrastructure and UPS
    Cost of network equipment
    Cost of data center racks and physical equipment
Data center OpEx

Other considerations, such as the increase in failure rates as hardware ages, were deliberately left out of the analysis.

The full breakdown of how all the factors combine over time to create a clear picture of the total cost of ownership is in Michael’s blog post. Essentially, the idea is that as hardware gets better, you can do more with fewer boxes, but that doesn’t mean replacing those boxes as soon as new ones come out will add up to a lower TCO. For any given fleet, there is a point at which CapEx and OpEx intersect in a way that makes it more cost-effective to upgrade, but, as in the scenario he outlined, it’s usually not the first year.

In the hypothetical example Michael used, which applied Coolan’s TCO model to a fleet of storage servers with a total capacity of 100 PB, the six-year total cost was close to $8 million higher if the servers were replaced with newer, more efficient boxes after one year than if the owner held on to them for the entire six years. The gap narrowed to about $3 million with a refresh after two years and disappeared completely if the servers were replaced after three years. In other words, the six-year TCO for the same storage capacity was the same whether or not the servers were replaced after three years.
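The shape of that comparison can be sketched in a few lines of Python. This is a simplified illustration, not Coolan’s model: only the 100 PB capacity and the six-year window come from the example above. The per-server capacity, price, yearly operating cost, and rate of density improvement are made-up placeholders, and data center CapEx is folded into the per-server yearly cost for brevity, so the output will not match the dollar figures Michael reported.

```python
from dataclasses import dataclass
from math import ceil
from typing import Optional


@dataclass
class Scenario:
    capacity_tb: float = 100_000            # 100 PB, from the article's example
    horizon_years: int = 6                  # six-year window, from the article
    tb_per_server: float = 100.0            # hypothetical density today
    server_price: float = 10_000.0          # hypothetical price per server
    opex_per_server_year: float = 4_000.0   # hypothetical power, space, support
    density_growth: float = 1.3             # hypothetical yearly density gain


def servers_needed(s: Scenario, years_from_now: int) -> int:
    """Servers required to hold the full capacity using hardware
    bought `years_from_now` years into the window."""
    density = s.tb_per_server * s.density_growth ** years_from_now
    return ceil(s.capacity_tb / density)


def fleet_tco(s: Scenario, refresh_after: Optional[int] = None) -> float:
    """Total cost over the window. With refresh_after=None the original
    fleet is kept throughout; otherwise it is replaced once, after that
    many years, by a smaller fleet of denser servers."""
    n_old = servers_needed(s, 0)
    if refresh_after is None:
        return n_old * (s.server_price + s.opex_per_server_year * s.horizon_years)
    if not 0 < refresh_after < s.horizon_years:
        raise ValueError("refresh_after must fall inside the window")
    n_new = servers_needed(s, refresh_after)
    years_left = s.horizon_years - refresh_after
    old_cost = n_old * (s.server_price + s.opex_per_server_year * refresh_after)
    new_cost = n_new * (s.server_price + s.opex_per_server_year * years_left)
    return old_cost + new_cost


if __name__ == "__main__":
    scenario = Scenario()
    hold = fleet_tco(scenario)
    print(f"hold for {scenario.horizon_years} years: ${hold:,.0f}")
    for year in range(1, scenario.horizon_years):
        delta = fleet_tco(scenario, year) - hold
        print(f"refresh after {year} year(s): {delta:+,.0f} vs. holding")
```

Changing the placeholder values moves the break-even year, which is the point of running the numbers rather than guessing: an early refresh pays for two fleets in quick succession, while a later one buys denser hardware but leaves fewer years of lower operating cost to recoup the purchase.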

Newer servers are denser, so you need fewer of them, which means you spend less in OpEx. If you keep old servers for too long, your OpEx, while staying at approximately the same level, starts to provide diminishing returns, and it becomes cheaper to replace them than to keep supporting them.

The problem of holding on to servers for too long looks even bigger when you consider that companies are not only supporting underperforming machines; many also have servers in their data centers that don’t run any useful workloads. According to recent research by TSO Logic, a company that also looks at efficiency and cost of IT operations, together with Stanford University research fellow Jonathan Koomey, about 30 percent of servers deployed worldwide do not do any computing, representing about $30 billion worth of idle assets.

Coolan’s TCO model for hardware is available for free on the company’s website (Google Docs spreadsheet). As Michael put it in his blog post, aging infrastructure costs more than many people think, but deciding when is a good time to spend the capital on new hardware doesn’t have to be a guessing game.

“With each new generation of hardware, servers become more powerful and energy efficient,” he wrote. “Over time, the total cost of ownership drops through reduced energy bills, a lower risk of downtime, and improved IT performance.”
