National Lab Reins in Data Center Management Chaos
The Department of Energy’s Titan supercomputer, one of the HPC systems housed in Oak Ridge National Lab’s data centers, took second place on the 2012 Top500 list (Photo: DOE)

Team operating colo for government research projects transforms the way scientists deploy IT gear

Scott Milliken hates the often-heard saying that people are the biggest reason for data center outages. It’s not people, he says; it’s people who don’t know what they’re doing.

And it’s fair to say that most of the scientists who are customers of the data centers he runs don’t know what they’re doing when it comes to data center management.

Milliken is the computer facility manager at the U.S. Department of Energy’s Oak Ridge National Laboratory in Oak Ridge, Tennessee. Speaking Monday at the Data Center World conference in Las Vegas, he talked about the challenge of managing one of the most chaotic types of data center environments and what he and his team did to rein in the chaos.

The nearly 15-year-old ORNL data center is the polar opposite of the data centers that the likes of Facebook or Google operate. Those hyperscale facilities support extremely homogeneous IT equipment and, as a result, are able to maximize standardization and reach extreme efficiency.

Partly because of the nature of the workloads running in ORNL’s data centers and partly because of the way the government funds its research projects, standardization anywhere close to the level found in hyperscale facilities is simply impossible.

Colo for Government Research

Milliken and his team provide data center services to a large group of users, each responsible for buying IT equipment to support his or her own computing needs. “We were almost like a colo for government research institutions,” Milliken said.

The data center has two stories, 20,000 square feet each. The first floor houses the lab’s three supercomputers, and the second floor is where scientists’ gear lives.

Historically, there wasn’t a formalized process for placing new equipment on the second floor. Researchers used their grant money to buy servers, racks, airflow-management solutions, and in some cases power distribution equipment.

Because grant money is scarce, Milliken’s customers got territorial about the data center space they had been allocated and the equipment they had purchased. For example, if somebody paid for a power-panel upgrade, they were inflexible about who could and could not use the panel.

“Fiefdoms were created and maintained,” he said. “Just kidding. They were not maintained at all.”

And that was one of the biggest problems. Clear documentation and labeling of equipment are crucial to effective data center management, and most of Milliken’s customers weren’t very disciplined about these things.

Chaos is Unsustainable

The chaotic environment frequently led to availability issues, and the data center management team found itself spending much of its time resolving problems. The status quo was clearly unsustainable.

So they decided to make improvements by instituting new processes. One was strict enforcement of documentation.

The other was taking over responsibility for supplying racks, airflow-management equipment, and power distribution equipment. This was a good way to lessen the data center management burden on tenants and to standardize the infrastructure components coming into the facility. The team also realized that it would cost less to pay for the infrastructure equipment than to keep spending long periods of time resolving problems that resulted from operating a chaotic environment.

Starting From Scratch

Very soon, however, the team realized that to really do things right, they needed a whole new data center, so they built one. The new facility came online about one and a half years ago, Milliken said.

Since it launched, no new equipment has gone into the old facility. The team has standardized the way it deploys servers in the new data center to the maximum extent possible given the nature of its clientele.

The team now uses a standard contained pod that includes 22 to 28 cabinets and has in-row coolers and its own electrical circuits. Using this approach has given Milliken and his staff visibility and control of their costs and timelines beyond anything that was possible in the old facility.

There are no more users rolling in data center cabinets of their own. There are no more wildly varying configurations. Expansion has become “predictable and repeatable,” he said.

The old data center will not be decommissioned. As the different pieces of equipment it supports reach their end of life, they will gradually be phased out, and replacements (if necessary) will be installed in the new facility. Once the old facility is empty, Milliken plans to gut it, remodel it into a modern facility, and manage it in the new, more effective way.
