SpaceX Dragon spacecraft passes Dubai in April 2016 (Photo: Tim Peake / ESA/NASA via Getty Images)

Uptime in Space and Under the Sea

The SpaceX Dragon spacecraft that took supplies to the International Space Station last week had something new on board: the first commercial computers headed for space. Usually, computers for space missions are special-purpose hardware, hardened to withstand everything from the g-force of take-off to zero gravity and cosmic radiation (Earth’s atmosphere is the reason servers in your data center aren’t affected by that last one). Hardening a computer for space travel can take years, said Dr. Eng Lim Goh, CTO of SGI, the supercomputer outfit Hewlett Packard Enterprise acquired last year.

“They spend so long hardening for the harsh environment of space that the computers they use are several generations old, so there’s a huge gap in performance,” Goh said in an interview with Data Center Knowledge. “For some missions, you could spend more time hardening the system than you use it for.” That specialized one-off hardware is also expensive and doesn’t let you take advantage of the economies of scale technology typically offers.

Goh is hoping to equip astronauts with the latest available hardware, loaded with standard software for general-purpose computing, plus intelligent, adaptive controls that shift the burden of system hardening from hardware to software. The two water-cooled Apollo 40 servers HPE sent to spend a year on the space station came straight from the factory, with no hardware hardening; they passed the battery of NASA tests required to go into orbit without modification, which suggests they should do well in other difficult locations too.

The Spaceborne Computer, as Goh calls it, is an experiment to discover what impact the harsh environment of space actually has on unhardened hardware, and what you can do in software to reduce that impact. The idea is to lower the servers’ power consumption and operating speed when higher levels of radiation are detected, to see if that’s enough to keep them running. “Can we harden the computer using software? That’s the question we want to answer.”
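To make that idea concrete, here is a minimal sketch of what such a control loop might look like; it is not HPE’s code, just an illustration under assumptions. It watches Linux’s EDAC counter of corrected memory errors (a rough proxy for radiation-induced bit flips) and lowers the CPU frequency ceiling when the error rate spikes. The sysfs paths, frequencies, and threshold are assumed values for a particular Linux host.

```python
# A minimal sketch of "hardening in software" (not HPE's implementation):
# watch corrected memory errors via Linux EDAC, throttle the CPU when the
# error rate spikes, and restore full speed once conditions calm down.
# Paths, frequencies, and the threshold are illustrative assumptions.
import glob
import time

EDAC_CE = "/sys/devices/system/edac/mc/mc0/ce_count"  # corrected-error counter
CPU_CAPS = glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq")
SAFE_KHZ, FULL_KHZ = 1_200_000, 2_400_000  # hypothetical throttled/full speeds
ERRORS_PER_MINUTE = 5                      # hypothetical alarm threshold

def corrected_errors() -> int:
    with open(EDAC_CE) as f:
        return int(f.read())

def cap_frequency(khz: int) -> None:
    for path in CPU_CAPS:  # writing these files requires root
        with open(path, "w") as f:
            f.write(str(khz))

previous = corrected_errors()
while True:
    time.sleep(60)
    current = corrected_errors()
    # Throttle when errors spike; run at full speed when the environment is quiet.
    cap_frequency(SAFE_KHZ if current - previous > ERRORS_PER_MINUTE else FULL_KHZ)
    previous = current
```

A production system would presumably fuse many more signals than one error counter, but the shape of the loop is the same: sense the environment, then trade performance for reliability.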

Hardening systems for difficult environments is becoming a more mainstream issue, Christopher Brown, CTO at the Uptime Institute, a data center industry group, said. As society becomes increasingly dependent on compute and communications technology – satellite communications, GPS, computer-assisted aircraft navigation, and so on — such research grows more relevant in places outside a few niche applications. “It is really moving beyond the fringe of people and groups with very specialized purposes to a point [where] it can impact all people.”

Lessons to Come for Space and Earth Computers

The servers, delivered by SpaceX Dragon, now sit in a locker on the station, connected to power, Ethernet, and the station’s chilled-water system. The locker, by the way, isn’t designed to protect the machines; it’s there just to store them. They have SSDs rather than hard drives, which could be affected by zero gravity and ionizing radiation, and there’s a mix of smaller fast drives and larger slower drives to see which works better in space. The interconnects are InfiniBand, because copper connections could be more vulnerable to radiation than fiber. The team did tweak CPU, memory, and SSD parameters more than usual, but the servers are running a standard build of RHEL 6.8.

General-purpose servers would be useful for future astronauts, so it’s an interesting potential growth market for a company like HPE. “The market isn’t that small if commercial space travel goes the same way air travel has,” Goh pointed out, and space exploration is also where you really need edge computing. If we send an expedition to Mars, the communication delay of up to roughly 20 minutes each way will mean earthbound systems won’t be suitable for any real-time processing like image recognition or predictive analytics.

But the lessons from space will also be useful down here on Earth. HPE hopes to apply what it learns to harsh earthbound environments and, more generally, to teach computers to take better care of themselves. “The high-level goal is to give computers a self-care intelligence that tries to adapt to the environment it’s in through sensors and early warning systems,” Goh said. “Today we set aside some compute cycles for anti-virus; we should also set aside cycles for the computer to care for itself and defend itself. If you have, say, a billion operations per second, are you willing to set aside half a percent for anti-virus and maybe five or eight percent for self-care?”
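On Goh’s numbers, a machine performing a billion operations per second would hand about 5 million of them to anti-virus and 50 to 80 million to self-care. One hedged way to picture that budget, purely as a sketch and not any vendor’s design: a background task that duty-cycles its own health checks so they consume a fixed slice of time. The watched file and the check itself are stand-ins for whatever a real system would monitor.

```python
# Illustrative only: a background "self-care" task that budgets roughly five
# percent of one core's time for an integrity check, in the spirit of Goh's
# suggestion. Re-hashing a file is a stand-in for real health tests; the
# watched path is hypothetical.
import hashlib
import time

BUDGET = 0.05                    # fraction of time to spend on self-care
WATCHED_FILE = "/etc/hostname"   # hypothetical asset whose integrity we track

def digest(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

baseline = digest(WATCHED_FILE)
while True:
    started = time.monotonic()
    if digest(WATCHED_FILE) != baseline:
        print("integrity drift detected; schedule corrective action")
    spent = max(time.monotonic() - started, 1e-6)
    # Sleep so the check consumes roughly BUDGET of total wall-clock time.
    time.sleep(spent * (1 - BUDGET) / BUDGET)
```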

Microsoft Learning in Another Extreme Environment

Those goals are similar to some of the goals of Microsoft’s Project Natick. When the software giant’s researchers put a 42U rack of servers retired from Azure data centers inside a sealed enclosure and sank it to the bottom of the ocean half a mile from shore, one aim was to learn how to speed up data center deployment in any environment.

“Today, it takes a long time to deploy a large data center,” Ben Cutler, one of the Project Natick researchers, told us. “It can take two years, because I’ve got to find some place to put it; I’ve got to get the land; I’ve got to get my permits; I’ve got to build buildings. Even if I have cookie-cutter data centers that I build the same everywhere, I still have to deal with the fact that the land is different, the climate is different, the work rules and the building codes are all different, how the power comes in is different. It just takes a long time.”

Sometimes there’s a spike in demand for cloud services in an unexpected place, and Microsoft wants to be able to respond as quickly as possible, Cutler went on. “Our motivation was, can we develop the ability to deploy data centers at scale, anywhere in the world, within 90 days from decision to power-on?”

 

Project Natick, Microsoft’s experimental underwater data center, being deployed off the coast of California (Photo: Microsoft)

Microsoft has developed the process for installing fully populated racks straight into new Azure data centers for the same reason. “Something logical that we don’t usually do is to treat buildings as a manufactured product,” Cutler pointed out. “With a laptop or a phone, we pretty much know exactly how that’s going to behave, and how much it will cost before we build it; and you can get one quickly because when you order it, it’s pulled off the shelf and shipped somewhere. We want to get the same thing for data centers.”

Designing for Hands-Off Operation

The ocean isn’t as harsh an environment as space, and it can be much milder than dry land, which has hurricanes, temperature swings, and other extreme weather to contend with. That means in the long run it could even be cheaper to make a data center reliable under water than on land. For one thing, cooling could cost only 20 percent of what companies spend today, Cutler said. Today’s data centers mostly rely on air cooling, which means they run relatively warm. “Our hypothesis is that if I have something that’s consistently very cold, then the failure rates are lower.”

Server failure rates take on a whole new level of significance here. Underwater data centers will be sealed units designed to work without maintenance for the life of the servers: five or even ten years. “Historically, failure rates didn’t matter too much if there was going to be a new and better PC every year,” Cutler said. Today, however, hardware isn’t changing as quickly, so the priority shifts to machines that can run reliably over longer periods to keep costs down.

With no humans going inside the unit to do maintenance, a Project Natick chamber is filled with nitrogen and has virtually no humidity inside. Humidity isn’t just bad for hard drives; one of the main causes of data center failure is corrosion of the connectors in the electronics. Over time, moisture gets between the two pieces of metal connecting one device to another, eventually pulling them apart and causing a failure. But you can’t make the air too dry either, because some hard-drive brands use motor grease that contains some moisture. “If you go down below 10-percent humidity, it starts to turn into powder, and then you have another kind of failure.”
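That tension implies a humidity band rather than a single target. As a hedged illustration only, and not a detail from Project Natick, a guard for a sealed enclosure might look like the sketch below; the sensor function and band limits are assumptions.

```python
# Hypothetical humidity guard for a sealed enclosure. The band reflects the
# two failure modes described above: connector corrosion when too damp,
# powdered drive grease below about 10 percent relative humidity.
# read_relative_humidity() stands in for a real sensor driver.
import random
import time

RH_FLOOR, RH_CEILING = 10.0, 30.0  # assumed acceptable band, in percent

def read_relative_humidity() -> float:
    # Stand-in for a real sensor; returns a simulated reading for this demo.
    return random.uniform(5.0, 35.0)

while True:
    rh = read_relative_humidity()
    if rh < RH_FLOOR:
        print(f"RH {rh:.1f}%: too dry, drive-grease degradation risk")
    elif rh > RH_CEILING:
        print(f"RH {rh:.1f}%: too damp, connector-corrosion risk")
    time.sleep(300)  # check every five minutes
```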

A sealed rack eliminates dust problems, so you don’t need air filters, and the rack can be simpler, without all the quick-release connections for disks and server blades that give technicians the ability to quickly take things apart and put them back together. All that easy access comes with extra cost.

Doing away with data center staff can prevent many problems on its own. “It’s often possible to tell where maintenance has been happening in a data center; if people work in an area, you’ll start to see increased failure in that area two to three weeks later,” Cutler said. “Whenever you touch something, there’s a risk that something else is affected.”

For some scenarios where you need edge computing, whether that’s in space, on an oil rig, or down a mine, sealed units look like an obvious choice. After a seismic survey on an oil rig, terabytes of data usually travel back to head office on hard drives for processing; even a fast 100Mbps satellite link would need nearly a day per terabyte. Moving that processing workload to the rig itself could give you quicker results. “It’s possible that rigs on the ocean surface will disappear and become automated platforms on the sea bed,” Cutler noted. “You’ll need a lot more compute power to make that work.”

What Microsoft and HPE are learning about fully automated, lights-out data centers in space and underwater could help standard data centers too, whether that’s through automation and self-healing software or through sealed units. Teams inside Microsoft are already thinking about ways to apply takeaways from Project Natick to designing the company’s data centers on dry land, Cutler said. “We take all this back to the data center design community inside Microsoft and try to understand if any of these things make sense to deploy on land to give us an economic advantage, whether it’s being more environmentally friendly or having lower costs.”
