Imagine a server going down in an edge computing cluster at the foot of a cell tower hundreds of miles away from your nearest available server technician or a warehouse storing replacement hardware. You’d be very lucky if you got the machine replaced inside 24 hours, and getting a tech out to do it wouldn’t be cheap.
Now imagine having 50 such locations, a hundred of them, or a thousand, hosting a distributed platform running critical applications for customers. What kind of service level agreement do you think you’d be able to guarantee those users?
This operating-model problem is one of the biggest puzzles facing the companies that are building the earliest distributed edge computing platforms today. How do you keep so many remote sites running at a feasible cost?
Solving this puzzle is one of the big design goals behind Open19, the data center hardware standard born at LinkedIn that’s now overseen by the non-profit Open19 Foundation. What if installing a server was so simple a delivery driver could do it? What if you could simply keep a stack of replacement servers near your edge cluster, and when a live server in the cluster failed, a robotic arm would take it out and slide a new one in its place? What if a self-monitoring system would notice a server about to fail, automatically order a replacement, and power the old server down just in time?
In the future, you could have edge data centers everywhere: cell towers, factories, retail stores, racetracks – wherever computing power is needed to ingest and crunch data to make decisions on the spot, without the latency of connecting to a central data center that could be hundreds of miles away.
More Than Racks and Chassis
Open19 started with a uniform chassis and connectors that multiple vendors could design to. That standardization, hardware isolation inside the rack, self-monitoring, and self-healing provisioning systems are all pieces of the puzzle in creating fully automated, or “lights-out,” edge data centers, Yuval Bachar, LinkedIn’s principal engineer for global data center architecture, who is also Open19 Foundation president, said in an interview with Data Center Knowledge.
Numerous companies have gotten involved with Open19 specifically because of its usefulness at the edge. They include LinkedIn’s foundation co-founders Vapor IO, which builds data center infrastructure and software for edge computing; and Packet, which is starting to extend its cloud platform to cell towers. The US wireless tower giant Crown Castle, an investor and partner of Vapor, joined the foundation this year.
Bachar will be talking more about Open19 and its benefits for edge data center deployments at Data Center World Global in March, including an in-rack liquid cooling system that’s in the works. The system is intended for high-density compute used for machine learning applications, one of the workloads expected to proliferate at the edge, and for next-generation network switches, which he expects will reach similar power densities.
Ready to Plug in Wherever
The design is ready for whatever power supply is available, which is important when you have many different locations. “Our power shelf is universal,” Bachar said. “Any power you give it – AC, DC, single or multi-phase – everything is taken in to the power shelf and distributed to the system, so the servers are agnostic of the environment they operate in.”
Open19 uses disaggregated hardware with full power-supply isolation between the servers. “We don’t have any bus bars shared in the rack; every server is completely protected, monitored, and enabled by a separate channel. The power channel is isolated, and there's an e-fuse for every one of the servers for protection.”
Remote Monitoring More Important Than Ever
The individual e-fuses also provide real-time power consumption data, which can reveal emerging hardware problems. “If we see a server with fluctuating power consumption, that’s usually an indication that something is wrong,” Bachar explained. “Maybe there are a lot of writes into memory, or the disk drive is not performing.” Track heat fluctuations as well and you can see network failures or problems with load balancers, he said.
LinkedIn uses that information for proactive hardware maintenance, pulling servers with unusual power activity for testing before a problem affects the workload. Extend that predictive maintenance to create self-healing systems that order a new server before there’s a hardware failure or data loss and you get a fully automated environment that’s ideal for the edge, with no permanent staff. “There’s nobody at a cell tower,” Bachar noted.
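To make the idea of spotting fluctuating power draw concrete, here is a minimal sketch in Python. It is not LinkedIn's monitoring software, which the article does not describe in detail. The function names, the sample data, and the 15 percent threshold are all assumptions chosen for illustration, and the measure of fluctuation used here (standard deviation divided by mean) is just one simple option.

```python
from statistics import mean, pstdev


def power_fluctuation(readings_watts):
    """Return the coefficient of variation (stdev / mean) of power samples."""
    avg = mean(readings_watts)
    if avg == 0:
        return 0.0
    return pstdev(readings_watts) / avg


def flag_unstable_servers(samples_by_server, threshold=0.15):
    """Return a sorted list of server IDs whose fluctuation exceeds threshold.

    `threshold` is a hypothetical cutoff, not a value from the article.
    """
    return sorted(
        server_id
        for server_id, readings in samples_by_server.items()
        if len(readings) >= 2 and power_fluctuation(readings) > threshold
    )


if __name__ == "__main__":
    # Hypothetical per-server power samples, in watts.
    samples = {
        "srv-01": [200, 202, 199, 201, 200],  # steady draw
        "srv-02": [200, 260, 150, 240, 170],  # swinging draw
    }
    print(flag_unstable_servers(samples))
```

In a real deployment the flagged list would feed whatever process pulls servers for testing or orders replacements; the sketch stops at detection.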
Plug and Play
Open19 moves all cabling to the back of the rack, and the connectors are designed so that a new server slides into place and connects. That means a delivery driver could theoretically replace a server without accidentally disconnecting something, damaging a connection, or compromising airflow by leaving cables in the wrong place.
LinkedIn has written software for bringing servers online automatically, once they’re plugged into the rack. “The provisioning systems are already automated,” Bachar pointed out. “Inserting a server into a pool of production in Open19 means plugging it in, and when it’s plugged in, the system autodetects it and auto-provisions it.”
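The plug-in, autodetect, auto-provision flow Bachar describes can be pictured as a small event handler. The sketch below is hypothetical: the `Rack` class, its method names, and the `fake_provision` stand-in are invented for illustration and do not reflect LinkedIn's actual provisioning system or any Open19 interface.

```python
from dataclasses import dataclass, field


@dataclass
class Rack:
    """Tracks which slots hold servers and which servers are in production."""

    slots: dict = field(default_factory=dict)  # slot number -> server ID
    production_pool: set = field(default_factory=set)

    def on_server_inserted(self, slot, server_id, provision):
        """Handle a plug-in event: record the server, provision it, pool it."""
        self.slots[slot] = server_id
        if provision(server_id):
            self.production_pool.add(server_id)
            return True
        return False

    def on_server_removed(self, slot):
        """Handle a removal: free the slot and drop the server from the pool."""
        server_id = self.slots.pop(slot, None)
        if server_id is not None:
            self.production_pool.discard(server_id)
        return server_id


def fake_provision(server_id):
    """Stand-in for real provisioning (imaging, configuration, health check)."""
    return True


if __name__ == "__main__":
    rack = Rack()
    rack.on_server_inserted(3, "srv-42", fake_provision)
    print(sorted(rack.production_pool))
```

The point of the design is that the person or robot at the rack only triggers the insert event; everything after that is software.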
Both monitoring and provisioning software the company uses for its Open19 infrastructure will eventually be open sourced as a separate project, Bachar said. It may do so under the Open19 Foundation or “with other open source partners we have.”
But the Open19 platform works with existing management software, he added, so customers can use their current automation infrastructure the way LinkedIn uses its own today.
Robots to Swap Servers?
For locations with the space to keep a stack of replacement servers sitting ready at the end of the row, Bachar envisions a robotic hand that can remove a faulty server and insert a replacement, much like a tape robot changing tapes in digital tape archives today.
“That can lead us to much darker data centers where we don't have people twenty-four-seven,” he suggested. “They [would be] on call for critical situations, but changing servers is something we can do automatically and remotely by an automated system that can learn the situation and be proactive with replacing the server.”
An edge data center could be automated all the way from detecting hardware problems to ordering and installing a new server and setting it up. That’s something that would usually take days or weeks, depending on how difficult the location is to get to. Open19 is moving toward a world where once a delivery driver or a robotic hand gets to the rack, the server is online in seconds.