Piz Daint, the world's sixth fastest supercomputer (as of June 2018), designed by Cray for the Swiss National Supercomputing Center

Supercomputers Have Been Slow to Adopt Flash Storage – Cray Wants to Change That

• Cray is preparing to launch flash storage arrays for supercomputers
• The high-performance computing space has been slow to adopt flash, relying almost completely on spinning disk
• But as HPC workloads are changing, demand for random access to data is rising

High-performance computing isn’t just for research laboratories. Plenty of organizations use HPC to analyze large amounts of data from many different sources. That has become possible thanks to the shift from dedicated hardware to clusters of servers, which gives them a data center’s worth of hardware all working on a single problem. Hardware vendors like Cray have been able to build supercomputers out of commodity servers, taking advantage of improvements in processing power and in cluster management software. But HPC has lagged behind the modern data center and the growing hyperscale public clouds in one critical area: storage.

Look at any HPC installation and you’ll see a lot of spinning disks. Much of the file system code these installations use focuses on parallel storage, and until recently that has meant working with hard disk drives. IOPS haven’t been important enough for HPC workloads to justify the cost of flash storage, John Howarth, Cray’s VP of storage, told Data Center Knowledge in an interview. “We've been watching flash for a while; it's been dominant and successful in the enterprise space, but our customers have always valued cost and throughput as opposed to IOPS.”

But as workloads on HPC systems have started to change, there’s an increasing demand for random access to data. Throughput-based workloads still dominate, but now there’s enough interest that Cray will soon start offering pure flash storage alongside existing HDD and hybrid systems. By delivering a flash system that’s packaged as something familiar to its customers (and fits in the same storage carriers and enclosures they’re already using), Cray is hoping for quick adoption. Product marketing director Mark Wiertalla noted, “They know disks, they know file systems on top of disks, they don't want to rewrite code or write optimizations or change their workload management scripts, their admins don't want to learn new tools.”

In August, Cray plans to launch its 24-SSD, 2U L300F storage arrays, using Seagate SAS flash hardware and the recent 2.11 release of the Lustre open source parallel file system, which, according to Wiertalla, “puts us at the forefront of the Lustre roadmap in a way we haven't been for a long time.”

Lustre is little known outside the HPC world. “It’s pretty stable, reliable, and robust,” Howarth said. But there is a tradeoff compared to more mainstream storage systems. “It costs a lot less, but you get less availability, and the characteristics are different from other file servers [that have] five 9s capability.”

Lustre 2.11 adds support for progressive file layouts (PFL) and for data on metadata, both features HPC customers have been asking for. With PFL there’s no need to know the size of output files before they’re saved, and performance stays good for all types of file, from large single files to many small ones. Lustre’s metadata server architecture stores file metadata separately from file data, treating it as a queryable online transaction store that uses in-memory operations to speed up the file system. Data on metadata lets the contents of small files be kept on the metadata target as well, so they can be read without a separate trip to the object storage servers.
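The idea behind a progressive file layout can be sketched in a few lines of Python. This is an illustration only, not Lustre code: the `Component` class, the extent boundaries, and the stripe counts are all hypothetical values chosen for the example. It shows how a layout made of several extents, each striped more widely than the last, lets a file start small and spread across more storage targets only as it grows.

```python
from dataclasses import dataclass
from typing import List, Optional

MIB = 1024 * 1024
GIB = 1024 * MIB


@dataclass(frozen=True)
class Component:
    """One extent of a progressive layout (hypothetical model, not a Lustre API)."""
    start: int            # first byte offset covered (inclusive)
    end: Optional[int]    # first byte offset NOT covered; None means "to end of file"
    stripe_count: int     # number of storage targets this extent is striped across


# Illustrative layout: the boundaries and stripe counts are assumptions.
LAYOUT = [
    Component(0, 4 * MIB, 1),            # small files stay on a single target
    Component(4 * MIB, 256 * MIB, 4),    # medium files spread across four
    Component(256 * MIB, None, 16),      # large files use wide striping
]


def component_for_offset(layout: List[Component], offset: int) -> Component:
    """Return the component whose extent contains the given byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    for comp in layout:
        if offset >= comp.start and (comp.end is None or offset < comp.end):
            return comp
    raise ValueError("layout does not cover this offset")


def components_in_use(layout: List[Component], file_size: int) -> List[Component]:
    """Return the components that a file of file_size bytes actually touches."""
    return [comp for comp in layout if comp.start < file_size]


if __name__ == "__main__":
    for size in (1 * MIB, 100 * MIB, 10 * GIB):
        widest = max(c.stripe_count for c in components_in_use(LAYOUT, size))
        print(f"{size // MIB} MiB file: widest stripe count in use = {widest}")
```

Because the layout is fixed up front and the file simply grows into it, the writer never has to declare the final file size, which is the property the article describes.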

Cray has focused initially on using the L300F as an object store, supporting customers who need intermediate-result storage. “We see flash as being the best medium to satisfy the combination of metadata performance and performance for objects,” Wiertalla told us. Additional features, such as progressive file layouts, will come later in the year with software updates. “We've taken the approach of making it simple and using flash like disk. It looks like the same kind of server attached storage we’ve been doing for a long time. We’re starting simple and it provides immediate benefits for users and admins.”

Slow Dash to Flash

While existing data-parallel workloads will continue to be supported by HDD file systems, Cray’s flash hardware is better suited to new low-latency workloads, such as training machine learning systems or simulation and visualization, that don’t have the same storage profile as classic parallel computing problems. With these mixed workloads, latency is key, as data on flash is read, written, and reused multiple times before being written to HDD. “We could never get there with hard drives, no matter how much we scale up the disks; you always have limitations of spindle speed and arm movement,” Wiertalla said.

The L300F isn’t Cray’s first flash product. It’s already shipping a hybrid flash and HDD array, the L300N, which delivers the advantage of flash caching without making customers optimize applications for flash. “Its sweet spot is that mixed workload, where you really don't know when the IO profile is purely bandwidth or IOPS, when you really don't know if your block size is tuned for disk or should be held by flash.”

But the L300F is Cray’s first step into the future of HPC storage and how it will affect operating system and cluster design.

“What limits flash from a commercial perspective is that it’s still more expensive than disk,” Wiertalla said. Per gigabyte, flash is up to 13 times more expensive than standard 7,200 rpm hard drives. The higher price has put Cray customers off flash so far, which is why the L300F isn’t meant for primary storage. “When they look at cost per IOPS, flash is a lot less expensive than disk, so they’ll use it tactically.”
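A rough back-of-the-envelope calculation shows how flash can lose on cost per gigabyte yet win on cost per IOPS. Only the 13x price ratio comes from the figures above; the HDD price, the drive capacity, and both IOPS numbers in this Python sketch are assumptions picked for illustration, not Cray or vendor data.

```python
def cost_per_iops(price_per_gb: float, capacity_gb: float, iops: float) -> float:
    """Device price divided by the random IOPS it can deliver."""
    return price_per_gb * capacity_gb / iops


# Assumed figures for illustration only.
HDD_PRICE_PER_GB = 0.03      # assumption: dollars per GB for a 7,200 rpm drive
CAPACITY_GB = 4000           # assumption: same capacity for both devices
HDD_IOPS = 150               # assumption: random IOPS for one hard drive
FLASH_IOPS = 50_000          # assumption: random IOPS for one SAS SSD

# From the article: flash costs up to 13 times more per gigabyte.
FLASH_PRICE_PER_GB = HDD_PRICE_PER_GB * 13

hdd_cost = cost_per_iops(HDD_PRICE_PER_GB, CAPACITY_GB, HDD_IOPS)
flash_cost = cost_per_iops(FLASH_PRICE_PER_GB, CAPACITY_GB, FLASH_IOPS)

if __name__ == "__main__":
    print(f"HDD:   ${hdd_cost:.4f} per IOPS")
    print(f"Flash: ${flash_cost:.4f} per IOPS")
    print(f"HDD is {hdd_cost / flash_cost:.1f}x more expensive per IOPS")
```

With equal capacities the price and capacity terms cancel, so the advantage reduces to the IOPS ratio divided by 13. Flash comes out ahead per IOPS whenever it delivers more than 13 times the random IOPS of the disk it replaces.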

Flash prices continue to drop, with the gap between it and HDD narrowing, as TLC takes over from MLC. “Eventually we expect the gap narrows to the point that it’s just easier to use flash as primary medium and relegate disk to archive applications and cold storage.”

That won’t happen quickly for the HPC market, though, because it’s not just about the cost of flash, or even the performance. Transitioning to flash and then to NVMe will require rearchitecting file systems, removing code optimized for HDD, and making changes at the OS level around locking, as well as redesigning the physical enclosures for NVMe. It’s likely to be a slow transition, because it requires changes to HPC operating systems as well as to hardware and storage technologies.
