GPU Acceleration Makes Kinetica’s Brute Force Database a Brute

The big data solution known as Hadoop resolved two huge, critical problems facing data centers at the turn of the last decade: First, it extended virtual data volumes beyond the bounds of single storage volumes, in a method that was easily managed — or at least, easily enough. Second, it provided a new mechanism for running analytical and processing jobs in batches and in parallel, using an effective combination of new cloud-based methodologies and old data science methods that had been collecting dust since before the rise of dBASE.

The way Hadoop resolved this second issue is at one level complex, and from another perspective, brilliant. But another technology that sat on the shelf for too long — graphics processor-based acceleration — has since ascended to its own critical mass. Now it’s possible for GPUs to run their own massively parallel tasks, at speeds that conventional, sequential, multi-core CPUs can’t possibly approach, even in clusters.

So along comes an in-memory database called Kinetica, whose value proposition is based on its GPU acceleration. A few weeks ago, the firm unveiled its Kinetica Install Accelerator, with a one-year license for a 1 TB installation plus two weeks of personal consultation; and its Application Accelerator, with a 3 TB license and up to four weeks of consultation.

Smash Bros.

What isn’t generally known about Kinetica is how radically simplified its architecture has been engineered to be. Since GPUs accelerate simple operations by running them massively in parallel, Kinetica is. . . shall we say, compelled to process data sets from massive data lakes without so many indexes.

“We want to be a part of that ecosystem where we solve a lot of problems that take four or five Hadoop ecosystem products, kind of patched together with duct tape, to get working,” said Nima Negahban, Kinetica’s CTO and co-founder, in an interview with Data Center Knowledge.

Usually just after an enterprise builds its data lake, it embarks on a noble quest to collect every possible bit and byte of data that can be scarfed up, Negahban told us. But then enterprises have an urge to query the heck out of it.

“SQL on Hadoop is not what Hadoop is for,” he said. “Hadoop is great for being an HDFS data lake, even though that’s not the first reason it was built. And then that produced the whole job-based mentality, where I have a question or I want to generate a model, so let me run a job. That’s different from the need to be a 24/7 operational database, that needs fast response times and query flexibility.”

Just after the data lake model first took root in enterprises, there was a presumption that analytics-oriented data models would soon envelop and incorporate transactional data — the type that populates data warehouses. That is not happening, and now folks are realizing it probably won’t. There will be a co-existence of the two systems, which will just have to learn to share and get along with one another.

But must this co-existence necessarily juxtapose the fast with the slow? Kinetica’s value proposition is that there are methodological benefits that can be gleaned from the new realm of analytical data science and “big data” (which has already become just “data”). But these benefits are best realized, Negahban argues, when they are applied to a reconfigured processing model that makes optimal use of today’s hardware — of configurations that weren’t available when Hadoop first appeared on the scene.

Kinetica utilizes its own SQL engine, said Negahban. By way of API calls, or alternately ODBC and JDBC connectors (for integration with client/server applications), it parses standard SQL, decomposing it into commands that are directed to virtual processors in its own clusters. Those processors are delegated to GPU pipelines. The ODBC and JDBC connectors also enable Kinetica to serve as a back end for analytics and business intelligence (BI) platforms such as Tableau and Oracle Business Intelligence.
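To make the idea of decomposing a SQL statement into per-processor commands more concrete, here is a minimal Python sketch. It is not Kinetica's engine or API; the shard layout, the function names, and the sample rows are all hypothetical, chosen only to illustrate how a query such as SELECT region, AVG(amount) ... GROUP BY region might be split into partial work per shard and then merged by a coordinator.

```python
from collections import defaultdict

# Hypothetical shards: each one holds rows of (region, amount).
SHARDS = [
    [("east", 10.0), ("west", 4.0)],
    [("east", 20.0), ("west", 6.0)],
    [("east", 30.0)],
]

def partial_aggregate(rows):
    """Per-shard step: compute a running (sum, count) for each group."""
    acc = defaultdict(lambda: [0.0, 0])
    for region, amount in rows:
        acc[region][0] += amount
        acc[region][1] += 1
    return dict(acc)

def merge_partials(partials):
    """Coordinator step: add up partial sums and counts, then divide."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for region, (s, c) in part.items():
            total[region][0] += s
            total[region][1] += c
    return {region: s / c for region, (s, c) in total.items()}

def avg_by_region(shards):
    # Similar in spirit to: SELECT region, AVG(amount) FROM t GROUP BY region
    return merge_partials(partial_aggregate(rows) for rows in shards)
```

The average is carried as a sum and a count rather than as a per-shard average, because averages of unequal shards cannot simply be averaged again. Each partial step touches only its own shard, which is what would let the steps run side by side.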

“With pretty much any BI platform,” he remarked, “you can drop in our adapters, and quickly start using the tool, just with accelerated performance.”

Call of Duty

Negahban’s journey towards building Kinetica began in 2010, when he was serving as CTO of one of the intelligence community’s principal consultancies, GIS Federal.

At that time, the firm was part of a U.S. Army project to converge some 200 separate analytics tools into a single API. This way, developers could produce their own custom applications that utilized this API, rather than cherry-pick one of the 200 tools for which the Army was licensed — at the expense of the other 199.

The bar that contenders for that Army project had to clear was whether the right analysis, at the right time, could save troops’ lives.

NoSQL, Hadoop + HBase, and Hadoop + Cassandra were the bases for the Army’s first projects. “Time and time again, the same issues had us arrive at the same conclusion,” related Negahban. “We had too many indexes to try to drive a query, which caused hardware fan-out to explode, and ingestion time to increase. We went from being a representation of up-to-the-second data, to being one day old, and after a while, a week old. We were that behind on ingestion.”

A year earlier, Nvidia had advanced its Fermi GPGPU microarchitecture, its successor to Tesla. Negahban reasoned that Fermi could offer virtually infinite compute capability, so he argued in favor of experimenting with Fermi as a rather ordinary database engine. . . multiplied by millions. Rather than re-architecting a column store that would graft Hadoop + Cassandra to Fermi, and generating a plethora of indexes that would orchestrate data distribution from big data lakes among big Cassandra clusters, Negahban’s idea was to multiply a brute-force SQL engine across a huge swath of pipelines.
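The brute-force approach described here can be sketched in a few lines of Python. This is an illustration of the general technique only, under stated assumptions: the function names are hypothetical, and a thread pool stands in for GPU pipelines purely to show the structure (ordinary Python threads will not deliver GPU-style speedups). No index is built or consulted; every row is tested against the predicate, and the work is divided into chunks that can be scanned independently.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(chunk, predicate):
    """Test every row in one chunk; no index is consulted."""
    return [row for row in chunk if predicate(row)]

def brute_force_select(rows, predicate, workers=4):
    """Split rows into contiguous chunks and scan the chunks concurrently."""
    if not rows:
        return []
    size = -(-len(rows) // workers)  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in chunk order, so row order is preserved
        parts = list(pool.map(scan_chunk, chunks, [predicate] * len(chunks)))
    return [row for part in parts for row in part]
```

For example, brute_force_select(list(range(10)), lambda r: r % 3 == 0) scans all ten rows and keeps the multiples of three. The trade-off the article points to is visible in the shape of the code: nothing has to be maintained at ingestion time, at the cost of reading every row on every query, which is only attractive when the scan itself is spread across a very large number of parallel units.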

“We were pretty much the first to think of that in a distributed fashion,” he said. “Our real contribution to the GPGPU world has been, basically opening people’s eyes to its ability to be used in data processing, where it’s been so focused on machine learning and the kinds of lower I/O/higher compute problem sets. Data processing and OLAP workloads are heavily I/O intensive, so we were one of the first to say, this has an application in this [new] context as well.”

Since advancing the cause of GPGPU for the U.S. Army, Negahban, along with GIS Federal CEO Amit Vij (also CEO of Kinetica), has found success with another major customer, the U.S. Postal Service.

“That trend that we got to see early at the Army, because they’re such a massive organization,” he told Data Center Knowledge, “we think is going to be a prevailing trend for enterprises all over, where they’ve spent millions of dollars in data creation and data infrastructure. The whole Hadoop movement, oddly enough, is enabling enterprises to have their ‘A-ha!’ moment. ‘I have this massive data lake, I’ve spent all this money on IoT and asset management, I’m dumping terabytes of data a day into my huge Hadoop cluster. Now how do I make this an operational tool that will give real-time ROI to my enterprise?’”
