DataStax Brings Graph Databases to Enterprise Cassandra

The term you may hear is “convergence.” The definition, if we’re using an honest dictionary, is the state of affairs when a vendor wants to be your one-stop shop for every selection in a given product space. There’s nothing particularly wrong with this, if the results add measurable value to your data center above and beyond the obvious cost savings.

In February 2015, San Francisco-based DataStax, which produces a commercial implementation of Apache Cassandra, acquired Aurelius, the producer of a graph database engine called TitanDB. TitanDB is not a database in itself, but rather a system for interpreting elements of data in terms of the type of relationship they have to one another. More to the point, a graph database is concerned with how data is related (for example, X is in B’s file because X owes B money), rather than simply the fact that data is related (e.g., X and Y are members of table Q).

What made TitanDB (or just “Titan”) unique is that it didn’t really have to build an entirely new database to function using graph methodologies. In fact, Titan actually required a back-end data store. Some organizations were using Hadoop’s HBase as a back-end, while others chose BerkeleyDB. But many preferred Cassandra, partly because it enabled them to leverage its continuous availability, as well as the absence of a single point of failure.

Multi-modal

As Robin Schumacher, DataStax’ vice president of products, told Datacenter Knowledge in an interview, it’s this seemingly natural dependency between the two components that compelled DataStax to acquire Aurelius, and then to add graph database methodology to its ongoing collection of access modes, in what it now calls DataStax Enterprise Graph.

“The rise of the multi-modal database is born out of these cloud applications, where it’s becoming ‘the new normal,’” said Schumacher. “If you can’t handle this in a single database, you are back to what you’ve seen in the relational world, where people will use one data management vendor for transactions, another for analytics, a third for search. (Then) your application has to be smart enough to direct it to the right data management vendor, that has different security paradigms, backup paradigms, etc. Let’s not do that.”

DataStax Enterprise was already capable of addressing a mixed workload problem, he said, where customers could run transactions, analytics, and search in the same cluster. Add to that the capability to run graph traversals, he said, and it becomes feasible to process new and informative styles of workload with the same data already being managed by Cassandra.

Regardless of whether your data is stored in the cloud, on-premises, or a combination of the two, Schumacher contended, cloud applications will mandate that this same data is accessible through any number of modes simultaneously. Retail applications for both desktop and mobile may utilize multiple modules, such as a product catalog, a user profile manager, a fraud detection system, a recommendation engine, a clickstream analyzer, and a log analyzer.

“Each module may have different data management requirements,” he explained. “One may need a data model that is very adept at handling time-series data, and that can write data very, very fast. You may need a model that is more JSON-oriented, if I have a particular Web application that’s communicating with browsers. Maybe I have a recommendation engine, and I need to be able to smartly analyze the moves that you’re making on my Web site or mobile app, do some analysis, understand the relationships between you, the products I’m selling, the various vendors involved, and maybe come back to you with smart, real-time recommendations.”

Traversal

It’s this last mode of access that is best suited for graph database access — in DataStax’ case, accessing the Cassandra data as though it were a graph database. A graph relationship is different from a conventional RDBMS relation, because it begs to be explained on a blackboard using circles, arrows, and geometry.

As database research analyst Curt Monash explained on his firm’s website a few years back, a graph database describes the full relationship between two nodes of data — and here, the “-ship” suffix is extremely powerful. It implies that the associations between nodes can be both qualitatively and quantitatively different from one another. The quality of that difference is a stored property in the graph database. The quantity is expressed as a kind of “weight,” which represents the degree of importance or prominence of a relationship.

The diagram above (courtesy DataStax) represents a typical graph database schema. When a database built upon a schema like this represents the association between a Web site’s customer and an inventory item, the relationship can represent a purchase, a glance on the item’s Web page, or a comment of approval or disapproval. And the weight could perhaps be employed to relate this customer association with all the other customers for that same item, or all the other items that same customer has purchased.
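To make the idea of a labeled, weighted relationship concrete, here is a minimal Python sketch of a property graph held in memory. It is an illustration only, not DataStax Enterprise Graph or Titan code: the `PropertyGraph` class, the `strongest_items` helper, the sample customers and items, and the weight values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str    # e.g. a customer vertex
    target: str    # e.g. an inventory item vertex
    label: str     # the "quality" of the relationship (purchased, viewed, ...)
    weight: float  # the "quantity": how prominent the relationship is


class PropertyGraph:
    """Hypothetical in-memory graph that indexes edges by both endpoints."""

    def __init__(self):
        self._out = defaultdict(list)
        self._in = defaultdict(list)

    def add_edge(self, source, target, label, weight):
        edge = Edge(source, target, label, weight)
        self._out[source].append(edge)
        self._in[target].append(edge)
        return edge

    def out_edges(self, vertex, label=None):
        """Edges leaving `vertex`, optionally filtered by relationship label."""
        return [e for e in self._out.get(vertex, [])
                if label is None or e.label == label]

    def in_edges(self, vertex, label=None):
        """Edges arriving at `vertex`, optionally filtered by relationship label."""
        return [e for e in self._in.get(vertex, [])
                if label is None or e.label == label]


def strongest_items(graph, customer):
    """Rank the items a customer touched by the total weight of all relationships."""
    totals = defaultdict(float)
    for edge in graph.out_edges(customer):
        totals[edge.target] += edge.weight
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Sample data; the weights are arbitrary illustrative values.
graph = PropertyGraph()
graph.add_edge("alice", "lamp", "purchased", 5.0)
graph.add_edge("alice", "lamp", "viewed", 1.0)
graph.add_edge("alice", "desk", "viewed", 1.0)
graph.add_edge("bob", "lamp", "commented", 2.0)

print(strongest_items(graph, "alice"))
print([e.source for e in graph.in_edges("lamp")])
```

In this sketch the same customer and item can be connected by several edges of different kinds, and the weight lets a query compare one customer's associations against another's for the same item.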

There are ways to accomplish this same type of relationship representation within a conventional RDBMS, explained Monash in a note to Datacenter Knowledge. But none of them are easy.

“You can implement a graph in a relational DBMS, in one or more long, narrow tables,” he told us. “For some use cases that works well. In others, it requires a lot of joins, and indeed an unpredictable number of joins.” By “joins,” Monash is referring to the grafting of tables onto each other to form wider records of relations.

“That's because the number of joins needed is tied to the path length (to a first approximation, it's the path length minus 1), so if the path length is unpredictable, so is the number of joins,” the database analyst continued. “Needing many joins stresses performance. Needing an unpredictable number of joins stresses SQL syntax.”
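Monash’s point about joins can be sketched in a few lines of Python. The example below is an assumption-laden illustration rather than anything from Monash or DataStax: `EDGES` stands in for a hypothetical long, narrow edge table, `paths_via_joins` mimics repeated self-joins for a path length fixed in advance, and `shortest_path` shows a breadth-first traversal that does not need to know the path length beforehand.

```python
from collections import deque

# Hypothetical "long, narrow" edge table: one (source, target) row per relationship.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")]


def paths_via_joins(edges, hops):
    """Find every path made of `hops` edges by self-joining the edge table.

    Returns (paths, joins). The join count is hops - 1, so the caller must
    know the path length before the query can even be written.
    """
    if hops < 1:
        raise ValueError("hops must be at least 1")
    paths = [list(row) for row in edges]  # one-edge paths need no join
    joins = 0
    for _ in range(hops - 1):
        # Each pass behaves like one more self-join on target == source.
        paths = [path + [dst]
                 for path in paths
                 for (src, dst) in edges
                 if src == path[-1]]
        joins += 1
    return paths, joins


def shortest_path(edges, start, goal):
    """Breadth-first traversal: follows edges until the goal is reached,
    however many hops that takes. Returns None if no path exists."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(paths_via_joins(EDGES, 3))
print(shortest_path(EDGES, "a", "d"))
```

The join-based function has to be told the hop count up front, which is the syntactic strain Monash describes, while the traversal simply stops when it arrives.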

Masterless-ness

DataStax’ Schumacher argued that this type of unpredictability, as introduced by the requirements of RDBMS and traditional SQL syntax, translates into non-determinism for customers looking to perform real-time analytics transactions.

“The underlying architecture limited you,” he said. “So the beautiful thing we have here is, we build on Cassandra, which allows us no downtime. With the older structures, the best you could do was ‘high availability.’ You couldn’t have continuous availability.”

That continuous availability, he continued, is made feasible through Cassandra’s native “masterless” architecture, in which data is continuously replicated across nodes, and no single node serves as a “master” with the remainder acting as “slaves.”

Schumacher did make it clear to us that his company is primarily an operational database vendor, and that DataStax Enterprise addresses use cases that are not as centered upon analytics as typical “big data” deployments. “We are happy to leave the purely analytic and/or data warehousing, data lake use cases to the Hadoop vendors. That said, we certainly support operational analytics in our database, and you need that to be able to make real-time decisions that are necessary for a transactional system.”
