Online auction giant eBay introduced an open source real-time analytics framework called Pulsar. eBay said it was using Pulsar in production at scale and was now making it available for others. Pulsar is licensed under the Apache 2.0 License and GNU General Public License version 2.0.
Pulsar illustrates a wider split in how companies handle the massive amounts of data they now have access to: batch processing for sheer volume, and real-time processing for on-the-fly analysis. Pulsar was built for the latter.
The company uses Hadoop for batch processing, delegating real-time analysis of user interactions to Pulsar. Batch processing has been successfully used for user behavior analytics, but newer use cases demand collection and processing in near real time, within seconds, according to the company. Real-time analysis leads to better personalization, marketing, and fraud and bot detection.
These real-time needs prompted the company to build its own Complex Event Processing framework. It was built to be fast, accurate, and flexible.
Pulsar is capable of scaling to a million events per second, according to a company blog post. It has sub-second latency for event processing and delivery. There’s no cluster downtime during upgrades and topology updates, and it can be distributed across data centers using standard cloud infrastructure.
Pulsar also includes a Java-based framework that developers can build other applications on top of.
Pulsar uses an “SQL-like event processing language,” according to the blog post's authors, Sharad Murthy, eBay's corporate architect, and Tony Ng, the company's director of engineering. It is used to collect and process user and business events in real time, providing key insights that systems can react to within seconds.
On top of the CEP framework, the company implemented a real-time analytics pipeline that shows how the different parts can work together. The processing it performs includes enrichment, filtering and mutation, aggregation, and stateful processing.
The pipeline integrates with other systems. Two examples given are sending events to a visual dashboard for real-time reporting and feeding backend systems that can react when certain things happen.
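To make those stage names concrete, here is a minimal sketch in Python of how enrichment, filtering, mutation, and stateful aggregation might be chained over a stream of events. It is not Pulsar code and does not use Pulsar's API; the event fields, the `GEO_BY_IP` lookup table, and every function name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator

Event = Dict[str, object]

# Hypothetical lookup table used for enrichment (not part of Pulsar).
GEO_BY_IP = {"10.0.0.1": "US", "10.0.0.2": "DE"}


def enrich(events: Iterable[Event]) -> Iterator[Event]:
    """Enrichment: attach a country derived from the event's IP address."""
    for e in events:
        yield {**e, "country": GEO_BY_IP.get(e["ip"], "unknown")}


def drop_bots(events: Iterable[Event]) -> Iterator[Event]:
    """Filtering: discard events flagged as bot traffic."""
    for e in events:
        if not e.get("is_bot", False):
            yield e


def mutate(events: Iterable[Event]) -> Iterator[Event]:
    """Mutation: normalize the event type to lower case."""
    for e in events:
        yield {**e, "type": str(e["type"]).lower()}


def count_by_session(events: Iterable[Event]) -> Iterator[Event]:
    """Stateful aggregation: keep a running event count per session."""
    counts = defaultdict(int)
    for e in events:
        counts[e["session"]] += 1
        yield {**e, "session_event_count": counts[e["session"]]}


def pipeline(events: Iterable[Event]) -> Iterator[Event]:
    """Chain the stages; each event flows through all of them in order."""
    return count_by_session(mutate(drop_bots(enrich(events))))


sample = [
    {"session": "s1", "ip": "10.0.0.1", "type": "VIEW"},
    {"session": "s1", "ip": "10.0.0.1", "type": "CLICK"},
    {"session": "s2", "ip": "10.0.0.9", "type": "VIEW", "is_bot": True},
    {"session": "s3", "ip": "10.0.0.2", "type": "View"},
]
results = list(pipeline(sample))
```

Because each stage is a generator, events pass through one at a time rather than being collected into batches first, which is the property that separates this style of processing from a batch job.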
Developers can run SQL queries for analytic purposes. “In Pulsar, our approach is to treat the event stream like a database table,” said Murthy and Ng on the blog. “We apply SQL queries and annotations on live streams to extract summary data as events are moving.”
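The idea of querying a moving stream like a table can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not Pulsar's event processing language: it computes what a generic SQL query such as `SELECT event_type, COUNT(*) FROM events GROUP BY event_type` would return if it were evaluated once per fixed 10-second window. The function name, the `(timestamp, event_type)` tuple format, and the assumption that events arrive in timestamp order are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def tumbling_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts_by_type) each time a window closes.

    Assumes events arrive ordered by timestamp (in seconds).
    Windows that contain no events are skipped.
    """
    current_start = None
    counts: Counter = Counter()
    for ts, event_type in events:
        start = ts - (ts % window_seconds)
        if current_start is None:
            current_start = start
        if start != current_start:
            # The previous window is complete: emit its summary.
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[event_type] += 1
    if current_start is not None:
        # Emit the final, possibly partial, window.
        yield current_start, dict(counts)


stream = [
    (0, "view"), (3, "click"), (9, "view"),
    (10, "view"), (14, "click"),
    (25, "view"),
]
summaries = list(tumbling_counts(stream, window_seconds=10))
```

Each summary is emitted as soon as its window closes, so a dashboard or backend system consuming the output sees results while later events are still arriving.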
eBay plans to include a dashboard and API for integrating with other services.
eBay has a track record of handling and visualizing data and relating it to the bigger picture. In 2013, eBay unveiled its Digital Service Efficiency (DSE) dashboard at the Green Grid Forum. DSE is a system of metrics that ties data center performance to business and transactional metrics. In short, it shows how turning one knob affects other parts of the infrastructure. The dashboard sums it all up with a “miles per gallon” measurement for technical infrastructure.