The exchange of information and data continues to grow. Large organizations now work with more customers and end users, all of whom have data-related requirements. As data volumes rise, many IT environments are being challenged by the scale of their datasets.
Most databases started out holding kilobytes of information. As technology expanded and database tools proliferated, our data vocabulary grew to include gigabytes, terabytes, and even petabytes. The reality is that this information continues to grow.
Let’s analyze some numbers:
- According to IBM, the end-user community creates 2.5 quintillion bytes of data every day, so much that 90% of the data in the world today has been created in the last two years alone.
- Every hour, Walmart handles more than 1 million customer transactions. All of this information is fed into a database holding over 2.5 petabytes of information.
- According to FICO, its credit card fraud detection system helps protect over 2 billion accounts around the globe.
- Facebook currently holds more than 45 billion photos across its user base, and the number is growing daily.
- Finally, The Economist recently pointed out that we can now decode the human genome in under a week, a task that originally took 10 years.
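To put these magnitudes in more familiar terms, the short sketch below converts the quoted figures into named units and per-second rates. It assumes decimal (SI) units, where each step is a factor of 1,000; the function names `humanize_bytes` and `per_second` are hypothetical helpers written for this illustration.

```python
# Decimal (SI) byte units, smallest to largest. Assumption: 1 kilobyte = 1,000 bytes.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes",
         "terabytes", "petabytes", "exabytes"]


def humanize_bytes(n):
    """Express a byte count in the largest SI unit that keeps the value >= 1."""
    value = float(n)
    index = 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {UNITS[index]}"


def per_second(per_hour):
    """Convert an hourly rate into a per-second rate."""
    return per_hour / 3600


# IBM's figure: 2.5 quintillion bytes is 2.5 * 10**18 bytes.
print(humanize_bytes(25 * 10**17))       # 2.5 exabytes
# The Walmart database size, written out in bytes.
print(humanize_bytes(25 * 10**14))       # 2.5 petabytes
# One million transactions per hour, expressed per second.
print(round(per_second(1_000_000), 1))   # 277.8
```

In other words, the IBM figure is a thousand times the size of the Walmart database, and the Walmart transaction rate works out to roughly 278 transactions every second.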
The challenge isn’t only to collect all of this information and keep it safe. Organizations must also make this data usable by understanding patterns and correlating user information. This demand has opened up the market for all sorts of big data control mechanisms, setting the stage for innovative approaches to "Big Data" from both private industry and the open-source community.
A Look at the Big Data Market
Big data management is a booming industry, with more than $10 billion being pumped into solutions that deal solely with big data challenges. Some major players have already emerged, both from private industry and from the open-source sector. The interesting wrinkle is that some of the open-source platforms are fueling the private solutions offered by some big vendors.
- SAP Sybase IQ. About two years ago, SAP purchased database giant Sybase for about $5.8 billion. Since then, SAP has been working hard to develop better systems to manage big data sets. The SAP Sybase IQ server was introduced as a cost-effective, highly optimized RDBMS data analytics engine. Its ability to run complex data analysis with good performance makes it a formidable product in the industry.
- Oracle Big Data Appliance. As data flows into a system, it is often fragmented or unstructured. This is where the Big Data Appliance comes in. Designed as a system optimized for collecting, organizing and managing massive volumes of data, this product helps load data directly into an Oracle Database. Bundled with Sun servers and built on the Cloudera model (using Hadoop to organize the data), the Big Data Appliance can help organizations manage and use their large data sets efficiently.
- HP Information Optimization Solutions. In June, HP introduced its new platform, which helps users collect, organize and use their large data sets. Using a solutions-based approach, HP offers a line of products capable of handling large volumes of data. For example, HP AppSystem for Apache Hadoop is an enterprise-ready appliance that integrates with HP’s Converged Infrastructure to help manage and scale out large data volumes. Incorporating other elements, like the newest Vertica 6 (HP Vertica Analytics Platform), helps organizations process data and perform real-time data analytics.
- IBM Big Data Platform. Like HP, IBM took a platform approach to the big data question. Its model allows users to visualize and discover the data they are trying to work with. It then uses Hadoop-based analytics to manage these large data volumes and help the whole platform scale as needed. Other key platform deliverables include the all-important data warehousing capability and the new textual content analyzer, which examines text within an unstructured data set for patterns that can then be used to better understand the information.
The open-source community has driven much of this progress. With numerous technologies now available for big data management, organizations can make solid decisions based on the platform that suits them best:
- Apache Hadoop. Hadoop is quickly becoming the standard for open-source big data management. Created for truly intensive distributed application usage, Hadoop continues to be a powerhouse platform that is being integrated into numerous enterprise solutions. It’s designed to run on commodity hardware and is capable of working with structured, unstructured and semi-structured data sets.
- Cascading. Currently used by Twitter, Cascading is another great open-source tool that helps create and execute workflows. Like several other systems, Cascading is a software layer that runs on Hadoop. Used for log file analysis, ad targeting, and predictive analytics, Cascading has quickly become a popular tool for numerous organizations dealing with large data sets.
- Apache HBase. Adopted by Facebook to support its messaging platform, HBase was modeled on Google’s powerful Bigtable storage system. This open-source, Java-based, distributed database was designed to run on top of Hadoop’s already popular platform.
- MongoDB. Originally from the founders of DoubleClick, MongoDB has already been adopted by organizations such as Disney, The New York Times, craigslist, and several others. Built on a NoSQL, open-source engine, MongoDB stores data as JSON-like documents with dynamic schemas.
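The JSON-like storage model mentioned for MongoDB can be illustrated with a minimal sketch. The example below does not use MongoDB or its driver; it models a collection as a plain Python list of dictionaries, and the `articles` data, the `find` helper, and the `field_names` helper are all hypothetical names invented for this illustration. The point it shows is that documents in the same collection do not have to share the same fields.

```python
import json

# A "collection" modeled as a plain Python list of dicts.
# Assumption: this is a stand-in for illustration, not MongoDB's API.
articles = [
    {"_id": 1, "title": "Big data basics", "tags": ["hadoop", "nosql"]},
    {"_id": 2, "title": "Photo storage", "author": {"name": "A. Writer"}, "views": 1200},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


def field_names(collection):
    """Collect every top-level field name used by any document."""
    return sorted({key for doc in collection for key in doc})


# Documents serialize naturally to JSON, including nested objects and arrays.
print(json.dumps(find(articles, _id=2), indent=2))

# The two documents carry different fields, yet live side by side.
print(field_names(articles))
```

A relational table would require every row to fit one predefined set of columns; here the second document adds a nested `author` object and a `views` counter without any change to the first.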
Because we live in a data-on-demand society, the growth of large datasets seems inevitable. The amount of available information will only continue to grow as users adopt more Internet-connected devices. Many organizations are now working hard to collect this data efficiently and put it to work. The data being collected must be analyzed and processed if it is to deliver any ROI. With advancements in IT consumerization and the growth of the cloud, usage rates will only continue to climb, producing yet more data that will need to be monitored, collected and quantified.