Gary Orienstein is Senior Vice President of Products for MemSQL.
The database market has evolved over the decades on the incredible efforts of several single server databases including Oracle, Microsoft SQL Server, PostgreSQL, MySQL, and MariaDB. There are many more; however, these few have furthered the industry with a recipe for building robust transactional systems.
In fact, Oracle and Microsoft SQL Server, by far the two most popular commercial single server databases, are the driving forces behind the combined 65 percent market share for the two companies.
Single server databases provide an architectural simplicity that is hard to beat. You have a single process running on Server 1, then Server 2 provides high availability.
So why change?
The short answer is that in today’s world of modern internet-connected applications, single server systems no longer provide the performance or scale to meet application requirements. A prominent example of this is Uber who uses many distributed systems, including datastores, to solve real-time application needs.
Let’s explore the challenges seen with single node systems when applications drive large volume, performance intensive workloads, and suggest how companies can identify and transition to the right distributed systems.
The Pros and Cons of Single Server Systems
Single server databases provide a number of benefits including:
- Architectural simplicity of operations happening within a single server
- Rich feature sets based on years of development
However, they also come with some limitations:
- Datasets must fit within the capacity of a single server or storage area network (SAN)
- Performance is limited to the speed of a single server
Moving On At The Right Time
The first maxim about moving on from a single server systems is that if your current system is working, there is no need to change. Distributed systems represent new ground for many database professionals and while an important mechanism for scale, they should not be approached prematurely.
Distributed systems aggregate the power of many servers or cloud instances working together. They represent a single logical system, but utilize the CPU cores, memory, network bandwidth and storage of all the servers together. While running a distributed system provides far more performance and scale than a single server system, they require a level of systems and database management expertise beyond traditional approaches.
In an article based on her forthcoming book on cloud-native architectures, Anne Currie writes:
Nowadays I spend much of my time singing the praises of a cloud native (containerized and microservice-ish) architecture. However, most companies still run monoliths. Why? It’s not because we’re just wildly unfashionable, it’s because distributed is really hard. Nonetheless, [distributed systems] remains the only way to get hyper-scale, truly resilient and fast-responding systems, so we’re going to have to get our heads round it.
Signs Of Change
There are a few telling signs when it might be time to change from your single server database. They typically relate to the following workload metrics:
Slow Ingest: Modern applications such as sensor data collection or internet-facing mobile apps require the ability to rapidly ingest and store large amounts of data. A single server database often has a fixed amount of ingest throughput since it runs on a single machine. The limits could be I/O or memory, storage capacity, processing power or a combination of these. These constraints can become a liability for applications intended to scale.
Long Query Wait Times: In addition to collecting data, applications require the ability to understand data through analytical queries. With a single node database, those queries can take minutes or hours if they require a significant amount of processing. Queries take longer if much of the needed data resides on disk drives. This extended time may not be sufficient to meet business requirements.
Limited Concurrency: Applications frequently need the ability to support a large number of users. That might include dozens or hundreds of analysts within the company who require data access. It may also include customers or partners who need simultaneous access to important data. Concurrency determines the amount of unique database connections that can be supported at a single time, and most single node databases cannot scale to support large numbers of concurrent users.
Constrained Capacity: Finally, capacity alone might emerge as the bottleneck for a single server database. While it is possible to expand the capacity of a single server database through networked storage, the processing of that capacity is still restrained to a single system. Many modern applications have large real-time and historical data volumes that require performance as well as capacity.
Graduating to Distributed Systems: If the above system bottlenecks start to emerge, it is time to consider a new solution. One approach is to separate analytics from the transactional system through the introduction of a separate data warehouse. However, this requires implementing an ETL process (Extract, Transform, and Load) between the database and the data warehouse, and thereafter the data warehouse is always inherently a bit out of date.
A more robust solution is to solve the constraints of the single server system with a distributed system, which provides the following benefits:
Capture Real-Time Data with Fast Ingest: With a distributed system, data can be ingested across multiple server nodes providing higher throughput to collect massive data streams. Some distributed datastores also provide easy connectivity to other popular distributed systems for messaging and transformation such as Apache Kafka and Apache Spark, which can be used to build real-time data pipelines.
Deliver Immediate Insight with Low-latency Queries: Datastores that provide both transactional and analytical capabilities benefit from eliminating the ETL process. But those datastores also need the ability to respond to sophisticated queries instantly. Distributed databases that enlist the power of multiple nodes working together can deliver the query performance needed for both real-time and historical data at scale.
Support Large Numbers of Simultaneous Users: Distributed systems spread workloads across multiple nodes and can sustain support for many users under heavy load. This is a challenge for a single server database that can only support a limited number of connections. As companies build more customer- and partner-facing applications, as well as offer analytics to more internal users, the requirements to support a large number of concurrent users grows.
Storing High Volume Data: It is no secret that data volumes continue to grow. Distributed systems do well in handling this data as they can scale-out to support a large number of nodes and storage capacities at petabyte scale. This allows companies to store both real-time and historical data for the most accurate and context-aware analytical queries.
Charting A Path to Distributed Computing
As the world moves towards higher data volume and real-time applications, the need for speed, scale, user support, and data capacity mean that single server systems are often unable to meet the needs of modern applications. Moving to a distributed database architecture can alleviate the constraints of single server systems, and companies aiming to stay ahead must develop the relevant skillsets to manage and operate new types of distributed databases and data warehouses.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.