
Managing Unstructured Data at Scale

As-a-service delivery means that IT staff can minimize the time they spend on non-strategic tasks such as executing backup and archives.

Andy Ferris is Product Manager for Igneous Systems.

Organizations with petabytes of unstructured data face unique challenges when trying to manage that data. At petabyte scale, traditional methods of data protection and management begin to break down; if they don’t simply stall out due to the sheer volume of data, they become too expensive, overly restrictive, or impossible to manage. With this in mind, a new strategy is necessary.

An effective data management strategy for organizations with petabytes of unstructured data must be:

●      High performing, to keep up with the massive volumes of data

●      Easy to manage across a diverse storage environment

●      Cost effective and priced for growth

Many backup and archive solutions are built to meet some of these criteria, but very few can meet all of them. The two main strategies that enterprises use today, tape backup and disc-to-disc replication, both fall short of meeting all three criteria.

NDMP-based Backup

Traditional tape backup was designed when even the very largest organizations had gigabyte-scale datasets. At the time, tape backup using NDMP-based software such as Commvault or NetBackup provided inexpensive data protection and could double as an archive strategy. The manual labor involved in setting up and managing tape-based backups was negligible compared to the performance, price, and protection tape provided.

However, as the amount of unstructured data at the average Fortune 1000 company has grown into petabyte-scale volumes and billion-object counts, organizations have found it impossible to scale tape workloads accordingly.


NDMP-based data movement protocols are single-threaded and resource-intensive, so backups consume nearly 100 percent of available resources while they run. They normally run overnight, which inconveniences staff and applications that also need those resources overnight and exposes the organization to risk if a backup fails to complete before the window closes. Additionally, recovering data from tape backups is notoriously challenging, requiring hours of labor for a complicated process. All of this creates significant burdens on IT admins.
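To make the backup-window risk concrete, here is a rough back-of-the-envelope sketch. The function name, the 200 MB/s per-stream throughput, the 100 TB dataset, and the 8-hour window are all hypothetical values chosen for illustration, and the calculation assumes throughput scales perfectly with the number of streams, which real systems rarely achieve.

```python
def backup_hours(data_tb: float, throughput_mb_s: float, streams: int = 1) -> float:
    """Estimate the hours needed to move data_tb terabytes.

    Assumes decimal units (1 TB = 1,000,000 MB) and ideal linear scaling
    across streams; both are simplifying assumptions.
    """
    total_mb = data_tb * 1_000_000
    seconds = total_mb / (throughput_mb_s * streams)
    return seconds / 3600


if __name__ == "__main__":
    window_hours = 8  # hypothetical overnight backup window
    for streams in (1, 16):
        hours = backup_hours(data_tb=100, throughput_mb_s=200, streams=streams)
        status = "fits" if hours <= window_hours else "overruns"
        print(f"{streams:>2} stream(s): {hours:.1f} h, {status} the {window_hours} h window")
```

Under these assumed numbers, a single stream needs days rather than hours, which is the kind of gap that leaves a backup unfinished when the window closes.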

Even the cost advantages of tape are not what they seem at first glance. Although storing a single terabyte on tape is inexpensive, that is a small piece of the true cost of using tape for data protection. Because a typical tape workflow stores weekly backups for a full year, organizations relying on tape will need nearly 30PB of tape in order to back up 1PB of data. Add to that the costs of backup software, tape libraries, tape drives, and the immense amount of labor required to set up and maintain these workloads, and the total cost of tape is far from a bargain.
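The way retention multiplies tape capacity can be sketched with a small calculator. This is an illustrative model only: the function, its parameters, and the schedules below are assumptions, not the method behind the article's 30PB figure, which depends on the exact mix of full and incremental backups and on compression.

```python
def tape_needed_pb(
    dataset_pb: float,
    fulls_retained: int,
    incrementals_retained: int = 0,
    change_rate: float = 0.0,
    compression_ratio: float = 1.0,
) -> float:
    """Estimate tape capacity (PB) needed to hold a retention schedule.

    Each full backup stores the whole dataset; each incremental stores
    dataset_pb * change_rate. All inputs are hypothetical.
    """
    raw_pb = dataset_pb * fulls_retained
    raw_pb += dataset_pb * change_rate * incrementals_retained
    return raw_pb / compression_ratio


if __name__ == "__main__":
    # Hypothetical schedule A: 52 weekly full backups, no compression.
    print(tape_needed_pb(dataset_pb=1, fulls_retained=52))
    # Hypothetical schedule B: 12 monthly fulls plus 40 weekly
    # incrementals, each capturing an assumed 10% of the dataset.
    print(tape_needed_pb(1, fulls_retained=12, incrementals_retained=40, change_rate=0.1))
```

Whatever the exact schedule, the point stands: retained copies, not the primary dataset, drive how much tape an organization must buy and manage.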

Some organizations have determined that while tape is not scalable, they can use a disc target for their NDMP-based workflows instead. This solves some of the cost and recoverability issues with tape workflows, but it doesn’t solve the scalability problems. Since these solutions rely on the same resource-intensive data movement protocol, they still leave organizations with a heavy management burden, no room for growth, and a high likelihood of leaving vital data unprotected when a backup window closes.

Disc-to-Disc Replication

Disc-to-disc replication is optimized to be scalable and high performing, so that backups complete comfortably and data recovery is relatively fast. This covers two of our three requirements. 

Products such as Isilon’s SyncIQ and NetApp’s SnapVault can replicate large quantities of data and provide easy access to backup data. The costs of these solutions are usually bundled with primary and secondary storage, and can scale effectively.

However, these solutions are vendor-specific, requiring secondary tier appliances made by the same vendor. Most organizations with large amounts of unstructured data have primary NAS systems from multiple vendors, in order to optimize costs and performance across their networks and geographies.

In effect, these vendor-enforced siloes of primary storage, secondary storage, and management software create problems for organizations, as they struggle to manage disparate systems and workloads. These workloads might be able to keep up with the rate of data creation, but for many, it’s not worth the hassle.

Optimized Solution

If an organization with petabytes of unstructured data were to describe an ideal solution, they’d have a few requirements in mind. As mentioned above, the ideal solution would be high performing, easy to manage across a diverse storage environment, and cost effective at scale.

Modern tools on the market today have made significant improvements over tape and replication backup solutions: first, efficient multi-threaded backup protocols that can achieve the performance required to manage petabytes of data; second, simpler and faster search-based recovery, which is critical at scale; and third, primary-storage-agnostic software that simplifies data management in complex environments. As unstructured data continues to grow in size and value, these improvements are essential.
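As a minimal sketch of what multi-threaded data movement means in practice, the following copies a directory tree with a pool of worker threads instead of one file at a time. It is an illustration using Python's standard library, not any vendor's protocol; the function name `copy_tree_parallel` and the default of 8 workers are assumptions.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_tree_parallel(src: Path, dst: Path, workers: int = 8) -> int:
    """Copy every file under src into dst using a thread pool.

    Returns the number of files copied. The directory layout under src
    is preserved under dst.
    """
    files = [p for p in src.rglob("*") if p.is_file()]

    def copy_one(path: Path) -> None:
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copies contents and file metadata

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so a failure in any worker is raised here
        list(pool.map(copy_one, files))
    return len(files)
```

Because the work is dominated by I/O rather than computation, threads can overlap many transfers at once, which is the general idea behind moving petabyte-scale datasets without relying on a single stream.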

A truly optimized solution is a high-performing, vendor-agnostic product designed specifically for petabytes of unstructured data, offered at a cost-efficient price point and delivered as a service. As-a-service delivery means that IT staff can minimize the time they spend on non-strategic tasks such as executing backups and archives, regardless of the scale and growth rate of their unstructured data. Organizations with petabytes of unstructured data can optimize their data management strategy with a modern data management solution that offers all of these benefits in a format that minimizes workloads on IT staff and maximizes scalability.

Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating.

