I can still remember a time in my distant past when data protection made me smile.

Life was simple back then: Data was attached to a single application, both were hosted on a singular marvel of modern information technology engineering – the SAN. We had redundancies piled upon redundancies, compression, deduplication, snapshots, clones, incrementals, differentials, inline, and out-of-band, among others. In short, we had it all!

But time marches inevitably on, and soon enough we started to want to share data across and between appliances, and with applications we didn’t write, and potentially didn’t even host. Instead of keeping all of our data centrally in a lovingly crafted – and expensive – bespoke repository, we started keeping it here, there, and everywhere, like so much loose change.

Then an era of new, non-infrastructure data owners sprang up. Steeped in the dark arts of corporate administration, they promised better performance for local apps, simplified access for their remote (sometimes third-party) teams, and lower storage costs.

While we knew deep down they were right, we lamented the loss of traditional backup. Simplicity in data protection had come to an end. Recent events have only exacerbated this long-simmering problem:

  • Covid-19 response measures have driven even more applications to further flung corners of an increasingly distributed enterprise.
  • Business-focused data owners care most about extracting value from the asset. Increasing efficiency and mitigating risk are often immaterial to them.
  • The rise of IoT has advanced digital twinning, and armies of data scientists want to mash those petabytes of information against external normalized or control sets of equal or greater size.

But enough hyperbole – what are the actual numbers here? How big a problem do I really have in modern data protection?

Tempting though it may be, we, unfortunately, can’t put the genie back in the bottle. Data demand is skyrocketing, vendors – especially as-a-service vendors – are lining up to take our money, and infrastructure is, as always, responsible for holding it all together.

Fortunately, backup-as-a-service offerings have evolved rapidly over the past few years. But don’t just blindly believe the hype. Let’s do our due diligence so we know what to look for in a good BaaS platform.

Data growth

All cloud services are rapidly scalable – it’s part of the very definition of cloud. But we need more than a vendor that can scale up, we need an intelligent way to respond to our growth. This can include:

  1. Data optimization technologies. Upon ingestion, can we further compress the data? Perhaps not, depending on the raw data type. Can we leverage the similarities between time series copies of the data? What black magic can we employ to ingest 10TB of information and yet only write 8TB to our target?
  2. Storage optimization technologies. There are many targets to which we could write our 8TB. Each target has its own characteristics, not the least of which is cost. Some are going to be public cloud accessible, but I may also need on-prem security. A good service will present me with an array of options.
  3. Preservation technologies. How certain are you that the data you wrote five years ago is retrievable today? How certain do you need to be? Data rot may have originated with archivists and crumbling bits of paper in libraries, but it is just as real a threat today. A good solution will reassure me the data I wrote last decade has the same fidelity as the data I wrote last night.
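To make point 1 a little more concrete, here is a minimal sketch of the dedup "black magic": chunk each nightly copy, hash the chunks, and only write chunks the target hasn't seen before. The fixed-size chunking and toy byte strings are assumptions for illustration – real products use content-defined chunking, compression, and much larger chunks.

```python
import hashlib

def dedup_ingest(blobs, chunk_size=4):
    """Ingest a stream of blobs; store each unique chunk only once.

    Returns (logical_bytes, physical_bytes, store) so we can compare
    how much we ingested versus how much actually hit the target.
    """
    store = {}                # chunk hash -> chunk bytes actually written
    logical = physical = 0
    for blob in blobs:
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            logical += len(chunk)
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in store:   # only never-seen chunks get written
                store[digest] = chunk
                physical += len(chunk)
    return logical, physical, store

# Two nightly copies that share most of their content:
monday  = b"AAAABBBBCCCCDDDD"
tuesday = b"AAAABBBBCCCCEEEE"        # only the last chunk changed
logical, physical, _ = dedup_ingest([monday, tuesday])
print(logical, physical)             # 32 logical bytes, 20 physical
```

The same principle – exploiting similarity between time-series copies – is how a service ingests 10TB but writes closer to 8TB.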

Data governance

Part of the reason infrastructure practitioners liked that SAN so much is that all data was treated as a first-class citizen. Cat videos and iPod music libraries had the same redundancy and backup as customer sales data. If everyone was equal, we didn’t need governance.

Today, we can no longer afford such luxurious simplicity. So, we’re left with the unenviable task of imposing a governance schema onto data we may be responsible for, but don’t actually own, nor do we necessarily fully understand. So how can an infrastructure solution help us with this key challenge?

  1. Compliance templates. Vendor templates can help make these broad, complicated control objectives specific and attainable.
  2. Ownership and tagging schema. A backup solution must be able to identify and track ownership of data elements, and maintain a set of attributes describing the data well enough for us to care for it properly.
  3. Automated classification. AI algorithms promise to tag data upon ingestion, automatically identify PII, HIPAA, or GDPR-regulated data (amongst others), and apply intent-based policies. And while they claim 95 percent, 99 percent, or 99.9 percent efficacy, we all know that isn't nearly good enough.
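As a toy illustration of points 2 and 3 together, the sketch below tags records on ingestion with a couple of regex detectors and then resolves an intent-based policy from the tags. The patterns, tag names, and policy table are all hypothetical stand-ins for the far richer models a real classifier would use.

```python
import re

# Hypothetical detectors; real classifiers go well beyond regex.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Intent-based policy keyed on whether any PII tag fired (illustrative values).
POLICY = {
    "pii":     {"encrypt": True,  "retention_days": 2555},
    "default": {"encrypt": False, "retention_days": 365},
}

def classify(record: str) -> dict:
    """Tag a record upon ingestion and resolve the policy to apply."""
    tags = {name for name, rx in PII_PATTERNS.items() if rx.search(record)}
    policy = POLICY["pii"] if tags else POLICY["default"]
    return {"tags": sorted(tags), "policy": policy}

print(classify("contact: jane@example.com, SSN 123-45-6789"))
```

Even this toy makes the efficacy problem obvious: one pattern the vendor didn't think of, and regulated data sails through untagged.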

Data locality

One truism has stood the test of time: Data does indeed have gravity. As the amount of data in a set rises, it becomes exponentially more difficult to pull it back to a central location. An effective solution, therefore, goes to where the data is, rather than expecting the data to come to it.

  1. Cloud transit. It may not cost you money to bring data into a public cloud, but it certainly does to get it out. How will this affect your backup solution? Does it have to be represented inside your VPC, or does the solution support multiple CSP hosts to minimize your data movement?
  2. Hybrid cloud. While some people have successfully gotten rid of on-prem in its entirety, the rest of us have to deal with a hybrid reality for the foreseeable future – and we’ll need a solution that integrates with our legacy appliances and applications, too.
  3. Public is anathema to private. When I’m in a public domain, how can I ensure the privacy of my backup while it is in transit as well as when it finally comes to rest? I may not know myself, but I’m certainly expecting my as-a-service vendor to have some awfully good ideas.
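The egress problem in point 1 is easy to put numbers on. Below is a back-of-the-envelope calculator; the per-GB prices are assumptions for illustration, not any CSP's actual rate card.

```python
# Illustrative per-GB egress prices in USD - assumptions, not quotes.
EGRESS_PER_GB = {
    "in_region":   0.00,   # backup solution lives inside the same VPC/region
    "internet":    0.09,   # pulling data back out over the public internet
    "cross_cloud": 0.09,   # moving it to another CSP instead
}

def restore_cost(dataset_gb: float, path: str, restores_per_year: int = 2) -> float:
    """Annual egress bill for pulling a backup back out of a public cloud."""
    return dataset_gb * EGRESS_PER_GB[path] * restores_per_year

# A 50TB dataset, restore-tested twice a year over the internet:
print(round(restore_cost(50_000, "internet"), 2))   # 9000.0
```

Which is exactly why a solution represented inside your VPC, or supporting multiple CSP hosts, can pay for itself in avoided transit alone.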

Data dispersion

In a way, unstructured data is almost easier. No matter where the files come from, at the end of the day they are just files. Assign some metadata and we're basically home free. Information inside an app, however, is arranged in a structure dictated by the designers and programmers of said app. And if my portfolio has 129 apps, that means potentially 129 different data architectures I have to unravel and interact with in order to back up my information.

  1. Third-party support. A good backup-as-a-service provider is going to have done the hard work for you and be able to understand the data architecture of the most commonly used apps. Whether you’re using Microsoft 365, Salesforce, Concur, Zendesk, Workday, Twilio, Shopify, GSuite, or all of them, your vendor can and should help shoulder the load.
  2. API support. Modern SaaS applications have some degree of built-in data protection. Of course, on their own they're likely not sufficient for your enterprise needs, but in the interests of efficiency why bother replicating what you've already paid for once? Public-facing APIs can let a backup solution manipulate the information internally and coax it towards compliant behavior based on a policy you've set. Ideally, of course, we still want to be able to supplement that in-app manipulation when it inevitably falls short of our policy intent.
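The API-driven approach in point 2 usually boils down to paging through a SaaS app's export endpoint and landing the records in our own store, where our policy – not the app's – applies. This sketch assumes a hypothetical cursor-paginated JSON API; `api.example-saas.com` and its field names are invented for illustration.

```python
import json
import urllib.request

BASE_URL = "https://api.example-saas.com/v1/records"  # hypothetical endpoint

def export_page(cursor=None):
    """Fetch one page of records from the (hypothetical) public API."""
    url = BASE_URL + (f"?cursor={cursor}" if cursor else "")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)   # {"items": [...], "next_cursor": ...}

def backup_all(fetch_page=export_page):
    """Page through the API, yielding every record for our own backup store."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:            # last page reached
            return
```

Because `fetch_page` is injectable, the same loop works against any app whose API follows this shape – which is precisely the adapter work a good BaaS vendor should have already done for its supported apps.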

In conclusion, these are all really hard things to do! Fortunately, we have (some) money and there is a huge marketplace of vendors who want that money. But let us take a moment to remember what vendors really are: specialists in doing what is hard for everyone (or at least most people), the broad strokes. While it's convenient to imagine that all we have to do to solve our modern-day backup problems is write a check, we can't afford to lose sight of where our true accountability lies.

Your company won't be satisfied with just the broad strokes. Your company wants attention paid to its details, and needs someone to own solving the problems that are hard for it. This is where you come in. You are the expert who can help sift through the firehose of options vendors present to solve the general case. No service provider, vendor, or consultant knows your company better than you do. This is your job. And it's way more glamorous – and provides much greater value to your stakeholders – than your ability to initiate an iSCSI target on a LUN.