It comes as no surprise that cloud resilience is a top IT buzzword of the 2020s. Ensuring resilience against cyber-attacks and ransomware extortion, as well as the ability to recover from IT disruptions quickly, are critical imperatives for organizations today. Without a resilient IT and application infrastructure, operational business processes are susceptible to breakdowns.
All the big cloud providers offer services and features for resilience. However, no CIO or IT professional should assume that shifting all workloads to the cloud guarantees complete resilience. The clouds offer building blocks, not ready-to-play-with fairytale castles. Instead, security architects and business continuity management experts must combine features and services cleverly.
Figure 1 provides a guiding roadmap that highlights four central scenarios for cloud resilience:
- Solution-internal resilience (1) looks at applications or databases that crash on their own, without external events, issues in the underlying infrastructure, or impact from other components.
- Infrastructure resilience (2) addresses problems in the underlying hardware or technology layers and the network.
- Crash cascade resilience (3) aims to suppress domino effects, i.e., that the crash of one application impacts others.
- Cyber-attack resilience (4) deals with external attackers who break into a data center or cloud tenant.
Scenario 1: Solution-Internal Resilience
The main risks solution-internal resilience must cover are coding and configuration errors, unexpected data constellations, and peaking resource requirements.
Resilience regarding workload peaks is much easier to achieve in the cloud. First, platform-as-a-service (PaaS) offerings such as Cosmos DB have autoscaling features. Second, in the infrastructure-as-a-service (IaaS) cloud world, load balancers combined with groups of VMs (e.g., Azure Virtual Machine Scale Sets or Amazon EC2 Auto Scaling) are an easy-to-implement solution. This approach keeps sufficient VMs available by adding and removing VMs depending on demand and by replacing crashed VMs with new ones.
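The scaling decision behind such VM groups can be sketched in a few lines. The following is a simplified proportional rule, not the actual algorithm of any cloud provider; the function name, the CPU target, and the VM limits are assumptions chosen for illustration.

```python
import math


def desired_vm_count(current_vms: int, avg_cpu_percent: float,
                     target_cpu_percent: float = 60.0,
                     min_vms: int = 2, max_vms: int = 10) -> int:
    """Return how many VMs the group should run to bring average CPU near the target.

    Hypothetical helper: all defaults are illustrative, not provider defaults.
    """
    if current_vms < 1:
        # Nothing is running (e.g., all VMs crashed): fall back to the minimum size.
        return min_vms
    # Proportional rule: grow or shrink the group by the ratio of observed to target load.
    wanted = math.ceil(current_vms * avg_cpu_percent / target_cpu_percent)
    # Clamp to the configured bounds so a load spike cannot cause unbounded cost.
    return max(min_vms, min(max_vms, wanted))
```

With four VMs at 90% average CPU and a 60% target, the rule asks for six VMs; at 15% it shrinks the group only to the configured minimum of two.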
With such powerful preventative features, the classic corrective pattern – increasing or replacing the hardware and resources, restoring the backup, and restarting the application – moves to the background.
The main preventive measures to increase resilience for coding, configuration, or data constellation issues are more testing and better software design. If bugs slip into production, causing a crash, fixing the bug and redeploying the code is the university textbook corrective measure. While such a fix is a necessity for repeated crashes, restarting the application – “Have you tried turning it off and on again?” – is the immediate tactical measure to bring the application back online. Scale Sets and similar services automate these self-healing restarts, though application teams should investigate frequent crashes. Finally, as always, restoring a backup is the last option, be it the configuration, the data, or the application code.
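The restart-then-investigate logic described above can be illustrated with a minimal in-process supervisor. This is a sketch under stated assumptions, not how Scale Sets or similar services are implemented: `run_with_restarts`, its parameters, and the exponential backoff are hypothetical choices for illustration.

```python
import time
from typing import Callable


def run_with_restarts(app: Callable[[], None], max_restarts: int = 3,
                      base_delay_s: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `app`, restarting it after each crash; return the number of crashes seen.

    Raises RuntimeError once the crashes exceed `max_restarts`, which is the
    signal that restarting is no longer enough and the root cause needs a fix.
    """
    crashes = 0
    while True:
        try:
            app()
            return crashes
        except Exception as exc:  # any crash of the (hypothetical) application
            crashes += 1
            if crashes > max_restarts:
                raise RuntimeError(
                    f"gave up after {crashes} crashes; investigate the root cause"
                ) from exc
            # Exponential backoff: wait 1x, 2x, 4x ... the base delay before restarting.
            sleep(base_delay_s * 2 ** (crashes - 1))
```

The restart limit encodes the point made above: occasional restarts are a tactical fix, while repeated crashes need to be escalated to the application team.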
Scenario 2: Infrastructure Resilience
Failures in the hardware or network layer sound like something from the 1980s but are still an issue today. In the IaaS world, the application teams must handle VM and disk failures. Manual restart is the default (recovery) option. However, the already mentioned Scale Sets (and similar services) are convenient preventive measures in the cloud to minimize the likelihood of outages.
The approach differs for PaaS services such as storage accounts, Amazon S3 buckets, DBaaS offerings, or AWS Lambda functions. Many offer various redundancy options for customers to choose from. Ideally, an organization’s cloud platform team defines and enforces minimal requirements for production environments. Then, all operational responsibilities are with the cloud provider.
The network layer has more facets. Customers decide how to set up the connectivity between clouds (e.g., their AWS and Google Cloud tenants) and between on-prem data centers and the cloud. Does an organization connect with GCP via the internet or the more reliable Cloud Interconnect service? And if with Cloud Interconnect, does the organization rely on one network carrier, or does it partner with two or more? The customer decides. Customers also set up their own routing and DNS services. However, they rely entirely on the cloud provider regarding the lower layers of the network backbone and connectivity within the data centers.
Scenario 3: Crash Cascade Resilience
Crash cascade resilience addresses the requirement that the crash of one application must not bring down other applications in a domino-style cascade. For example, a bank should ensure that issues in the core banking system do not affect the ATM solution, which approves (or declines) money withdrawals from customers around the globe in real-time, 24/7. However, architects and managers must understand that there are clear limitations.
Resilience patterns can buy some time in this context – maybe five minutes, five hours, or five days. The bet is that the application comes back online before there is any impact on others. As with the money withdrawal example, such patterns can only be temporary solutions. No ATM application can operate for weeks without updates of the customer account balances and credit scoring changes.
One implementation approach is straightforward: asynchronous integration patterns for application interactions, i.e., batch processes, messaging queues, and pub-sub. In contrast, synchronous (REST) API calls are simply evil. They cause the calling application to fail even if the counterparty system is down for just a second, unless the caller implements complex failure-handling logic. There is just one crucial footnote for asynchronous integration patterns: they typically rely on (messaging) middleware. The availability of this middleware is vital for the overall application landscape.
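The decoupling effect of a queue can be shown with a small in-process sketch. Python's standard `queue.Queue` stands in for real messaging middleware here; `CoreBankingStub`, `publish`, and the `Booking` message shape are hypothetical names introduced for illustration only.

```python
import queue
from typing import Dict, Tuple

# Hypothetical message shape: (account_id, amount_in_cents).
Booking = Tuple[str, int]


class CoreBankingStub:
    """Stand-in for a downstream system that may be offline for a while."""

    def __init__(self) -> None:
        self.online = False
        self.balances: Dict[str, int] = {}

    def drain(self, inbox: "queue.Queue[Booking]") -> int:
        """Process all waiting bookings if online; return how many were processed."""
        processed = 0
        while self.online:
            try:
                account, amount = inbox.get_nowait()
            except queue.Empty:
                break  # nothing left to process
            self.balances[account] = self.balances.get(account, 0) + amount
            processed += 1
        return processed


def publish(inbox: "queue.Queue[Booking]", account: str, amount: int) -> None:
    """Sender side: enqueue and return immediately, whether or not the receiver is up."""
    inbox.put((account, amount))
```

The sender can keep calling `publish` while the stub is offline; the bookings simply wait in the queue and are applied once `online` is set and `drain` runs. The footnote above still applies: if the queue itself is unavailable, every sender is affected.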
In the end, the cloud is not a game changer for this resilience scenario, though the clouds provide ready-to-use middleware and make it easier to restrict unwanted, direct inter-application connectivity, which forces applications to use middleware gateways. Furthermore, resilience against crash cascades is application-specific and only partially an IT topic; it is more a business design topic. Does the business allow the ATM solution to approve cash withdrawals based on yesterday’s data if the core banking system is down? Are limited withdrawals even possible if an ATM cannot reach the ATM solution? Only the business, in collaboration with IT, can define such business logic, which can contribute massively to the overall stability of the application ecosystem.
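One possible answer to the questions above can be written down as a small decision rule. This is purely an illustrative sketch: the offline limit, the maximum snapshot age, and all names (`BalanceSnapshot`, `approve_withdrawal`) are assumptions, and the actual values are exactly what the business would have to define.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class BalanceSnapshot:
    """Last known account balance, e.g., from a nightly batch (hypothetical)."""
    balance: float
    as_of: date


def approve_withdrawal(amount: float, live_balance: Optional[float],
                       snapshot: Optional[BalanceSnapshot], today: date,
                       offline_limit: float = 200.0,
                       max_snapshot_age_days: int = 1) -> bool:
    """Decide on a withdrawal; `live_balance` is None when core banking is down."""
    if live_balance is not None:
        # Normal case: core banking is reachable, use the current balance.
        return amount <= live_balance
    if snapshot is None:
        return False  # no data at all: decline
    if (today - snapshot.as_of).days > max_snapshot_age_days:
        return False  # snapshot too old to trust: decline
    # Degraded mode: allow small withdrawals covered by the last known balance.
    return amount <= min(snapshot.balance, offline_limit)
```

The degraded branch buys time rather than solving the outage: once the snapshot exceeds the accepted age, the rule declines everything, which mirrors the limitation that such patterns are only temporary.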
Scenario 4: Cyber-Attack Resilience
Withstanding cyber-attacks is the fourth and last scenario here. Cybersecurity specialists and CISOs have worked on this problem for decades. Thus, many organizations already have mature tools and processes in place.
Preventing and detecting cyber-attacks involves system hardening, penetration testing, access control, malware protection, and intrusion detection systems. The clouds have various features customers can activate quickly, speeding up the implementation of security controls compared to the old-fashioned on-prem world.
For containment, two complementary approaches exist: zone separation and endpoint detection and response (EDR). EDR tools isolate and quarantine individual infected laptops, servers, and VMs. In contrast, separating network zones is a fire-division wall approach aiming to prevent lateral movement by shutting down connectivity. So, if a company’s network in Australia is compromised, they cut the connectivity to the Singaporean and Swiss network zones. Then, engineers clean up the servers in Australia before reestablishing connectivity with Singapore and Switzerland. This is a solid approach, but only if applications and business are not too interwoven.
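The zone separation idea can be modeled as removing edges from a connectivity graph. The sketch below is a conceptual model only, not a firewall or cloud API; the zone codes and the helper names `link` and `isolate_zone` are hypothetical.

```python
from typing import FrozenSet, Set

# An undirected network link between two zones, e.g., frozenset({"AU", "SG"}).
Link = FrozenSet[str]


def link(a: str, b: str) -> Link:
    """Build an undirected link between two zones."""
    return frozenset((a, b))


def isolate_zone(links: Set[Link], compromised: str) -> Set[Link]:
    """Return the links that remain after cutting every link touching the compromised zone."""
    return {l for l in links if compromised not in l}
```

For the three zones from the example, isolating `"AU"` leaves only the link between `"SG"` and `"CH"`. Keeping the original set of links untouched makes it easy to restore connectivity after the cleanup.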
After containment comes recovery, i.e., restoring a pre-attack state from backups or redeploying applications with CI/CD pipelines. However, companies must be aware that attackers know about backups and try to delete them. So, immutable backups are a necessity, i.e., backups nobody can delete – not even admins. To further complicate matters, while containment and recovery tools are ‘mature’, coverage for non-VM workloads – containers and cloud-native services such as AWS Lambda or Amazon S3 buckets – can be limited.
Our exploration of the four critical scenarios – solution-internal resilience, infrastructure resilience, crash cascade resilience, and cyber-attack resilience – reveals how multifaceted the implementation of truly resilient IT and application landscapes is.
While the public cloud brings relief when aiming for redundancy and quick-to-activate security tools, preventing domino-style cascading application crashes remains a task for the individual application architectures. Their application design and business processes decide whether temporary decoupling from other applications and shielding them from external crashes is possible – a nightmare for managers hoping for quick solutions, a dream for ambitious architects who love working on real challenges.