Outages in the Cloud: A Learning Experience
July 28th, 2011 By: Industry Perspectives
Lucas Roh founded Hostway in 1998 and has since charted the company’s growth to an international presence. Hostway is ranked as one of the top-five Web hosting companies globally.
As the newer kid on the block for data storage, cloud services receive significant attention after outages. In spring 2011, Amazon Web Services experienced a widespread outage that disrupted sites such as Reddit, HootSuite, Quora and Foursquare. Word of the issue quickly spread, re-opening the discussion about cloud providers’ ability to maintain uptime and protect sensitive data.
Outages can directly impact the finances of cloud providers, who are consistently looking for new ways to limit the reach and duration of outage events.
Best Practices to Limit Outages
Companies that are already in the cloud and those who are considering cloud solutions should test their own critical systems to be sure their internal architecture can handle failure. Applications should also be tested for their ability to be restored quickly, which helps ensure the companies’ end users experience zero or minimal interruption in service.
Companies should consider randomly introducing failures, so the internal IT team can test their responsiveness under real-world conditions and find the best solutions to mitigate outages. Sharing the results of these stress tests and failures with the cloud provider will help build a more integrated system for managing outages.
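As a rough illustration of randomly introduced failures, the Python sketch below wraps a stand-in "primary" system so that it fails at random, then counts how many requests a fallback has to absorb. Every name here (InjectedFailure, flaky, with_fallback, run_drill) is hypothetical and invented for this example, as are the failure rate and the in-memory "primary" and "replica" systems. A real drill would target actual services and would be agreed with the provider beforehand.

```python
import random


class InjectedFailure(Exception):
    """Raised by the drill to stand in for a real outage (hypothetical)."""


def flaky(call, failure_rate, rng):
    """Wrap `call` so it fails at random with probability `failure_rate`."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated outage")
        return call(*args, **kwargs)
    return wrapper


def with_fallback(primary, fallback):
    """Try the primary system first; use the fallback if it fails."""
    def wrapper(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except InjectedFailure:
            return fallback(*args, **kwargs)
    return wrapper


def run_drill(requests, failure_rate, seed=0):
    """Send `requests` calls through the wrapped system and count who served them."""
    rng = random.Random(seed)  # seeded so a drill can be replayed exactly
    # Stand-ins for real systems: each just reports which one answered.
    primary = flaky(lambda key: ("primary", key), failure_rate, rng)
    fetch = with_fallback(primary, lambda key: ("replica", key))

    served_by = {"primary": 0, "replica": 0}
    for i in range(requests):
        source, _ = fetch(i)
        served_by[source] += 1
    return served_by


if __name__ == "__main__":
    print(run_drill(requests=1000, failure_rate=0.2))
```

The counts show how much traffic the fallback would need to carry at a given failure rate, which is the kind of result worth sharing with the provider.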
Top solution providers are consistently implementing new controls and safeguards that can limit the frequency and severity of outages. For example, they might upgrade their cooling or electrical systems, or introduce new security devices to prevent breaches. The provider should also have redundant and backup capacity to handle whatever load is needed after an outage. Their flexibility in scaling up or down to meet demand should also allow them to move data to redundant systems in case of an outage or disaster.
Mistakes Provide Key to Better Service
After an outage, quality service providers will analyze the data to identify any weak processes. They need to find out if hardware, human error, or perhaps internal documentation is to blame for the break in service. Once the root cause is identified, the provider needs to learn and adapt by implementing new procedures and safety checks.
While outages are significant events and should be avoided, customers should not react by moving back to on-premises solutions. Remember that outages occur every day in internal server rooms; they are simply not reported by the blogosphere or media outlets. Neither solution offers true 100% uptime, but the cloud does offer unmatched flexibility and efficiency.
Picking a Partner
In a crowded marketplace, where every provider makes claims about uptime, security and reliability, finding the right partner can be a challenge. Transparency of information is vital. Ask the provider to give detailed information about past outages, including what steps were taken during the outage and what new procedures were subsequently put in place. You want to stay informed, so be sure the provider has channels such as Twitter feeds or auto-emails to send clients status updates.
Without transparency, conjecture takes over, and clients can quickly lose confidence in the provider. Top providers will be proactive, both in their dissemination of outage information and their willingness to introduce redundancies to prevent outages.
Moving Past Outages
Despite the risks of outages, for most companies, the reward of lowered costs and greater efficiencies with the cloud is worth it. Going back to an in-house data center requires higher capital costs, maintenance, and in many cases would require hiring back more IT staff. Cloud computing is still in a growing stage, and while failures can be damaging in the short term, they serve a greater purpose by allowing cloud providers to evolve and become more proactive.
Outages will continue to occur. The best approach is to accept the risk and use outages as an internal learning tool to be sure your company’s data is backed up and secure. Cloud computing remains the future for business, as the combination of flexibility and lowered costs simply cannot be beaten.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.
Frank Posted July 28th, 2011
Hostway is experiencing a major outage right now – 7/28/2011 9:20am. I’m unable to reach any of my servers on their system. I can’t get a hold of anyone who has any answers and I’m losing hundreds of dollars an hour while my employees sit around doing nothing. And this guy is supposed to be an expert? Hostway has been reliable for us in the past, but I’m going to have to move to something else. Hours of downtime in the middle of the U.S. workday are simply unacceptable. Hostway – get your damn act together.
These are certainly the best practices that should be taken by businesses depending on cloud computing for their files and services. I always make sure–and I always recommend–to my clients to pick those that are really reliable. Most of all, they should always ask for the steps they usually take during downtime. This is a crucial step often forgotten even by huge companies.
Vijayanathan Naganathan Posted August 18th, 2011
Outages have a direct impact on the business of clients as well as on cloud service providers. Testing can play a big role in preventing outages in the cloud. It is essential to have an architecture assessment and an application validation done early in the life cycle. This ensures the robustness of the application, which minimizes service disruption for end users. IT organizations need to have test strategies in place that simulate outages internally in order to test how well the application handles such events.
The test strategy of cloud service providers should account for simulating real-time scenarios that involve infrastructure upgrades, auto scaling of resource pools and control break-ins. Collaboration between the cloud service provider and the IT owner of the application can help devise robust backup/redundant data storage approaches to handle outages more effectively.
To know more about Infosys’ Validation and testing Services go to http://www.infosysblogs.com/testing-services/