
Partial failures, service degradations and local issues are common

Gartner, a global research and advisory firm, highlights 9 principles for maximizing the resilience of cloud environments.

“Cloud is not magically resilient and software failures, not physical failures, cause almost all of its outages,” says Chris Saunderson, Senior Director Analyst at Gartner. “In the cloud, outages almost never involve your entire provider, nor are service outages likely to be total. Instead, partial failures, service degradations, individual service issues, or local issues are more common.”

Infrastructure and operations (I&O) teams need to understand the characteristics and common causes of cloud outages. Gartner analysts point out that most failures are partial and tend to be intermittent or to involve performance degradation, which makes them less noticeable. Resilience also differs among the services offered by cloud providers.

“Resilience is not a binary state,” explains Saunderson. “No one can claim absolute resilience: not you, not any cloud vendor. Clouds should be as resilient as, or even more resilient than, on-premises infrastructure, but only if I&O teams use them in a resilient way.”

Gartner analysts recommend that I&O leaders focus on 9 key principles to improve the resilience of their cloud-based environments:

 

  1. Alignment with Business: Align resilience requirements with business needs. Without this alignment, teams will not meet resilience expectations or will overspend. 
  2. Risk-Based Approach: Adopt a risk-based approach to resilience planning that extends beyond catastrophic events. Place more emphasis on the most common failures, which are also the ones companies have the most control over mitigating.
  3. Dependency Mapping: Build dependency graphs that map all middleware components, databases, cloud services, and integration points so they can be architected and configured for resilience and included in both reliability and disaster recovery (DR) planning. 
  4. Continuous Availability: A continuous availability approach focuses on keeping applications, services, and data available at all times and at agreed service levels, with no downtime and only limited impact during a failure event.
  5. Resilience by Design: The application itself must be resilient by design. Infrastructure resiliency alone is insufficient to provide the downtime-free services that end users expect. 
  6. DR Automation: Implementing fully (or nearly fully) automated disaster recovery, whether through the company's own tools or through third-party cloud-native DR tools, provides the foundation needed to meet aggressive recovery time objectives (RTOs) and allows you to routinely test disaster recovery.
  7. Resilience Standards: Adopt resilience standards beyond architecture and DR. Resilient systems require teams to focus on quality, automation and continuous improvement, and to embed quality throughout an application's lifecycle.
  8. Prioritize Cloud Native Solutions: Cloud providers offer a wide range of solutions that can be used to improve resilience. When feasible, I&O leaders should take advantage of these solutions instead of trying to invent their own alternatives and adding even more complexity.
  9. Focus on Business Functions: Rather than restricting thinking to just like-for-like “recovery,” explore options such as lightweight IT alternatives or lightweight application replacements that provide minimal business-essential functionality.
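
To make the dependency mapping principle more concrete, here is a minimal Python sketch of what a dependency graph could look like in practice. It is not part of Gartner's guidance: the component names (`checkout-app`, `payments-api`, and so on) and both helper functions are hypothetical, and a real inventory would likely be generated from deployment or tracing data rather than written by hand.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it calls directly.
DEPENDS_ON = {
    "checkout-app": ["payments-api", "orders-db", "message-queue"],
    "payments-api": ["payments-db", "cloud-kms"],
    "reporting-app": ["orders-db"],
    "orders-db": [],
    "message-queue": [],
    "payments-db": [],
    "cloud-kms": [],
}


def transitive_dependencies(component, graph):
    """Return everything `component` relies on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(component, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(graph.get(dep, []))
    return seen


def impacted_by(failed, graph):
    """Return the components that rely on `failed`, directly or indirectly."""
    return {c for c in graph if failed in transitive_dependencies(c, graph)}
```

Calling `impacted_by("cloud-kms", DEPENDS_ON)` on this sample map returns `checkout-app` and `payments-api`, which is the kind of answer a team would want when deciding which components to include in reliability and DR planning.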

Gartner clients can learn more in the research “9 Principles for Improving Cloud Resilience” and “Quick Answer: How Should Executive Leaders Plan for Cloud Outages?”.
