Scott Alcott, CIO, Comcast Corporation
Resiliency might not be the first thing that leaps to mind when you think about inventing the next great app, or modernizing your back office. But effective resiliency planning represents an essential cornerstone when you’re working to build a culture of innovation.
One thing we’ve heard loud and clear is that our customers want their services to work. While innovative new tools and services are welcome to consumers, and critical to businesses seeking to maintain a competitive edge, those innovations cannot come at the expense of stability.
When it comes to innovation, this dynamic creates an obvious challenge. Innovation is disruption, and disruption is – by its very nature – destabilizing. Companies can work around the disruptive aspects of new products by involving customers in the innovation process with public beta tests and trials, but nobody gets a pass on maintaining the stability of their existing products and services.
IT departments are increasingly in the public eye. We’ve all read the high-profile headlines about glitches and customer-impacting incidents at large and small enterprises worldwide. Beyond the disruption we invite on ourselves by innovating, we know that we all face mounting challenges in the form of incidents that – due to the networked nature of our systems – have the potential to impact millions of people.
Tackling resiliency offers an opportunity to address two critically important challenges: clearing the way for positive disruption, and ensuring that existing services are well protected against unforeseen events.
We started down the path to develop an enterprise-wide resiliency program in 2013, with a sweeping internal program that we called the Performance Management Reboot. Our initial approach, which was short on capital investment and long on sweat equity, was to take a much closer look at our top customer-impacting applications, identify challenges, and start implementing solutions.
Our motivations for launching Reboot were many, but one of the biggest driving factors was the breakneck pace of innovation and development happening within our IT department. Our enterprise-wide move into agile development, as well as major developments in the areas of customer self-service, support tools and data analytics, were changing how we do things for the better, but were also forcing everyone to adapt to new technologies and protocols.
We knew going into that process that there were five major causes of IT incidents:
• Systems and applications issues
• Third party application issues
• Network issues
• Local telecom issues
• Care team/facility issues
We set out to examine all five of those areas to understand our key pain points with a focus on how IT events impacted the customer. Through the Performance Management Reboot process we were able to improve the functionality of 10 critical applications with a focus on active-active load balancing, smaller failure groups and upgrades to end-of-life platforms.
We’ve also leveraged the work of our colleagues on the Comcast Elastic Cloud team, who have built our OpenStack-powered private cloud into a vital and thriving technological engine for groups throughout the company. By bringing our cloud services to bear on the Reboot process, we were able to use infrastructure-as-a-service (IaaS) and platform-as-a-service (PaaS) efficiencies to improve those core applications.
While narrow in scope, the results of the Reboot process were both immediate and significant. Without major investment, we were able to reduce IT incidents by 25 percent and cut the mean time to repair by 56 percent.
For us, as I expect is true for many companies, reductions in incidents, or in the time to fix them, aren’t just black numbers on a spreadsheet. Every incident we can prevent represents a customer who watches their favorite show on demand and without incident, or settles a billing issue online without hassle, or has a minor issue resolved quickly and courteously. Improving resiliency is critical to our larger mission of transforming customer experience into our best product.
With those results we demonstrated to ourselves, and to senior leaders throughout the company, that focusing on resiliency could deliver measurable results. Earlier this year, we took the step of formalizing the work we began with the Performance Management Reboot by establishing a permanent Resiliency Program.
The Resiliency Program expanded on the work begun by the Reboot, increasing the number of customer-critical applications that we’re focusing on from 10 to more than 30, and adding resources to accelerate improvements.
We’ve set an aggressive goal with the program of reducing the business impact of incidents on key applications by 40 percent (through reduction in incidents, duration, severity and functional/system failure group). And some of the early results we’ve seen this year suggest that we’re on track to exceed that number.
We have a lot of work ahead. With more than 600 applications, multiple data centers, thousands of miles of cable, connectors, firewalls, load balancers, databases, and systems all working interdependently to provide our customers world-class video, voice and Internet service, we know that we will occasionally experience customer-impacting issues. The Resiliency Program is not a silver bullet, but it is an effective way to organize stability efforts into a holistic regime, and create best practices to ensure that issues are minimized, and those that arise are addressed quickly and with a minimum of friction.
Foundation for Innovation
The ultimate value of institutionalizing and formalizing a Resiliency Program may lie in such a program’s ability to unlock innovative growth. Once we (and our customers) are confident that the trains are running on time, we can focus on building newer, better trains.
It’s an exciting time to be a CIO. Cloud and virtualization tools are enabling dynamic, dramatic improvements to how we deliver IT services and maximize the value and efficacy of those services throughout our organizations. Our IT operations must evolve in order for our organizations to remain competitive, and in order to evolve we must undertake major disruptive changes to how we do business.
And while tools like Software Defined Networking and Network Functions Virtualization may hold the keys to streamlining and supercharging back offices around the globe, they also represent major disruptive shifts in how organizations manage IT resources.
The strength of establishing a formalized Resiliency Program is that resiliency is agnostic to technological innovation. When you equip your organization to better manage through disruption, the type and cause of that disruption – whether self-imposed or externally driven – becomes less and less important. Your organization still reaps the reward in the form of fewer incidents and shorter windows for remediation.
Your Resiliency Program probably won’t win the Nobel Prize or garner splashy headlines, but it might just be the tool that ensures that your organization’s innovative investment yields real value to the people who matter most.