I’ve worked in the financial industry for almost a decade now, and in that time I’ve seen my fair share of network and system outages. When I started, I heard plenty of stories about “RGEs,” or “Resume Generating Events” as they were called around the office, and was jokingly advised to always keep my resume up to date. That’s not bad advice by itself, but I wanted to share my opinion on dealing with outages and their aftermath.
When I started, I lived in fear of what would happen if something went down on my watch. My heart skipped a beat each time an alert popped up on our monitoring system. And the first time an outage occurred, I was sure it was going to cost me my job.
Thankfully, that didn’t happen.
Over the years, I’ve not only grown less fearful of outages, I’ve also learned that there are two phases to every outage: the outage itself (including its resolution), and the post-mortem.
The Outage
Obviously, if something is going wrong, the primary objective is to restore service as soon as possible. Key to a rapid resolution is having clear channels of communication with other teams and knowing which people to bring into the loop when an issue hits. In other words, be transparent about problems: don’t wait to open channels of communication to partner teams or managers. In the financial industry, information is key. Clients might be unhappy that you’re having problems, but they will be downright livid if you had an issue and didn’t warn them. There can also be liability issues attached, so be sure to think these things through with your management.
Also vitally important are a strong understanding of the network, how things connect to each other, and a good monitoring system. It almost goes without saying, but if you don’t know how your systems relate to one another, and you have no way to monitor them, you’ve got bigger issues to deal with.
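To make that concrete, here’s a minimal sketch of what monitoring boils down to: poll the things you care about and make noise the moment one of them stops answering. The hostnames, addresses, and local SMTP relay below are made up for illustration; real tools (Nagios, Zabbix, and the like) do this same thing at scale, with scheduling, deduplication, and escalation layered on top.

```python
import smtplib
import socket
from email.message import EmailMessage

# Hypothetical list of critical hosts/ports to watch; substitute your own.
TARGETS = [("core-router-1.example.com", 22), ("trade-gw-1.example.com", 443)]

ALERT_FROM = "monitor@example.com"   # assumed addresses, for illustration only
ALERT_TO = "netops@example.com"
SMTP_HOST = "localhost"              # assumes a local SMTP relay is available

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def send_alert(subject: str, body: str) -> None:
    """Email an alert to the on-call mailbox via the local relay."""
    msg = EmailMessage()
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    # Run from cron every minute or so; anything fancier (flap suppression,
    # paging, dashboards) is what a proper monitoring platform adds.
    for host, port in TARGETS:
        if not is_reachable(host, port):
            send_alert(
                subject=f"ALERT: {host}:{port} unreachable",
                body=f"TCP check to {host}:{port} failed at poll time.",
            )
```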
The Post-Mortem
You’ve solved the issue and things are working smoothly again — disaster mitigated! What happens after the outage is, in my experience, just as important as resolving the issue itself. I believe this is where you can really shine and set yourself apart.
A mentor of mine once told me that what people really want after an outage is the answers to three questions:
- What happened?
- What did you do to fix it?
- When did you know about it?
And he followed up by saying that, of the three, the last question is really the most important. His opinion, one I now share, is that accidents happen: equipment breaks, circuits get the backhoe treatment, and you generally can’t avoid outages entirely. What you can do (besides practicing good operational principles) is respond promptly to your alerting system or to reports from users, and show initiative by investigating on the first call, not the tenth. Act quickly and openly.
Depending on the nature of your team, it’s likely that only a few people will understand the true technical cause of an issue. As you communicate up the chain, the gruesome details will be lost. It’s like the telephone game:
Engineer to manager: "We started taking excessive CRC errors on an interface without the line actually going down."
Manager to CTO: "We had a line problem."
CTO to CEO: "It broke."
Ok, so maybe that’s a little oversimplified, but you get the idea. You don’t speak the same language as the upper echelons of management, and conversely, they don’t speak your low-level technical language. All they care about is when we knew about it and how soon we fixed it.
Everything Else
This is not meant to be an exhaustive description of everything that should happen before, during, and after an outage. I also don’t want to get into the details of what constitutes good operational principles, but I’m thinking of things like exercising caution when implementing changes, not being cavalier about production environments, and only making changes after following proper change management procedures, and then only during approved change windows.
I’m also aware that there are different cultures around the human side of outages. Some companies or managers will roast you and leave you for dead no matter how small the misstep, while others take a more forgiving approach. But that’s a topic for another day.
Just remember that no matter how deep your bureaucracy or how carefully you plan, there will always be unexpected issues. Be sure you do everything you can to be responsive and transparent about the ones you face.
Have thoughts on this subject? I’d love to hear your comments!