Maintaining Our Infrastructure as Incident Commander

Maciej Mendrela

Publication: 2021-08-06 09:14
Updated: 2022-10-06 09:45

Keeping a modern, cloud-based SaaS system operational 24/7 is not a trivial thing, especially if your product grows 100% year over year. Over the last 6 years we have been working out a process that lets us keep an eye on our system's health without building a dedicated team.

There is still a lot of room for improvement, but today I would like to share how we maintain our application and present our approach. Before we move on, let’s get acquainted with some definitions.

On-call – a duty that ensures continuous operation of the system. Being on-call means being ready to take care of infrastructure issues (e.g. there is a risk that we start losing data) and to return all services to a normal state. In plenty of companies an on-caller is needed 24 hours a day, so answering the phone at 3 am is the norm. At edrone we have tried from the very beginning to set some boundaries so that on-call is not a burden for engineers: on-call lasts from 9 am to 10 pm, 7 days a week. In our Operational Level Agreement we want to start the investigation no later than 60 minutes after the problem notification.
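The 60-minute target can be written down as a small check. This is only an illustrative sketch: the function names and the idea of comparing two timestamps are assumptions made for the example, not a description of edrone's tooling.

```python
from datetime import datetime, timedelta

# Taken from the text: the investigation should start within 60 minutes
# of the problem notification.
OLA_RESPONSE_LIMIT = timedelta(minutes=60)


def investigation_deadline(notified_at: datetime) -> datetime:
    """Latest moment the investigation may start and still meet the OLA."""
    return notified_at + OLA_RESPONSE_LIMIT


def met_ola(notified_at: datetime, investigation_started_at: datetime) -> bool:
    """True when the investigation started within the OLA limit."""
    return investigation_started_at <= investigation_deadline(notified_at)
```

A check like this could feed a weekly report of how often the target was met, which is a design choice rather than something the process requires.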

Incident – there are probably many definitions of “incident” out there. Ours covers things like users being unable to log in or register, or the system starting to lose data. Beyond that, I believe people around the organization have a natural gut feeling for what should be treated as an incident and what can wait until the next working day.

Alarm – thanks to AWS services we are informed in real time, via Slack channels, about how our system is operating. Alarms are mostly triggered by metrics stored in CloudWatch. We keep an eye on almost every aspect of our system: we want to know when some part of it fails, when a message is stuck in a queue, and so on.
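To make the alarm-to-Slack path more concrete, here is a minimal sketch of turning a CloudWatch alarm notification into a Slack message. It assumes the alarm arrives as the JSON document CloudWatch publishes to SNS (with fields such as AlarmName, NewStateValue and NewStateReason) and that the message is posted through a Slack incoming webhook; the text does not say how the integration is actually wired, and the function name and message layout are made up for the example.

```python
import json


def alarm_to_slack_payload(sns_message: str) -> dict:
    """Turn a CloudWatch alarm notification (JSON text) into a Slack payload."""
    alarm = json.loads(sns_message)
    state = alarm.get("NewStateValue", "UNKNOWN")
    # Hypothetical convention: anything that is not OK needs attention.
    prefix = "[OK]" if state == "OK" else "[ATTENTION]"
    text = (
        f"{prefix} {alarm.get('AlarmName', 'unnamed alarm')} is {state}\n"
        f"Reason: {alarm.get('NewStateReason', 'n/a')}\n"
        f"Changed at: {alarm.get('StateChangeTime', 'n/a')}"
    )
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return {"text": text}
```

The function only builds the payload; sending it to a webhook URL is left out so the sketch stays free of network calls.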

Incident Commander – IC – we took this name from PagerDuty, since their definition of “IC” appeals to us the most: the IC is the person who “Keeps the incident moving towards resolution”. We do not expect every engineer, especially a new team member, to know everything, but we do require people to move things forward by contacting more senior colleagues and by keeping stakeholders in the loop, so that stakeholders know what is going on without constantly asking. It is the role that takes care of alarms during working hours, and it rotates between team members who are not participating in the “pitch cycle”. We want to keep an eye on our system daily and react whenever it is needed. From what we see, such a preventive approach works quite well in our case.

Incident Commander in practice.

Slack is our main notification system. On the “engineering” channel the information about rotation is displayed daily.

There are two notification groups: On-Call and IC. Both rotate independently. On-call is always covered by the whole engineering team, so as the team grows the number of duty days per person naturally diminishes.

On working days on-call lasts from 3 pm to 10 pm, and on weekends from 9 am to 10 pm. We rotate people throughout the month because we don’t want anyone assigned to a specific day: a different person on the Engineering team is on duty each day. It would be unfair if someone covered the weekends all the time. Thanks to daily notifications on Slack, engineers know when to expect their shift. Slack is not the only way to be notified about shifts, however: a properly configured xMatters mobile app can also send push notifications about a shift a few days in advance.
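The on-call hours and the daily rotation can be sketched in a few lines. The hours come from the text; everything else (function names, a plain round-robin over a list of names) is an assumption for illustration and not how the schedule is actually generated.

```python
from datetime import date, datetime, time, timedelta

ON_CALL_END = time(22, 0)     # 10 pm every day
WEEKDAY_START = time(15, 0)   # 3 pm on working days
WEEKEND_START = time(9, 0)    # 9 am on weekends


def in_on_call_window(moment: datetime) -> bool:
    """True when the given moment falls inside the on-call hours."""
    is_weekend = moment.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    start = WEEKEND_START if is_weekend else WEEKDAY_START
    return start <= moment.time() < ON_CALL_END


def daily_rotation(engineers: list[str], first_day: date, days: int) -> dict[date, str]:
    """Assign one engineer per day, cycling through the team in order."""
    return {
        first_day + timedelta(days=i): engineers[i % len(engineers)]
        for i in range(days)
    }
```

With a plain round-robin, a team whose size is not a multiple of 7 lands on a different weekday each cycle, so nobody is stuck with every weekend; a team of exactly 7 or 14 would need an extra shuffle.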

A similar rotation applies to the IC role. The Incident Commander role only exists during working hours, so notifications about shift changes arrive at 7 am, 5 days a week.

The topic of the “bugs” and “war” channels changes daily, so people from other departments know who is on call right now.

Slack channel strategies that help us to coordinate communication.

engineering – primarily used to exchange information inside the team. As mentioned before, engineers can find information about on-call/IC shifts here.
bugs – helps us react to global problems. Whenever the support team finds something critical, they notify us through this channel.
alerts – thanks to AWS services we get the most crucial information (alarms) about how the system behaves and can react to it, so we have a better understanding of our system’s current condition. This is one of the channels the Incident Commander keeps an eye on.
alerts-sql – aggregates all “slow SQL queries”, meaning those whose execution time is longer than 10 seconds. This channel helps us with two things. Firstly, it lets us correlate slow queries with timeout alarms on the alerts channel. Secondly, since we review slow queries weekly, it tells us which query to deal with first, based on frequency.
war – we always communicate through this channel when there is a serious outage; this usually gets more people engaged.
pingdom – we use Pingdom to check whether our application is fine from a customer’s perspective.
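The weekly review of the alerts-sql channel boils down to counting which slow queries show up most often. A minimal sketch, assuming the slow-query notifications have already been collected as (query text, duration in seconds) pairs; that input shape and the function name are assumptions for the example.

```python
from collections import Counter

SLOW_QUERY_THRESHOLD_SECONDS = 10.0  # threshold named in the text


def rank_slow_queries(executions, threshold=SLOW_QUERY_THRESHOLD_SECONDS):
    """Return (query, count) pairs for slow queries, most frequent first.

    `executions` is an iterable of (query_text, duration_seconds) tuples.
    """
    counts = Counter(
        query for query, seconds in executions if seconds > threshold
    )
    return counts.most_common()
```

The first entry of the result is the query that crossed the threshold most often, which is the one the review would look at first.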

Now that we know what each channel is responsible for, let’s talk about how it works on a daily basis.

Once an engineer puts on the IC hat, they usually start by reviewing the “alerts” channel. The goal is to go through each notification for the current day and check whether it is an actionable item (we can do something about it immediately) or a problem we already know about, with a Jira ticket waiting to be solved. A playbook is assigned to each notification, so no matter who is currently the IC, the initial steps of the investigation are clear. To keep our playbooks up to date, we try to add context whenever we find something unique.
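The triage decision described here can be sketched as a lookup: find the playbook for the alarm, then check whether the problem is already tracked. All names below (the alarm name, the playbook text, the ticket key, the Triage type) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical playbook registry: alarm name -> first investigation steps.
PLAYBOOKS = {
    "queue-depth-high": "Check consumer health, then inspect the oldest message.",
}


@dataclass
class Triage:
    action: str              # "investigate" or "known_issue"
    playbook: Optional[str]  # first steps, if a playbook exists
    ticket: Optional[str]    # existing Jira key, if the problem is known


def triage(alarm_name: str, known_issues: dict) -> Triage:
    """Decide whether a notification is actionable now or already tracked."""
    playbook = PLAYBOOKS.get(alarm_name)
    ticket = known_issues.get(alarm_name)
    if ticket is not None:
        return Triage("known_issue", playbook, ticket)
    return Triage("investigate", playbook, None)
```

An alarm with no playbook comes back with `playbook=None`, which is a natural prompt to write one after the investigation.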

To inform the rest of the team about progress during the day we follow an “emoticon coding” convention. Several emoticons tell us the status of each notification.

(eyes) – Tell the team you are looking at the problem

(repeat) – Duplicate of earlier notification

(heavy_check_mark) – Problem fixed, no more actions needed

(arrow_right) – Delegated – somebody else (not current IC) is responsible for investigation

(outbox_tray) – We triaged the problem and gathered details. We cannot fix it straight away, so we created a Jira ticket for later.

(no_entry_sign) – This should be ignored (testing, false positive)
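The emoticon convention above maps naturally onto a small lookup table. The sketch below uses Slack's emoji short names (Slack spells the outbox emoji outbox_tray); the status labels, the precedence order and the function names are assumptions for illustration.

```python
# Slack reaction name -> short status label (labels are made up).
REACTION_STATUS = {
    "eyes": "in_progress",
    "repeat": "duplicate",
    "heavy_check_mark": "fixed",
    "arrow_right": "delegated",
    "outbox_tray": "ticket_created",
    "no_entry_sign": "ignored",
}

# Assumed precedence when a notification carries several reactions:
# a final outcome wins over "someone is looking at it".
PRECEDENCE = [
    "heavy_check_mark",
    "outbox_tray",
    "arrow_right",
    "repeat",
    "no_entry_sign",
    "eyes",
]


def notification_status(reactions):
    """Return the status implied by the reactions on one notification."""
    for name in PRECEDENCE:
        if name in reactions:
            return REACTION_STATUS[name]
    return "untouched"


def untouched(notifications):
    """List the notifications nobody has reacted to yet."""
    return [
        key
        for key, reactions in notifications.items()
        if notification_status(reactions) == "untouched"
    ]
```

Running `untouched` over the day's notifications gives the IC a quick list of what still needs a first look.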

We want to keep the number of alarms as low as possible to give the Incident Commander space to investigate each of them. When the number of alarms grows, it becomes impossible to get to the root cause of every issue. The more problems we solve during working hours, the smaller the chance that something potentially dangerous is missed during someone’s on-call shift.

Training for new engineers & preparation.

New engineers spend around two months outside the IC rotation before they take on the role. During that time they have an opportunity to see other engineers in action. We use shadowing here: the new engineer follows, or shadows, a colleague who is already in the IC role. It is certainly not enough to be fully prepared, but with playbooks and colleagues to help, a new engineer is ready to start their journey as an Incident Commander.

Nevertheless, it is a great way to learn other parts of the system while still making sure the system stays healthy.

Preparation phase includes:

Smartphone
  Add all of your team members and people you regularly work with to your contacts
  Check that your contact details in the internal document are up to date
  Install and configure a Slack client on your phone
  Install the Google Meet app on your phone
  Install and configure xMatters
Check access to tools
  Google Account
  AWS Console
  Admin
  xMatters
  Slack
  BitBucket
  JIRA
Check VPN access
Check access to the TEST environment

To get ready for the IC role, new engineers can do a few things when they start the job.

Firstly, the sooner an engineer gets started, the less time they will spend playing catch-up and the more comfortable they will feel in the position. Learning as much as you can right away will help you adapt quickly.

Similarly, you must be able to take initiative. If you take initiative you will most likely make mistakes, but do not let these rattle you. Mistakes can be fixed, and experience is your best teacher. Especially in this role, you never know until you try something. Your first initiative may fail, but you will still be rewarded for trying.

Finally, always ask questions. The whole team is here to help you, and they will be happy to answer. Remember that your co-workers were once in your position.

Conclusion

The Incident Commander role should be properly described, as it is an important part of any organization. Creating an IC culture can be time-consuming and requires rigorous preparation. However, understanding the purpose of the IC will help the organization determine the root cause of application issues much faster on a daily basis.

Maciej Mendrela

Senior Software Engineer at edrone. Over five years of experience. Co-creator of the edrone CRM, responsible for architecture design and implementation.