Being on-call
Like any other software engineer, I’ve done my fair share of troubleshooting. I’ve also handled enough on-call shifts that I want to share my experience and the sequence of events that unfolds when a production incident is reported and impacts hundreds of engineers and customers.

In this post I’d like to share some incident management techniques that helped me stay composed under pressure and handle incidents efficiently while working as an SRE, and later managing an SRE team, in my previous role.

Every org has effective mechanisms in place for monitoring and alerting, triaging problems, and a clear incident response procedure. These are of the utmost importance. To get ready, you should have shadowed the on-call during the previous week to get a good overview of everything.
Guidelines an SRE must follow:
Handover:
- Before an on-call week begins, I talk to the person who is currently on call to find out about any problems or strange bugs that were found in our stack during the previous week.
- Any outages that happened during their on-call tenure.
- Any open issues or workarounds to be handed over.
Organizing Slack:
- If you use Slack, now is the time to structure your internal communication.
- Have dedicated channels for user reports and system alerts; this makes it easy for the on-call person to categorize issues and act on them by severity and priority.
- Mute or leave channels you are not actively participating in to reduce clutter and distraction (a must while you are on call).
- A dedicated channel for incidents.
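If your team automates the incident channel, a minimal sketch using the official slack_sdk client could look like the following; the token, channel naming scheme, responder IDs, and kickoff message are placeholders, not a prescription.

```python
# Minimal sketch: create a dedicated incident channel and pull in responders.
# SLACK_BOT_TOKEN, the "inc-<id>" naming scheme, and user IDs are placeholders.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id: str, responder_ids: list[str]) -> str:
    """Create #inc-<id>, invite responders, and post a kickoff note."""
    channel = client.conversations_create(name=f"inc-{incident_id}")
    channel_id = channel["channel"]["id"]
    client.conversations_invite(channel=channel_id, users=responder_ids)
    client.chat_postMessage(
        channel=channel_id,
        text=f"Incident {incident_id} declared. Triage happens in this channel.",
    )
    return channel_id
```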
Alerts:
- Trust the system, but verify; not all alerts are incidents. For example:
- An alert could be non-actionable because it was recently configured with a threshold that makes it too noisy; in other words, it’s a false positive.
- A team member doing a deployment forgot to silence the alerts.
- The monitoring stack itself is down and alerts are being triggered.
All of the above has happened to me at some point. Whenever you receive a notification, whether from a user or from the system, review your services (metrics, logs, try to reproduce the reported problem, etc.) to confirm whether the issue is real.
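Before acting on a page, I like to re-run the alert’s underlying query myself. Here is a minimal sketch against the Prometheus HTTP query API, assuming your alerts are backed by Prometheus; the URL, query, and threshold are placeholders for whatever your own alert rule actually uses.

```python
# Minimal sketch: re-run an alert's query before treating it as an incident.
# PROM_URL, QUERY, and THRESHOLD are placeholders for your own alert rule.
import requests

PROM_URL = "http://prometheus.example.internal:9090"
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'
THRESHOLD = 50  # errors/sec the alert claims we are exceeding

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
results = resp.json()["data"]["result"]

current = float(results[0]["value"][1]) if results else 0.0
if current >= THRESHOLD:
    print(f"Alert looks real: {current:.1f} errors/sec >= {THRESHOLD}")
else:
    print(f"Likely noise or a stale threshold: {current:.1f} errors/sec < {THRESHOLD}")
```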
Declaring an incident:
- Not all actionable alerts are incidents.
- Whether something is an incident depends on severity and priority as defined by the org’s product charter. Every org has defined guidelines for SRE teams about what warrants an incident, along with the different levels of severity. This eliminates any room for speculation and gives the entire team a common understanding.
- Consider whether an issue requires coordination with other teams, impacts customers, or violates an SLO. Declare an incident if any of these criteria are met; it’s better to declare an incident early than late.
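The SLO part of that decision can be mechanical. Here is a rough sketch with an illustrative 99.9% availability target and made-up request counts; plug in whatever target and window your own SLO defines.

```python
# Rough sketch: does the current error rate violate our availability SLO?
# The SLO target and observed counts below are illustrative placeholders.
SLO_TARGET = 0.999          # 99.9% of requests should succeed
total_requests = 1_200_000  # requests served in the current window
failed_requests = 4_800     # requests that returned errors

availability = 1 - failed_requests / total_requests
error_budget = 1 - SLO_TARGET                         # 0.1% allowed failures
budget_consumed = (failed_requests / total_requests) / error_budget

print(f"Availability: {availability:.4%}")
print(f"Error budget consumed this window: {budget_consumed:.0%}")

if availability < SLO_TARGET:
    print("SLO violated -> declare an incident now rather than waiting.")
```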
Communication during an Incident
- Give every incident a clear title to inform stakeholders.
- State the incident scope (impact, region, environment, feature, customers).
“The site’s login isn’t working” vs. “The site’s login isn’t working for US traffic” are significantly different.
Communicate and assign an incident lead
- Slack becomes your buddy: everyone should join the incident channel from the incident notification, which gives an initial impression of what’s going on with the system.
Communicate changes to the system
- Always make modifications to the system only after getting quorum/consensus, so the after-effects can be easily tracked.
Provide regular updates
- It’s easy to get overwhelmed by the incident’s technical specifics and miss sending an update. This leads to furious leaders and confused customers.
- Either send an update every 30 minutes or update your WebUI dashboard, if you are maintaining one (a small sketch of an automated update follows this list).
- If things aren’t moving or going in the right direction, call for help or follow the escalation matrix. It’s not your job alone to fix the problem, uncover the root cause, and mitigate the issue; it’s a collaborative effort.
- Your focus must be on actively pulling in other engineers who could help with debugging or resolving the issue early in the process.
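Here is a small sketch of what a structured, scheduled update could look like when posted through a Slack incoming webhook; the webhook URL and the field contents (title, severity, impact, status) are placeholders to adapt to whatever your incident template prescribes.

```python
# Minimal sketch: post a structured status update to the incident channel.
# The webhook URL and the update fields below are placeholders; Slack
# incoming webhooks accept a simple {"text": ...} JSON payload.
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

update = {
    "title": "Login failing for US traffic",
    "severity": "SEV-2",
    "impact": "US customers cannot sign in; other regions unaffected",
    "status": "Mitigation in progress: rolling back auth-service v1.42",
    "next_update": "in 30 minutes",
}

text = "\n".join(f"*{k.replace('_', ' ').title()}*: {v}" for k, v in update.items())
requests.post(WEBHOOK_URL, json={"text": text}, timeout=10).raise_for_status()
```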
Collaboration
- Collaboration is essential for resolving incidents. Given the multifaceted SRE function, it’s a crucial yet undervalued skill.
- Ask questions, but make them detailed and pointed: remember, you are on an incident.
- With several people participating, various alternatives get offered, and the incident lead must steer these discussions while filtering out the noise; share a probable cause early to eliminate options.
Resolution
- Look at your change schedule: things break when something new is introduced.
- Code and config deployments are two of the most common actions that result in change, so rule those out first. A few places to check for change are:
- Application-specific code or configuration deployments
- Infrastructure-wide configuration changes (network, OS, etc.)
- Traffic pattern changes
- Identify what’s unusual about the system that’s causing the problem. Most of the time it’s a combination of changes that brought you to this incident.
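A toy sketch of that “rule out change first” habit: correlate the incident start time with recent changes and surface the ones that landed shortly before. The change records here are invented; in practice they would come from your deploy pipeline or change log.

```python
# Toy sketch: surface changes that landed shortly before the incident started.
# The timestamps and change records are invented placeholders.
from datetime import datetime, timedelta

incident_start = datetime(2024, 3, 5, 14, 20)
lookback = timedelta(hours=2)

recent_changes = [
    {"time": datetime(2024, 3, 5, 13, 55), "type": "deploy", "what": "auth-service v1.42"},
    {"time": datetime(2024, 3, 5, 9, 10),  "type": "config", "what": "CDN cache TTL change"},
    {"time": datetime(2024, 3, 5, 14, 5),  "type": "infra",  "what": "network ACL update"},
]

suspects = [
    c for c in recent_changes
    if incident_start - lookback <= c["time"] <= incident_start
]

for c in sorted(suspects, key=lambda c: c["time"], reverse=True):
    print(f"{c['time']:%H:%M} {c['type']:>6}: {c['what']}  <- candidate to rule out")
```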
Learnings from the incident
- Treat these incidents as an opportunity to learn something new about the system and your product.
- It makes incident response entertaining and intriguing, like being a detective. When the incident is resolved, the team feels satisfied, accomplished, and proud, and that’s priceless :)
- Keep calm and carry on, one step at a time.
Last but not least: be available to help teammates when you are not on call; you’ll help them and learn something yourself. Ensure the postmortem and incident resolution are blameless. That promotes ownership, speeds up issue resolution, and improves team performance over time.