How to Design a Good On-call Process 🚨
Everything you need to know about rotations, with lessons from Netflix, Dropbox, Intercom, and Google.
Hey 👋 this is Luca! Welcome to a new 🔒 weekly edition 🔒 of Refactoring.
Every week I write advice on how to become a better engineering leader, backed by my own experience, research and case studies.
On-call is a divisive topic in engineering, and for good reason. People hate being on call because it's stressful and disruptive to their personal lives, even when they don’t actually get paged.
I know this firsthand.
As a founder & CTO, I feel I have spent enough time on call for this life and the next three or four. In the worst cases, it disrupted my sleep, hurt my morale, and left me not wanting to be anywhere near a computer again.
But it doesn't have to be this way.
If people hate being on call, chances are you are doing it wrong. In the best teams, being on call actually improves the team’s morale. In fact, it can bring several benefits, like:
Strengthening the relationship between engineers and customers
Developing better ownership by engineers
Maintaining better docs
Enforcing good instrumentation and observability
In this article, we will explore the key elements that make an on-call process successful and cover how to design a great one. It draws on my own experience and that of successful companies like Netflix, Dropbox, Honeycomb, Intercom, and Google.
We will cover:
🏅 Ownership — the (non) difference between engineers and ops people.
📏 Scope — what goes into an on-call shift.
✏️ Designing rotations — everything you should take care of.
📉 Reducing effort — best practices to make things sustainable.
📊 Metrics — how to measure your on-call process.