How to Design a Good On-call Process
Everything you need to know about rotations, with lessons from Netflix, Dropbox, Intercom, and Google.
On-call is a divisive topic in engineering, and for good reason. People hate being on call because it's stressful and disruptive to their personal lives, even when they don't actually get paged.
I know it from up close.
As a founder & CTO, I feel I have spent enough time on call for this life and the next three or four. In the worst cases, it disrupted my sleep, hurt my morale, and left me not wanting to be anywhere near a computer again.
But it doesn't have to be this way.
If people hate being on call, chances are you are doing it wrong. In the best teams, being on call actually improves the team's morale. In fact, it can bring several benefits, like:
Strengthening the relationship between engineers and customers
Developing better ownership by engineers
Maintaining better docs
Enforcing good instrumentation / observability
In this article, we will explore the key elements that make an on-call process successful, and we'll cover how to design a great one. It draws on my own experience and that of successful companies like Netflix, Dropbox, Honeycomb, Intercom, and Google.
We will cover:
Ownership: the (non) difference between engineers and ops people.
Scope: what goes into an on-call shift.
Designing rotations: everything you should take care of.
Reducing effort: best practices to make things sustainable.
Metrics: how to measure your on-call process.
Let's dive in!
Hey, this is Luca! Welcome to a weekly edition of Refactoring.
Every week I write advice on how to become a better engineering leader, backed by my own experience, research and case studies.