What is Toil in SRE and How Can You Reduce It?

Toil is the manual, repetitive operational work that keeps systems running but leaves nothing better behind once you finish it. The system keeps running for a while, then needs the same intervention again the following week.

The word comes from Google's Site Reliability Engineering book. The team there needed a name for the work that consumed engineering time without producing anything they could point to later, so they could measure it and reduce it.

The test is simple. If a script could do the job, if you have done it before, and if nothing about the system is better once you finish, you were doing toil.

Restarting a hung service at 3 a.m. is toil. Renewing the same SSL certificate by hand every six months is toil. Closing the same low-priority ticket repeatedly is toil.

Planning meetings, design documents, and code reviews are not toil. Meetings are overhead, and design documents and code reviews are engineering work that produces something durable. Toil leaves nothing durable behind.

What are the Seven Characteristics of Toil?

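Google's SRE book describes six traits that mark a task as toil. The list below splits the book's "tactical" trait into its two halves, reactive and interrupt-driven, which gives seven. A task does not need all seven to qualify, and the more traits it carries, the more clearly it counts.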

Manual. A human is the one running the commands, clicking the buttons, or moving values between systems. Without that person sitting at the keyboard, nothing moves.

Repetitive. You did the same task last month, and you will do it again next month, on a rhythm you could put on a calendar.

Automatable. A machine could do the job exactly as well, or better. The only missing piece is someone to write the code.

Reactive. The work starts because something broke, alerted, or demanded attention. You did not choose to begin it; the system pulled you in.

No enduring value. Once you finish, the service is in the same state it was before. Nothing has been built, fixed, or improved for next time.

Scales with the service. As the service grows, the toil grows alongside it. Twice the customers means twice the tickets, twice the restarts, and twice the manual checks.

Interrupt-driven. The task arrives in the middle of work that mattered, breaks your focus, and forces a context switch you did not plan for.

Counting the traits a task carries gives a rough score out of seven. A senior engineer rebuilding a deployment pipeline scores zero. A junior engineer clearing the same stuck queue every morning scores six or seven. That gap is what most teams need to close.
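The trait count can be sketched as a simple checklist. This is an illustration only: the class name, field names, and the trait values assigned to each example task are assumptions made here, not a scoring scheme taken from Google's book.

```python
from dataclasses import dataclass, fields


@dataclass
class ToilTraits:
    """One yes/no flag per trait from the list above (field names are illustrative)."""
    manual: bool = False
    repetitive: bool = False
    automatable: bool = False
    reactive: bool = False
    no_enduring_value: bool = False
    scales_with_service: bool = False
    interrupt_driven: bool = False


def toil_score(traits: ToilTraits) -> int:
    """Count how many of the seven traits the task carries."""
    return sum(bool(getattr(traits, f.name)) for f in fields(traits))


# Planned engineering project: none of the traits apply.
pipeline_rebuild = ToilTraits()

# Daily stuck-queue cleanup. Assumed here not to grow with the service,
# which is why it lands on six rather than seven.
stuck_queue = ToilTraits(
    manual=True,
    repetitive=True,
    automatable=True,
    reactive=True,
    no_enduring_value=True,
    scales_with_service=False,
    interrupt_driven=True,
)

print(toil_score(pipeline_rebuild), toil_score(stuck_queue))
```

A score like this is only a sorting aid. Whether a given trait applies is still a judgment the team has to make for each task.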

What is the 50 Percent Rule for Toil in SRE?

Google's SRE book sets a cap: no SRE should spend more than half of their time on toil. The other half goes to engineering work that reduces future toil or adds service features. The number is not sacred; it functions as a tripwire. When a team crosses 50 percent, something has to change: hire, push back on new services, or stop accepting more work until the toil comes down. The cap forces that conversation before burnout sets in.

The cost of ignoring it is real. Engineers wear down, experienced people leave, and those who remain are stuck on repetitive work that does not help them grow, because "cleared 4,000 tickets" does not belong in a promotion case. The team has nothing left for the engineering that would have prevented the toil in the first place.

The 2024 Accelerate State of DevOps report from Google's DORA team reinforced this. It names reducing toil as one of the practical strategies for improving developer experience, and links burnout directly to the conditions that constant operational work creates.

Expect the first measurement to be uncomfortable. Most teams find they are well past 50 percent on their first honest count, which is normal and is the reason for counting.

How to Reduce Toil

The first step in reducing toil is visibility. Track how engineering time is spent over a fixed period, then classify it into toil, engineering work, and overhead activities.

Toil often hides within on-call rotations, incident response, and legacy operational routines that persist without review.
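A first count can be as simple as a tagged time log. The sketch below assumes each entry is a (category, hours) pair logged by hand over one week. The category names match the three buckets above, and the sample numbers are invented for illustration.

```python
from collections import defaultdict

TOIL_CAP = 0.50  # the 50 percent tripwire


def time_shares(entries):
    """Return each category's share of total logged hours.

    entries: iterable of (category, hours) pairs.
    """
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: hours / grand_total for category, hours in totals.items()}


# Invented sample week for one engineer.
week = [
    ("toil", 6.0),         # ticket queue
    ("engineering", 10.0),  # automation project
    ("toil", 16.0),        # on-call interventions
    ("overhead", 8.0),     # meetings
]

shares = time_shares(week)
over_cap = shares.get("toil", 0.0) > TOIL_CAP
print(f"toil share: {shares.get('toil', 0.0):.0%}, over cap: {over_cap}")
```

The same function works per engineer or per team. The useful part is running it on the same schedule every period so the numbers are comparable.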

Once toil is identified, four approaches help reduce it:

1. Automate repetitive response paths

If an alert always triggers the same response, automate the entire flow so the system handles it directly.

2. Enable self-service workflows

Provide internal tools or portals so engineers or users can resolve common requests without manual intervention.

3. Fix root causes instead of symptoms

Recurring incidents usually indicate underlying design issues that should be corrected rather than repeatedly handled.

4. Make toil visible over time

Tracking and reviewing toil metrics regularly ensures it remains a management priority rather than an invisible burden.
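The first approach, automating a response path, often comes down to a lookup from alert name to a known fix. The sketch below is a minimal illustration under stated assumptions: the alert names, the alert dictionary shape, and the remediation functions are hypothetical, and the functions only return a description where a real handler would call a service manager or orchestrator.

```python
from typing import Callable, Dict


def restart_service(alert: dict) -> str:
    # Placeholder: a real handler would trigger the restart here.
    return f"restarted {alert['service']}"


def clear_queue(alert: dict) -> str:
    # Placeholder: a real handler would drain or requeue stuck messages.
    return f"cleared queue for {alert['service']}"


# Alerts that always get the same response map straight to that response.
RUNBOOK: Dict[str, Callable[[dict], str]] = {
    "service_hung": restart_service,
    "queue_stuck": clear_queue,
}


def handle_alert(alert: dict) -> str:
    """Run the known fix if one exists; otherwise escalate to a person."""
    action = RUNBOOK.get(alert["name"])
    if action is None:
        return f"paged human for {alert['name']}"
    return action(alert)


print(handle_alert({"name": "service_hung", "service": "api"}))
print(handle_alert({"name": "disk_full", "service": "db"}))
```

Keeping the fallback explicit matters: only alerts with a proven, identical response belong in the table, and everything else still reaches a person.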

Not all toil should be eliminated immediately. Rare or low-impact tasks may not justify automation costs. However, tasks that occur frequently, carry high risk, or scale with system growth should be prioritized.
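One rough way to compare candidates is a break-even estimate: how many weeks of saved manual time it takes to pay back the automation effort. The function and the sample figures below are assumptions for illustration. The estimate covers frequency only, so risk and growth still need to be weighed separately.

```python
def breakeven_weeks(minutes_per_run: float, runs_per_week: float,
                    automation_hours: float) -> float:
    """Weeks until time saved equals the time spent building the automation."""
    weekly_hours = minutes_per_run * runs_per_week / 60
    if weekly_hours <= 0:
        return float("inf")  # a task that never runs never pays back
    return automation_hours / weekly_hours


# Invented figures: a 15-minute task done every weekday,
# and a 30-minute task done about once a month.
daily_task = breakeven_weeks(minutes_per_run=15, runs_per_week=5, automation_hours=10)
monthly_task = breakeven_weeks(minutes_per_run=30, runs_per_week=0.25, automation_hours=10)

print(f"daily task pays back in {daily_task:.0f} weeks")
print(f"monthly task pays back in {monthly_task:.0f} weeks")
```

With these numbers the daily task pays back in about two months and the monthly one in well over a year, which is the kind of gap that justifies automating one and leaving the other manual for now.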

The objective is not to eliminate toil entirely, but to keep it below a sustainable threshold so engineering effort can focus on long-term reliability and improvement.
