Site Reliability Engineering (SRE) Tips #6

Jul 26, 2022

I attended a lecture led by the people who wrote the following golden book. If you are an SRE and would like some thought-provoking knowledge, this is a book I’d strongly suggest picking up.

Observability Engineering book by O’Reilly

Some interesting nuggets that I got from them were:

If you implement observability from the point of development on a local branch, catching bugs and issues there will save you far more cost than waiting to catch them in staging.
Tracing shouldn’t start in staging or production; it should start when you’re typing code, or even before if possible. The earlier you catch an issue, the more you save.
SREs get a lot of power to decide the direction of the observability implementation, but this power should be shared with the developers because, at the end of the day, most of the implementation work has to be factored into the cost of the development they carry out.
A testing environment that is not highly consistent with production won’t catch the errors that come out of particular permutations of the production environment. (An issue appears when the stars align; if the environments are not consistent, the stars cannot align that way in testing, so you can’t catch it.)
Using one consistent tool across the board has to be balanced against using the right tool for the job: weigh the cost of context switching against the value the extra tool adds.
State of flow is important and is something that should be optimized. If you need to wait 10 minutes for your metrics to show up or for something to process, the process has already broken down. The faster you can get from typing code to seeing it in action, including all the intermediate steps, the better.
Observability of pipelines is overlooked and underdeveloped in many places.
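
To make the "start tracing while you are typing code" point concrete, here is a minimal sketch of what tracing on a local branch could look like. It is my own illustration, not something from the lecture or the book: it uses only the Python standard library, and the names `span`, `SPANS`, `parse_order` and `slowest` are all hypothetical. A real setup would more likely use a tracing library and export to a backend instead of an in-memory list.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory span store, for illustration only.
# A real setup would export spans to a tracing backend instead.
SPANS = []


@contextmanager
def span(name, **attributes):
    """Record the duration of the wrapped block and any error it raised."""
    start = time.perf_counter()
    record = {"name": name, "attributes": attributes, "error": None}
    try:
        yield record
    except Exception as exc:
        # Keep the error on the span, then let it propagate unchanged.
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)


def parse_order(raw):
    """Hypothetical application code, instrumented from the first commit."""
    with span("parse_order", size=len(raw)):
        item, qty = raw.split(",")
        return {"item": item, "qty": int(qty)}


def slowest(spans, n=3):
    """Return the n slowest spans, slowest first."""
    return sorted(spans, key=lambda s: s["duration_ms"], reverse=True)[:n]
```

Calling `parse_order("widget,2")` on your own machine leaves a span in `SPANS` with a duration and no error, while a malformed input such as `parse_order("bad")` raises as usual but leaves a span with the error attached. The feedback arrives in seconds rather than after a deploy to staging, which is also the state-of-flow point from the list.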

Written by Robert Wijntjes