The title is succinct, but in practice an organization’s “observability” efforts span a number of disciplines. This document aims to compress that breadth of topics into a short (fitting on one printed page) set of best-of-breed write-ups that avoid any ties to vendors or implementations.
- Cindy Sridharan’s Distributed Systems Observability: This is a free, 25-page eBook that does a great job summarizing the various tools and concepts available. You could also read Monitoring and Observability from Cindy’s blog for a shorter and less formal version of the same.
- Google’s Site Reliability Engineering: This book has provided a lingua franca for discussions around the practices of reliability. It includes lots of great material like “Golden Signals”, error budgets, and more. You can read it online for free.
- Fred Hebert’s Operable Software careens from observability to user experience and reminds us why this is all so important.
- Specific guidance for some of the “pillars” of Observability:
- Prometheus’ Guide To Metric Types covers the basics of metrics, and their Metric and Label Naming guide reminds us that our metrics are also an interface for our engineers and shows how we can standardize them.
- Charity Majors’ Logs Vs Structured Events describes how we can turn logging from a burden into a blessing.
- On the practice of measuring and using this tooling for the day-to-day:
- Coda Hale’s Metrics, Metrics, Everywhere touches on everything from mental models to OODA loops and generally explains how to measure and why you should. It’s 7 years old and the recording is low quality, but it is the best summary I’ve ever heard.
- Baron Schwartz’ What Metrics Should I Monitor helps to frame what to pay attention to in systems. It’s aimed at MySQL but can be applied to other systems.
- Kavya Joshi’s Applied Performance Theory is an excellent talk whose title couldn’t be more apt. It gives quick and practical advice on using many of the formal topics from performance engineering.
- John Allspaw’s Owning Attention (Considerations for Alert Design) is a class in alerting for humans.
- On incidents:
- Gremlin’s How to Establish a High Severity Incident Management Program provides a good example of how to think about and handle incidents.
- The STELLA Report presents the findings from the review of a few incidents and how engineers cope with them. It provides some strong food for thought for organizations and is a good gateway drug into the work of
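
To make the structured-events idea from the list above a little more concrete, here is a minimal Python sketch of emitting one wide event per unit of work instead of scattered free-text log lines. It is not taken from any of the linked write-ups: the helper names (`build_request_event`, `emit`) and the field names are hypothetical, and the `duration_seconds` field simply borrows the base-unit naming habit from the metric naming guidance.

```python
import json
import time
import uuid


def build_request_event(service, route, status_code, duration_seconds, **context):
    """Return one wide event (a flat dict) describing a single request.

    All names here are illustrative assumptions, not a standard schema.
    """
    # Start from caller-supplied context (user tier, build id, and so on).
    event = dict(context)
    # Core fields are written last, so a caller-supplied "timestamp" or
    # "event_id" in the context cannot overwrite them.
    event.update(
        {
            "timestamp": time.time(),
            "event_id": str(uuid.uuid4()),
            "service": service,
            "route": route,
            "status_code": status_code,
            "duration_seconds": duration_seconds,
        }
    )
    return event


def emit(event, stream=None):
    """Write the event as a single JSON line and return that line."""
    line = json.dumps(event, sort_keys=True)
    print(line, file=stream)  # stream=None means standard output
    return line


if __name__ == "__main__":
    emit(build_request_event("checkout", "/cart", 200, 0.042, user_tier="free"))
```

Because every event is one self-describing JSON object, any field can later be filtered or grouped on without parsing message strings, which is roughly the property the structured-events argument is after.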
Enjoy mailing lists and such? Here are some good ones:
- Monitoring Weekly is exactly what it sounds like.
- Thai Wood’s Resilience Roundup summarizes papers in the resilience space and adds special insight from his combined tech and EMT background.
- Lex Neva’s SRE Weekly frequently hits topics in or adjacent to observability.
As a longtime advocate of observability, I hope it’s OK to add a few bits of my own. First, my definition:
Observability is a quality of software, services, platforms, or products that allows operators to understand how systems are working. Observability makes investigating and diagnosing problems easier; the more observable a system, the more tools we’ve made available to diagnose problems or understand behavior.
And some of my works:
- My talks on observability across the years.
- Structure and Layout in System Dashboard Design aims to condense the work of many other sources into practical advice for how to make great dashboards.
- The CASE Method: Better Monitoring for Humans aims to give a vendor-agnostic, manual-if-needed process for controlling alert fatigue and measuring value.
I’ve not read all of these yet, but I’ve seen them referenced enough to think they are worth a mention.
- Seeking SRE is a supplement to Google’s SRE book, aimed at how the SRE role can be applied to organizations that aren’t Google.