Observability
Throughout my career as a software engineer, manager, and executive, a common theme has been observability.
Observability is more than monitoring and charts. Observability is a collection of techniques and tools that aim to improve understanding of complex systems. Highly observable systems should have improved ergonomics for operators, allowing them to more quickly grasp the impact of changes and the contributors to incidents. In other words, if you don’t have sensors — metrics, logs, etc — that help you understand how your service is working then you can’t tell if things are going wrong!
Open Source Work
- Veneur, which Stripe uses to power metrics and traces. Veneur brings efficient performance and the capability to approximate “global” histogram and timer percentiles using Ted Dunning’s t-digest approximate histograms, and to approximate “global” sets using HyperLogLogs.
- Censorinus is a JVM — by way of Scala, but with no other dependencies — *StatsD client with support for both StatsD and DogStatsD.
- Dozens of contributions to Datadog’s monitoring agent and Integrations SDK.
- Perl charting library Chart::Clicker, with love to Infinity Interactive for being so supportive and Stevan Little for being so inspirational.
- SignalFx Terraform Provider which I created and maintained.
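To illustrate why the “global” percentiles mentioned in the Veneur item need a mergeable summary, here is a minimal sketch. It does not use t-digest or any Veneur code: it swaps in a plain fixed-width bucket histogram as a stand-in, and every name in it (`BUCKET_WIDTH_MS`, `to_histogram`, `merge`, and so on) is hypothetical. The point is only that per-host percentiles cannot be averaged into a fleet-wide percentile, while per-host summaries can be merged first and queried afterwards.

```python
BUCKET_WIDTH_MS = 5   # hypothetical bucket width, chosen for this example
NUM_BUCKETS = 400     # covers 0-2000 ms; anything slower lands in the last bucket


def exact_percentile(values, pct):
    """Nearest-rank percentile of raw values; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division, 1-indexed
    return ordered[rank - 1]


def to_histogram(values):
    """Summarize raw timings as fixed-width bucket counts (a stand-in sketch)."""
    counts = [0] * NUM_BUCKETS
    for value in values:
        index = min(int(value // BUCKET_WIDTH_MS), NUM_BUCKETS - 1)
        counts[index] += 1
    return counts


def merge(left, right):
    """Merging two summaries is element-wise addition of their bucket counts."""
    return [a + b for a, b in zip(left, right)]


def histogram_percentile(counts, pct):
    """Estimate a percentile as the upper edge of the bucket holding that rank."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    target = (pct * total + 99) // 100
    seen = 0
    for index, count in enumerate(counts):
        seen += count
        if seen >= target:
            return (index + 1) * BUCKET_WIDTH_MS
    return NUM_BUCKETS * BUCKET_WIDTH_MS


if __name__ == "__main__":
    # Made-up data: a busy, fast host and a quiet, slow host.
    host_a = [10 + (i % 20) for i in range(1000)]
    host_b = [500 + 10 * (i % 50) for i in range(100)]

    averaged = (exact_percentile(host_a, 99) + exact_percentile(host_b, 99)) / 2
    merged = merge(to_histogram(host_a), to_histogram(host_b))

    print("true global p99:        ", exact_percentile(host_a + host_b, 99))
    print("average of per-host p99:", averaged)
    print("p99 from merged summary:", histogram_percentile(merged, 99))
```

A fixed-width histogram is the simplest mergeable summary to show, but it needs its bucket range chosen in advance. Structures such as t-digest are designed to avoid that trade-off while keeping the same merge-then-query shape.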
Professional Work
After joining Twitter in 2012 I quickly found my calling in the Observability team. My Observability at Twitter post was the first mention of “observability” in this context. (The team existed before me; I was just the one to share it outside of Twitter!)
Upon joining Stripe in 2015 I created and led an observability team and worked to change Stripe’s culture such that observing our systems was a core concern. I led the creation of an entirely new observability stack with minimal interruption, managed and changed vendors a few times, and contributed to large improvements in reliability and confidence at Stripe through both observability tooling and incident process.
In 2019 I joined SignalFx as a Technical Director. My role is a mix of advocacy, customer engagement, and product improvement. Late in 2019 SignalFx was acquired by Splunk.
I’m often asked by investors to discuss my thoughts on new or existing monitoring products, and I enjoy speaking about these tools with others both to learn and to share my thoughts. I’ve also participated on customer advisory boards, representing my engineering teammates and learning about the challenges vendors face.
Writing
- Observability Crash Course
- The CASE Method: Better Monitoring for Humans
- A 5-part series on dashboard design:
Speaking
I speak regularly at conferences across the country, promoting observability and thoughtful, empathetic operations.
Monitorama 2016:
Creating A Culture of Observability at Stripe (video from Monitorama on Vimeo).
Here are the slides if you prefer to flip through them rather than listen to me talk.
There are also versions of this talk from:
AWS Loft 2019:
I gave a talk for startups at the New York AWS Loft office called “Demystifying Observability”. It’s a combination of beginner info and practical advice on how observability can help you even when you’re just getting started.
Monitorama PDX 2019:
I had the pleasure of giving a 5-minute “vendor talk” at Monitorama PDX 2019. These talks are sometimes product pitches, but more often they are just a chance to speak about something important or interesting for the attendees and maybe mention your product. I decided to talk about how to think about observability tooling, inspired by John Allspaw’s “An Open Letter to Monitoring/Metrics/Alerting Companies”.
Monitorama Baltimore 2019:
I spoke about a Dashboard Renaissance, or techniques and processes for making dashboards a more helpful part of your observability and monitoring work. Slides are here.
KubeCon US 2019:
This talk was originally conceived as a set of lessons from my move from personally purchasing tools to my job at a vendor, where I work with dozens of customers making the same decisions. It covers 6 different ways to improve your stance on observability from a social perspective.
SREcon20 Americas:
Incidents are an amazing source of education, but we often fail to incorporate the findings into our observability tooling. This talk provides methods for doing just that, with a bit of help from my friends at Jeli.
Podcasts
- In March of 2017 I spoke with Software Engineering Daily about my observability work at Stripe.
- In April of 2018 I spoke with Software Engineering Daily again about observability pipelines.
- In April of 2019 I spoke with Real World DevOps about going from a customer to a vendor and some observability stuff.
Other
- Wrote a guest post for Honeycomb talking about making people awesome with instrumentation.
- Reviewing posts and early books in the observability space, such as a technical review of Cindy Sridharan’s Distributed Systems Observability.