Observability

Through my career as a software engineer, manager, and executive, observability has been a constant theme.

Observability is more than monitoring and charts. It’s a collection of techniques and tools that deepen our understanding of complex systems, not just when things go wrong, but all the time. The goal is to give operators genuine insight into how a system behaves: what changed, what’s slow, what’s correlated. Good observability improves the ergonomics of that understanding, making systems less surprising to the people who run them.

Open Source Work

Veneur, which Stripe uses to power metrics and traces. It is efficient and can approximate “global” histogram and timer percentiles using Ted Dunning’s t-digest, and set cardinalities using HyperLogLogs.
Censorinus is a JVM StatsD client (by way of Scala, but with no other dependencies) with support for both StatsD and DogStatsD.
Dozens of contributions to Datadog’s monitoring agent and Integrations SDK.
Perl charting library Chart::Clicker, with love to Infinity Interactive for being so supportive and Stevan Little for being so inspirational.
SignalFx Terraform Provider, which I created and maintained.

Professional Work

After joining Twitter in 2012 I quickly found my calling in the Observability team. My Observability at Twitter post was the first mention of “observability” in this context. (The team existed before me; I was just the one to share it outside of Twitter!)

Upon joining Stripe in 2015 I created and led an observability team and worked to change Stripe’s culture such that observing our systems was a core concern. I led the creation of an entirely new observability stack with minimal interruption, managed and changed vendors a few times, and contributed to large improvements in reliability and confidence at Stripe through both observability tooling and incident process.

In 2019 I joined SignalFx as a Technical Director, functioning as a Field CTO. My role was a mix of advocacy, customer engagement, and product improvement. Late in 2019 SignalFx was acquired by Splunk.

After Splunk I spent time at Jeli, working on incident analysis and learning. From there I founded Oilcan, where I spent several years building ergonomic on-call tooling aimed at making the lives of on-call engineers less miserable.

I’m now at Airbnb, working on infrastructure engineering. My focus includes large-scale reliability initiatives like SLOs — building the systems and culture that let engineers understand and trust what they’ve built.

I’m often asked by investors to share my thoughts on new or existing monitoring products, and I enjoy speaking about these tools with others, both to learn and to offer my perspective. I’ve also participated on customer advisory boards, representing my engineering teammates and learning about the challenges vendors face.

Writing

Observability Crash Course
The CASE Method: Better Monitoring for Humans
A 5-part series on dashboard design:

Speaking

I speak regularly at conferences across the country, promoting observability and thoughtful, empathetic operations.

Monitorama 2016:

Monitorama PDX 2016 - Cory Watson - Creating A Culture of Observability at Stripe from Monitorama on Vimeo.

Here are the slides if you prefer to flip through them rather than listen to me talk.

There are also versions of this talk from:

AWS Loft 2019:

I gave a talk at the New York AWS Loft office called “Demystifying Observability” for startups. It’s a combination of beginner info and practical advice for how observability can help you even when you’re just getting started.

Monitorama PDX 2019:

I had the pleasure of giving a 5-minute “vendor talk” at Monitorama PDX 2019. These talks are sometimes product pitches, but more often they are just a chance to speak about something important or interesting for the attendees and maybe mention your product. I decided to talk about how to think about observability tooling, inspired by John Allspaw’s “An Open Letter to Monitoring/Metrics/Alerting Companies”.

Monitorama Baltimore 2019:

I spoke about a Dashboard Renaissance, or techniques and processes for making dashboards a more helpful part of your observability and monitoring work. Slides are here.

KubeCon US 2019:

This talk was originally conceived as a set of lessons from my move from personally purchasing tools to a job at a vendor, where I work with dozens of customers making the same decisions. It covers 6 different ways to improve your stance on observability from a social perspective.

SREcon20 Americas:

Incidents are an amazing source of education, but we often fail to incorporate the findings into our observability tooling. This talk provides methods for doing just that, with a bit of help from my friends at Jeli.

Podcasts

In March of 2017 I spoke with Software Engineering Daily about my observability work at Stripe.
In April of 2018 I spoke with Software Engineering Daily again about observability pipelines.
In April of 2019 I spoke with Real World DevOps about going from a customer to a vendor and some observability stuff.

Other

Wrote a guest post for Honeycomb talking about making people awesome with instrumentation.
Reviewed posts and early books in the observability space, including a technical review of Cindy Sridharan’s Distributed Systems Observability.