Through a career spanning over twenty years at ISPs, eCommerce shops, and technology giants like Twitter and Stripe, I have learned that complex systems will inevitably fail. To repair and expand these systems, we need humans! The goal is to create adaptive capacity: to be resilient.
After joining Twitter in 2012 as one of the first Site Reliability Engineers (SREs), I leaned into observability. While I still consider this work important, I came to the conclusion that charts don’t solve problems; people do. This led me to invest more in learning about resilience engineering and adaptive capacity.
In 2019 I joined SignalFx, later acquired by Splunk, and researched resilience engineering and its precursors to inform the design of products, organizations, and tooling. In 2021 I joined Jeli to work on incident analysis so that organizations can more easily learn from their work.
- A three-part series on automation:
- Contributor to Jeli’s Incident Analysis 101 series in Putting It All Together.
- I wrote “How to turn an engineering incident into an opportunity” for LeadDev.
- Explained micro-learning opportunities for practical, daily improvement.
At RubyConf 2021 I spoke about finding inspiration for resilience in other industries and settings.
- In November of 2021 I spoke with Software Misadventures about failure and success in software engineering.
- Technical Reviewer for Increment Issue 16: Reliability (February 2021) and Increment Issue 17: Containers (May 2021).
Having spent most of my career on call, I believe that organizations can greatly improve the happiness and effectiveness of employees and customers by investing in resilience!