I wrote this post in May 2019, shortly after changing roles into the Red Hat Service Delivery SRE team, which is the group building Red Hat’s -as-a-Service tech, at the time largely OpenShift itself (but since has grown wildly). The external version you’re reading has been very lightly edited for internal stuff not relevant to the overall message.
I’ve found myself going back to it often to remind myself of my core own core values, sharing and re-sharing it internally when I feel we get off course, and also putting it in front of new hires so they understand where I am coming from. As expected, some of my thinking has evolved in the intervening years, but I feel the content has aged fairly well.
Service Delivery: Reliable, Secure, Performant, and…Boring?
They have a saying in the American sport of baseball: “a well-umpired game is one in which you don’t even remember seeing the umpires”. In baseball, it’s the umpire’s job to make all aspects of a game reliable (predictable, that is – with little variance), secure (dependable and without prejudice), performant (keep the pace of play) and otherwise as “boring” as possible.
What could this mean from the viewpoint of Red Hat Service Delivery? How does this map to running managed services? As employees running a managed service, we’re the umpires. It’s our job to offer dependable service that meets our personal standards as well as contractual SLAs. I don’t know about you, but my standards for a service, e.g. one I’d be happy to pay for, is higher than the SLA from the company. In other words, we should aim to exceed our customer’s expectations.
Building mature services takes time. It also takes teamwork and communication. We should endeavor to have the same levels of integrity and trust in each other that we hope to earn from our customers. It’s what we do under pressure that helps build trust. And while we should strive for perfection, it is much more important that if we fall down seven times, we get up eight.
How do we do this?
Above all, it means integrity, dependability and trustworthiness. That our customers can trust we’ll do the right thing, even when no one’s looking. And when we slip up, we work diligently to fix it so it’ll never happen again.
How do we achieve this?
A silly (yet applicable) example from my personal life: I’ve been a customer of Google Music for many years. Because they’ve let the app languish, I decided to look for alternatives. I know that Spotify is very popular. However, as a smaller company (and in light of the depth of experience that Google has running managed services) I thought Spotify’s service might not live up to my expectations. I was wrong. Because Spotify is so focused on their one product (streaming music), and chiefly concerned with user experience and design, it felt like they cared about me as a user. Spotify exceeded my expectations not only with their focus on me, but their service is rock solid too (which I did not anticipate!).
Let’s be like Spotify, and exceed our customer’s expectations.
We’ve got demands coming from customers, internal product teams, the shifting sands of upstream and our competitive landscape itself. How do we exceed customer expectations, and reliably serve those many masters?
I think there’s a tractable approach – these groups do have something in common.
First, they want a reliable, secure and performant substrate upon which to build their product or business. They want our managed services to be boring – but not TOO boring. We need to make sure to fix bugs before adding new features. This does not mean we stop adding new features – it means we add them responsibly. Together we build the CI/CD workflows, coding practices and monitoring that de-risks the rollout of new features.
Let’s get away from the waterfall mindset, and ship code hourly. Let’s build the confidence and scaffolding that lets us test things in production.
Code Quality and Testing
How do we build that confidence? Some will come with experience, yes. But the majority needs to come from thoughtfully designed implementations, well-written, defensive coding practices, pre-mortems, and significant, thorough, end to end test coverage. These are the attributes that Red Hat preaches to our customers who engage in the Red Hat Open Innovation Labs.
Let’s hold each other to these standards – clearly communicated constructive criticism is a gift.
With the right testing in place, we can establish backpressure mechanisms back into our upstreams and infrastructure providers. A well-honed and managed CI pipeline for each component, along with a comprehensive end-to-end test suite should be at the forefront of our minds when writing code or fixing production issues. Pre-mortems leading to new tests. Post-mortems leading to new tests. Rinse and repeat.
Observability, Contributing Code, and Competition
An observable system is one that can be easily understood. What has it done, what is it doing, what will it do. Observable systems have the scaffolding built in to make it easy for anyone to answer these questions (not just the author of the code). We’ll be more productive, from developers through support, if our systems can be easily understood. Developing systems that interact responsibly with each other takes Systems Thinking. Watch this 60-second video for a quick overview.
How does this tie back to contributing code? If you understand just a bit more of the entire system than the code you’re responsible for, the code you’re responsible for will improve. You will begin to know what counters need to be put in place, what tracepoints will be helpful, how to create useful (actionable) logs that make your system understandable. We’ve all supported or debugged a poorly instrumented program.
Let’s build observable systems and dashboards that help enable higher service levels, reduce MTTR and exceed our customer’s expectations.
One of my favorite quotes shouldn’t surprise you (I did just spend a decade working on performance):
Let’s be scientists.
Another upside to observable systems – we can’t let competitors know our customers better than we do. Observable systems lead to “digital exhaust” that can drive customer happiness in a variety of ways: proactively addressing bugs, prioritizing feature enhancements, and driving down our cost-to-serve through automating away toil are just the start. Our platforms, software, and customers are trying to telegraph us what to do – all we need to do is build systems that know how to listen.
Let’s build systems that tell us which features we should dark-launch first!
Finally, we are encouraged by the engineering management of adjacent and dependent teams to provide feedback as well as contributing code-fixes directly. If you’ve identified a bug and can fix it, just go fix it. Feel empowered (and highly encouraged) to do so.
How do we learn?
First – if you don’t have time to improve how you work, you have a shutdown problem. You will not be able to sustain or grow in any meaningful way. I think we’ve all eschewed strategic improvements for tactical ones and sometimes regretted it. While there is a time and place for “just fix it”, a well-honed and consistent workflow pays off in multiples.
We need to anticipate growth. This means turning on new services becomes second nature. We need to have sufficient confidence in our processes that change becomes good, something we are all comfortable with, and that creates an environment and mindset to help us move fast(er).
The GitOps workflow embodied in [our internal] SRE contract is how we do that in a supportable way, at scale.
Last “New Goal”: empower individual teams to decide when to stop doing something. If you can back it up by providing supporting metrics, (e.g. deprecate feature-X because it is never used), that should be an easy discussion with PM, engineering and BU teams.
Be on the lookout for these opportunities. Not only to reduce overhead and deliver simpler, more supportable systems, but when we stop doing something, we can start doing something else. Something a little cooler.
Our first Pre-Mortem: what we can’t succeed without
Let’s game out our first pre-mortem. What are some things that, if we don’t do, we will fail?
- Predictable release timing, non-waterfall
- Extremely well-tested code
- Upstream integration testing and release gates
- Participation in product development
- Observability – EVERY SERVICE gets a dashboard (or two) and alerts
- Low overhead, mesh communications between teams
- Approach software development with SLO’s top of mind
- Commit to eliminate a bit of toil each week with good engineering
- Surprises always happen, but we should have clear product requirements to make sure we’re going in the right direction
- Invest in our infrastructure – building features is awesome, but we must make sure we also invest in our infrastructure, build a solid one and improve it each sprint
- Clear understanding of how each of the many components in the system are supported. When it breaks, who is page, responsible for triage, fixing it, etc.
To close out – have you ever had a leak in your roof? Building shiny features is like installing solar panels to stop the leak. Solar panels themselves are a great idea…but they won’t fix your roof.