What it’s like to work at Red Hat for 8 years…

I wrote this post in July 2017, and I just found it (4 years later) in my Drafts folder.  What a great journey down memory lane it was to read this today.  Ask 2017 me if he knew what he’d be into 4 years on, and…yeah.  Life moves pretty fast.

We have been doing a lot of hiring lately — I am lucky to be at such a company.  It feels like that’s all we’ve done over the time that I’ve been at Red Hat.  In every interview I am routinely asked what is it like to work at Red Hat. Mostly I’d pass on a few relevant anecdotes, and move on.

As I’ve just come up on my 8 year anniversary at Red Hat, I thought I would write some of this stuff down to explain more broadly what it’s like to work at Red Hat, more specifically in the few groups I’ve been in, more specifically my personal experience in those groups…

How did I get here?

In 2007, I met Erich Morisse at a RHCA training class at Red Hat’s Manhattan office.  Erich was already a Red Hatter, and somehow I ended up with his business card.  Fast forward a year or so, and…

I got married in 2008 in Long Island, NY.  Within 3 months of that, I applied at Red Hat and flew to Raleigh for the interview.  Within 4 months of that, my wife and I moved to Raleigh, and I started at Red Hat on July 20, 2009, as a Technical Account Manager in Global Support Services.  I was so excited that they’d have me that the 15% pay cut didn’t bother me.

Life as a TAM

I think there are still TAMs in Red Hat, but I won’t pretend to know what their life is like these days.  My experience was filled with learning, lots of pressure and lots of laughs.  We had a great group of TAMs…I am still in contact with a few, even 6+ years later.  As a TAM we are ultimately tasked with keeping Red Hat’s largest accounts happy with Red Hat (whatever the customer’s definition of happy is).  That can mean a variety of things.  I personally found that the best way to build and maintain a good relationship was to be onsite with the customer’s technical team as much as possible.  While that meant a lot of 6:00am flights, I think it ended up being worth it if only to build up the political capital necessary to survive some of the tickets I’ll describe below.

At the time, TAMs carried about 4-6 accounts, and those accounts largely came from the same vertical, whether it was government, military, national labs, animation studios, and my personal favorite FSI (financial services industry).  I gravitated towards the FSI TAMs for a few reasons:

  • They were the most technical
  • They had the most pressure
  • I felt I’d learn from them

I ended up moving to that sub-group and taking on some of the higher profile banks, stock exchanges and hedge funds as my accounts.  Supporting those accounts was very challenging for me.  I was definitely in over my head, but actually that is where I thrive.  For whatever reason, I naturally gravitate towards pressurized, stressful situations.  I think there was an experience at a previous job at a datacenter operator (where we were constantly under pressure) that made me learn how to focus under duress and and eventually crave pressure.

I’ll relay two stories from my time as a TAM that I will never forget.

  • 2010:  Onsite for a major securities exchange platform launch (moved from Solaris to RHEL).  This led to one of the nastiest multi-vendor trouble tickets I was ever on.  That ticket also introduced me to Doug Ledford (now one of the Infiniband stack maintainers) and Steven Rostedt (realtime kernel maintainer, sadly now over at VMware).  In retrospect I cam see how much I grew during the lifetime of that ticket.  I was getting access to some of the best folks in the world (who were also stumped).  Helping debug along with them was truly an honor.  I think we went through over 40 test kernels to ultimately fix it.
  • 2011:  A customer purchases a fleet of server gear that has buggy NICs in every aspect.  Firmware is terrible.  Drivers are not stable or performant.  While the hardware issues were not on my plate, certainly the drivers in the kernel that Red Hat was shipping were very much my responsibility.  In this situation, I made several trips out to the customer to ensure them that everything was being done to remedy the situation.  I knew this was a serious issue when each time out there I was presenting to higher and higher ranking management.  We worked with that vendor daily for quite a while.  They fixed bugs in both firmware and driver (upstream), Red Hat kernel folks backported those patches and we tested everything onsite.  I don’t know if we got to 40 kernels, but it was at least 20.  Plus a dozen or so firmware flashes across roomfuls of machines.  This scenario taught me:
    • I needed to up level my public speaking experience if I was going to be in rooms with highest levels of management.  To do this I joined local Toastmasters club along with another TAM.  That other TAM founded Red Hat’s own chapter of Toastmasters, and I was the first to speak at it.
    • I should get more hands on experience with high end hardware itself so that I could relate more to the customer’s Ops folks.  I ended up working with some gear loaned to me by Red Hat Performance team.  They always seemed to have the cool toys.
    • More about tc, qdiscs, network buffers, congestion algorithms and systemtap than I’d care to admit.

At time time, I felt like I barely survived.  But feedback I received was that I did manage to make the best of bad situations, and the customers are still customers so…mission accomplished.  I also became the team lead of the FSI TAMs, and began concentrating on cloning myself by writing documentation, building an onboarding curriculum and interviewing probably 3 people a week for a year.

Becoming a performance engineer

After working with those exchanges, I knew a thing or two about what their requirements were.  I got a kick out of system tuning, and wanted to take that to the next level.  My opportunity came in a very strange way.  Honestly, this is how it happened…I subscribed to as many internal technical mailing lists as I could.  Some were wide open and I began monitoring them closely to learn (I still do this).

One day a slide deck was sent out detailing FY12 plans for the performance team.  I noted buried towards the end of the deck that they planned on hiring.  So, I reached out to the director over there and we had about an hour long conversation as I paced nervously in my laundry room (it’s the only place I could hide from my screaming infants).  At the time, that team was based in Westford, MA.  I flew up there and did a round of interviews.  Within a few days, I was hired and planning my transition out of the support organization.

I believe what got me the job was that I had learned so much low level tracing, and debugging hackery while supporting the FSI sector that I ended up doing very similar work to what was being done on the performance team.  And that experience must have shone through.

Being a performance engineer

I remember my first project as a performance engineer:  help the KVM team to see if they could use ebtables to build anti-spoofing rules into our hypervisor product called Red Hat Enteprise Virtualization.  I remember thinking to myself…oh shit…what is RHEV?  What is ebtables?  I was under pressure again.  Good.  Something familiar, at least.  To help out the RHEV team I had to quickly learn all of the guts of both topics as well as build load/scale tests to prove out whether it would work or not.  I’ll skip to the punchline though…ebtables is abandonware, even 6 years ago.  No one cares to fix anything and it’s been on the guillotine for a long time.  Based on the issues encountered, I might have been the first (only?) person to really performance and scale test it.

This initial experience was not unlike most experiences on the performance team:

  • You generally have no clue what the next project will require, so you get very good at soaking up new material.
  • Don’t be surprised…you are likely the first person to performance or scale test a feature.  Get used to it.  Developers develop on their laptops.

Most of that is still true to this day — although as time went on, I learned to be more proactive and to engage not only with developers about what they’re working on, but also religiously reading LWN, attending conferences like LinuxCon and like I mentioned, subscribing to as many mailing lists as possible.

The biggest project (not for long) I had on this team was the initial bringup of RHEL7.  I look back with great fondness on the years 2012-2014 as I was able to see the construction of the world’s leading Linux distribution from a very unique vantage point:  working with the very people who “make RHEL feel like RHEL”.  That is … debating over kernel configs…backwards compatibility discussions…working with partners to align hardware roadmaps…GA/launch benchmark releases…can we do something like kSplice…will we reduce CONFIG_HZ.

This last bit brings me to the part of RHEL7 that I had the most to do with…timers.  As the vast majority of financial transactions happening on stock exchanges occur on RHEL, we had to pay very close attention to the lowest levels of performance.  Timers are an area only the smartest, bravest kernel developers fear to tread.  Our goal was to build NOHZ_FULL and test the hell out of it.  Nowadays we take this feature for granted in both the financial industry as well as telco where without nohz_full (I am told), all the worlds packets will be a few microseconds late.  And that is not good.

You can see some of my nohz_full work here (or read the RHEL docs on the subject, as I wrote those too).

While Red Hat was not my first job, I do consider Red Hat my first (job) love.  It is the first job I had that I’d call career-worthy, in that I could see myself working here for a while (there was plenty of work and the company was growing).

Service Delivery: Eight weeks in …

I wrote this post in May 2019, shortly after changing roles into the Red Hat Service Delivery SRE team, which is the group building Red Hat’s -as-a-Service tech, at the time largely OpenShift itself (but since has grown wildly). The external version you’re reading has been very lightly edited for internal stuff not relevant to the overall message.

I’ve found myself going back to it often to remind myself of my core own core values, sharing and re-sharing it internally when I feel we get off course, and also putting it in front of new hires so they understand where I am coming from. As expected, some of my thinking has evolved in the intervening years, but I feel the content has aged fairly well.

Service Delivery: Reliable, Secure, Performant, and…Boring?

They have a saying in the American sport of baseball: “a well-umpired game is one in which you don’t even remember seeing the umpires”. In baseball, it’s the umpire’s job to make all aspects of a game reliable (predictable, that is – with little variance), secure (dependable and without prejudice), performant (keep the pace of play) and otherwise as “boring” as possible.

What could this mean from the viewpoint of Red Hat Service Delivery? How does this map to running managed services? As employees running a managed service, we’re the umpires. It’s our job to offer dependable service that meets our personal standards as well as contractual SLAs. I don’t know about you, but my standards for a service, e.g. one I’d be happy to pay for, is higher than the SLA from the company. In other words, we should aim to exceed our customer’s expectations.

Building mature services takes time. It also takes teamwork and communication. We should endeavor to have the same levels of integrity and trust in each other that we hope to earn from our customers. It’s what we do under pressure that helps build trust. And while we should strive for perfection, it is much more important that if we fall down seven times, we get up eight.

Guiding Philosophies

How do we do this?

Above all, it means integrity, dependability and trustworthiness. That our customers can trust we’ll do the right thing, even when no one’s looking. And when we slip up, we work diligently to fix it so it’ll never happen again.

How do we achieve this?

A silly (yet applicable) example from my personal life: I’ve been a customer of Google Music for many years. Because they’ve let the app languish, I decided to look for alternatives. I know that Spotify is very popular. However, as a smaller company (and in light of the depth of experience that Google has running managed services) I thought Spotify’s service might not live up to my expectations. I was wrong. Because Spotify is so focused on their one product (streaming music), and chiefly concerned with user experience and design, it felt like they cared about me as a user. Spotify exceeded my expectations not only with their focus on me, but their service is rock solid too (which I did not anticipate!).

Let’s be like Spotify, and exceed our customer’s expectations.

We’ve got demands coming from customers, internal product teams, the shifting sands of upstream and our competitive landscape itself. How do we exceed customer expectations, and reliably serve those many masters?

I think there’s a tractable approach – these groups do have something in common.

First, they want a reliable, secure and performant substrate upon which to build their product or business. They want our managed services to be boring – but not TOO boring. We need to make sure to fix bugs before adding new features. This does not mean we stop adding new features – it means we add them responsibly. Together we build the CI/CD workflows, coding practices and monitoring that de-risks the rollout of new features.

Let’s get away from the waterfall mindset, and ship code hourly. Let’s build the confidence and scaffolding that lets us test things in production.

Code Quality and Testing

How do we build that confidence? Some will come with experience, yes. But the majority needs to come from thoughtfully designed implementations, well-written, defensive coding practices, pre-mortems, and significant, thorough, end to end test coverage. These are the attributes that Red Hat preaches to our customers who engage in the Red Hat Open Innovation Labs.

Let’s hold each other to these standards – clearly communicated constructive criticism is a gift.

With the right testing in place, we can establish backpressure mechanisms back into our upstreams and infrastructure providers. A well-honed and managed CI pipeline for each component, along with a comprehensive end-to-end test suite should be at the forefront of our minds when writing code or fixing production issues. Pre-mortems leading to new tests. Post-mortems leading to new tests. Rinse and repeat.

Observability, Contributing Code, and Competition

An observable system is one that can be easily understood. What has it done, what is it doing, what will it do. Observable systems have the scaffolding built in to make it easy for anyone to answer these questions (not just the author of the code). We’ll be more productive, from developers through support, if our systems can be easily understood. Developing systems that interact responsibly with each other takes Systems Thinking. Watch this 60-second video for a quick overview.

How does this tie back to contributing code? If you understand just a bit more of the entire system than the code you’re responsible for, the code you’re responsible for will improve. You will begin to know what counters need to be put in place, what tracepoints will be helpful, how to create useful (actionable) logs that make your system understandable. We’ve all supported or debugged a poorly instrumented program.

Let’s build observable systems and dashboards that help enable higher service levels, reduce MTTR and exceed our customer’s expectations.

One of my favorite quotes shouldn’t surprise you (I did just spend a decade working on performance):

“I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.”

Lord Kelvin

Let’s be scientists.

Another upside to observable systems – we can’t let competitors know our customers better than we do. Observable systems lead to “digital exhaust” that can drive customer happiness in a variety of ways: proactively addressing bugs, prioritizing feature enhancements, and driving down our cost-to-serve through automating away toil are just the start. Our platforms, software, and customers are trying to telegraph us what to do – all we need to do is build systems that know how to listen.

Let’s build systems that tell us which features we should dark-launch first!

Finally, we are encouraged by the engineering management of adjacent and dependent teams to provide feedback as well as contributing code-fixes directly. If you’ve identified a bug and can fix it, just go fix it. Feel empowered (and highly encouraged) to do so.

How do we learn?

First – if you don’t have time to improve how you work, you have a shutdown problem. You will not be able to sustain or grow in any meaningful way. I think we’ve all eschewed strategic improvements for tactical ones and sometimes regretted it. While there is a time and place for “just fix it”, a well-honed and consistent workflow pays off in multiples.

New Goals

We need to anticipate growth. This means turning on new services becomes second nature. We need to have sufficient confidence in our processes that change becomes good, something we are all comfortable with, and that creates an environment and mindset to help us move fast(er).

The GitOps workflow embodied in [our internal] SRE contract is how we do that in a supportable way, at scale.

Last “New Goal”: empower individual teams to decide when to stop doing something. If you can back it up by providing supporting metrics, (e.g. deprecate feature-X because it is never used), that should be an easy discussion with PM, engineering and BU teams.

Be on the lookout for these opportunities. Not only to reduce overhead and deliver simpler, more supportable systems, but when we stop doing something, we can start doing something else. Something a little cooler.

Our first Pre-Mortem: what we can’t succeed without

Let’s game out our first pre-mortem. What are some things that, if we don’t do, we will fail?

  • Predictable release timing, non-waterfall
  • Extremely well-tested code
  • Upstream integration testing and release gates
  • Participation in product development
  • Observability – EVERY SERVICE gets a dashboard (or two) and alerts
  • Low overhead, mesh communications between teams
  • Approach software development with SLO’s top of mind
  • Commit to eliminate a bit of toil each week with good engineering
  • Surprises always happen, but we should have clear product requirements to make sure we’re going in the right direction
  • Invest in our infrastructure – building features is awesome, but we must make sure we also invest in our infrastructure, build a solid one and improve it each sprint
  • Clear understanding of how each of the many components in the system are supported. When it breaks, who is page, responsible for triage, fixing it, etc.

To close out – have you ever had a leak in your roof? Building shiny features is like installing solar panels to stop the leak. Solar panels themselves are a great idea…but they won’t fix your roof.