Announcing Reliability Nightmares (the SRE coloring book)

“We often think of movements as starting with a call to action. But movement research suggests that they actually start with emotion — a diffuse dissatisfaction with the status quo and a broad sense that the current institutions and power structures of the society will not address the problem. This brewing discontent turns into a movement when a voice arises that provides a positive vision and a path forward that’s within the power of the crowd.” — Harvard Business Review

The idea behind SIG-SRE started just like that — with emotion, and knowledge that the status quo would not produce the outcomes that Red Hat needs, in the timeframe it needs them, to become competitive in the managed services business.

In addition to the fun video we created for the kickoff, the team has put together a coloring book to highlight how SREs think and how the practice of SRE impacts services, using a Kitchen Nightmares restaurant analogy.

The coloring book covers five principal aspects of SRE and provides a self-assessment scorecard for readers to evaluate themselves:

  • Observability
  • Safety and self-healing
  • Scalability
  • Shifting left
  • Zero downtime

Like other Red Hat coloring books, we’ll use it in its printed form at trade shows, marketing will do their thing, and we’ll use it as an internal tool for facilitating conversations around building operable software.

Have a look and let us know what you think. Feel free to use it at your company!

Click to access red-hat-sre-coloring-book.pdf

What it’s like to work at Red Hat for 8 years…

I wrote this post in July 2017, and I just found it (4 years later) in my Drafts folder.  What a great journey down memory lane it was to read this today.  Ask 2017 me if he knew what he’d be into 4 years on, and…yeah.  Life moves pretty fast.

We have been doing a lot of hiring lately — I am lucky to be at such a company.  It feels like that’s all we’ve done over the time that I’ve been at Red Hat.  In every interview I am routinely asked what it’s like to work at Red Hat.  Mostly I’d pass on a few relevant anecdotes and move on.

As I’ve just come up on my 8 year anniversary at Red Hat, I thought I would write some of this stuff down to explain more broadly what it’s like to work at Red Hat, more specifically in the few groups I’ve been in, more specifically my personal experience in those groups…

How did I get here?

In 2007, I met Erich Morisse at a RHCA training class at Red Hat’s Manhattan office.  Erich was already a Red Hatter, and somehow I ended up with his business card.  Fast forward a year or so, and…

I got married in 2008 in Long Island, NY.  Within 3 months of that, I applied at Red Hat and flew to Raleigh for the interview.  Within 4 months of that, my wife and I moved to Raleigh, and I started at Red Hat on July 20, 2009, as a Technical Account Manager in Global Support Services.  I was so excited that they’d have me that the 15% pay cut didn’t bother me.

Life as a TAM

I think there are still TAMs at Red Hat, but I won’t pretend to know what their life is like these days.  My experience was filled with learning, lots of pressure and lots of laughs.  We had a great group of TAMs…I am still in contact with a few, even 6+ years later.  As a TAM, you are ultimately tasked with keeping Red Hat’s largest accounts happy with Red Hat (whatever the customer’s definition of happy is).  That can mean a variety of things.  I personally found that the best way to build and maintain a good relationship was to be onsite with the customer’s technical team as much as possible.  While that meant a lot of 6:00am flights, I think it ended up being worth it, if only to build up the political capital necessary to survive some of the tickets I’ll describe below.

At the time, TAMs carried about 4-6 accounts, and those accounts largely came from the same vertical, whether government, military, national labs, animation studios, or my personal favorite, FSI (financial services industry).  I gravitated towards the FSI TAMs for a few reasons:

  • They were the most technical
  • They had the most pressure
  • I felt I’d learn from them

I ended up moving to that sub-group and taking on some of the higher profile banks, stock exchanges and hedge funds as my accounts.  Supporting those accounts was very challenging for me.  I was definitely in over my head, but actually that is where I thrive.  For whatever reason, I naturally gravitate towards pressurized, stressful situations.  I think there was an experience at a previous job at a datacenter operator (where we were constantly under pressure) that taught me how to focus under duress and eventually crave pressure.

I’ll relay two stories from my time as a TAM that I will never forget.

  • 2010:  Onsite for a major securities exchange platform launch (moving from Solaris to RHEL).  This led to one of the nastiest multi-vendor trouble tickets I was ever on.  That ticket also introduced me to Doug Ledford (now one of the InfiniBand stack maintainers) and Steven Rostedt (realtime kernel maintainer, sadly now over at VMware).  In retrospect I can see how much I grew during the lifetime of that ticket.  I was getting access to some of the best folks in the world (who were also stumped).  Helping debug along with them was truly an honor.  I think we went through over 40 test kernels to ultimately fix it.
  • 2011:  A customer purchased a fleet of server gear with NICs that were buggy in every respect.  Firmware was terrible.  Drivers were not stable or performant.  While the hardware issues were not on my plate, the drivers in the kernel that Red Hat was shipping certainly were my responsibility.  In this situation, I made several trips out to the customer to assure them that everything was being done to remedy the situation.  I knew this was a serious issue when each time out there I was presenting to higher and higher ranking management.  We worked with that vendor daily for quite a while.  They fixed bugs in both firmware and driver (upstream), Red Hat kernel folks backported those patches and we tested everything onsite.  I don’t know if we got to 40 kernels, but it was at least 20.  Plus a dozen or so firmware flashes across roomfuls of machines.  This scenario taught me:
    • I needed to up-level my public speaking skills if I was going to be in rooms with the highest levels of management.  To do this I joined a local Toastmasters club along with another TAM.  That other TAM founded Red Hat’s own chapter of Toastmasters, and I was the first to speak at it.
    • I should get more hands-on experience with high-end hardware itself so that I could relate more to the customer’s Ops folks.  I ended up working with some gear loaned to me by the Red Hat Performance team.  They always seemed to have the cool toys.
    • More about tc, qdiscs, network buffers, congestion algorithms and systemtap than I’d care to admit.
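
For the curious, the read-only side of that learning can be sketched with a few inspection commands.  This is just a hedged illustration (eth0 is a placeholder interface name, not any customer’s actual setup):

```shell
# Hypothetical inspection sketch; eth0 is a placeholder interface name.
tc -s qdisc show dev eth0                   # qdisc layout plus queue/drop counters
sysctl net.ipv4.tcp_congestion_control      # which congestion algorithm is active
sysctl net.core.rmem_max net.core.wmem_max  # socket buffer ceilings
```

Those three alone explain a surprising number of latency mysteries on trading systems.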

At the time, I felt like I barely survived.  But the feedback I received was that I did manage to make the best of bad situations, and the customers are still customers, so…mission accomplished.  I also became the team lead of the FSI TAMs, and began concentrating on cloning myself by writing documentation, building an onboarding curriculum and interviewing probably 3 people a week for a year.

Becoming a performance engineer

After working with those exchanges, I knew a thing or two about what their requirements were.  I got a kick out of system tuning, and wanted to take that to the next level.  My opportunity came in a very strange way.  Honestly, this is how it happened…I subscribed to as many internal technical mailing lists as I could.  Some were wide open and I began monitoring them closely to learn (I still do this).

One day a slide deck was sent out detailing FY12 plans for the performance team.  I noted, buried towards the end of the deck, that they planned on hiring.  So, I reached out to the director over there and we had about an hour-long conversation as I paced nervously in my laundry room (it’s the only place I could hide from my screaming infants).  At the time, that team was based in Westford, MA.  I flew up there and did a round of interviews.  Within a few days, I was hired and planning my transition out of the support organization.

I believe what got me the job was that I had learned so much low-level tracing and debugging hackery while supporting the FSI sector that I ended up doing very similar work to what was being done on the performance team.  And that experience must have shone through.

Being a performance engineer

I remember my first project as a performance engineer:  help the KVM team see if they could use ebtables to build anti-spoofing rules into our hypervisor product, Red Hat Enterprise Virtualization.  I remember thinking to myself…oh shit…what is RHEV?  What is ebtables?  I was under pressure again.  Good.  Something familiar, at least.  To help out the RHEV team I had to quickly learn the guts of both topics, as well as build load/scale tests to prove out whether it would work or not.  I’ll skip to the punchline though…ebtables was abandonware, even 6 years ago.  No one cared to fix anything and it had been on the guillotine for a long time.  Based on the issues encountered, I might have been the first (only?) person to really performance- and scale-test it.
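
To give a flavor of what was under test, an anti-spoofing rule set in ebtables looks roughly like this.  This is a sketch from memory, not the actual RHEV rules; the interface name, MAC and IP are all invented:

```shell
# Hypothetical per-VM anti-spoofing sketch (vnet0, the MAC and the IP are made up).
# Create a per-VM chain whose policy drops anything unrecognized:
ebtables -N VM1
ebtables -P VM1 DROP
# Send all frames arriving from the VM's tap device through that chain:
ebtables -A FORWARD -i vnet0 -j VM1
# Only allow IPv4 traffic from the VM's assigned MAC and IP:
ebtables -A VM1 -p IPv4 -s 52:54:00:12:34:56 --ip-src 192.168.122.10 -j ACCEPT
```

The load/scale question was what happens when you multiply rule sets like that across hundreds of VMs per hypervisor.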

This initial experience was not unlike most experiences on the performance team:

  • You generally have no clue what the next project will require, so you get very good at soaking up new material.
  • Don’t be surprised…you are likely the first person to performance or scale test a feature.  Get used to it.  Developers develop on their laptops.

Most of that is still true to this day — although as time went on, I learned to be more proactive and to engage not only with developers about what they’re working on, but also to religiously read LWN, attend conferences like LinuxCon and, like I mentioned, subscribe to as many mailing lists as possible.

The biggest project (not for long) I had on this team was the initial bringup of RHEL7.  I look back with great fondness on the years 2012-2014 as I was able to see the construction of the world’s leading Linux distribution from a very unique vantage point:  working with the very people who “make RHEL feel like RHEL”.  That is … debating over kernel configs…backwards compatibility discussions…working with partners to align hardware roadmaps…GA/launch benchmark releases…can we do something like kSplice…will we reduce CONFIG_HZ.

This last bit brings me to the part of RHEL7 that I had the most to do with…timers.  As the vast majority of financial transactions happening on stock exchanges occur on RHEL, we had to pay very close attention to the lowest levels of performance.  Timers are an area where even the smartest, bravest kernel developers fear to tread.  Our goal was to build NOHZ_FULL and test the hell out of it.  Nowadays we take this feature for granted in both the financial industry as well as telco, where without nohz_full (I am told) all the world’s packets will be a few microseconds late.  And that is not good.

You can see some of my nohz_full work here (or read the RHEL docs on the subject, as I wrote those too).
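
As a rough illustration of what using the feature looks like (the CPU list here is made up; see the RHEL tuning docs for the real procedure), carving out tickless CPUs is a matter of boot parameters plus verification:

```shell
# Hypothetical example: keep CPUs 2-15 tickless for latency-critical work.
# Boot parameters (appended to the kernel command line, e.g. via GRUB):
#   nohz_full=2-15 rcu_nocbs=2-15
# After a reboot, the kernel reports which CPUs are running tickless:
cat /sys/devices/system/cpu/nohz_full
cat /proc/cmdline
```

The application side then pins its latency-critical threads to those CPUs, which is where the real tuning work begins.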

While Red Hat was not my first job, I do consider Red Hat my first (job) love.  It is the first job I had that I’d call career-worthy, in that I could see myself working here for a while (there was plenty of work and the company was growing).

Tweaking my webcam setup (Logitech C930e, Fedora Linux, v4l)

I came really close to making some large purchases to improve my video call situation, and I may still do so, but I did spend some time and found a few quick wins that may help others:

  1. I use Fedora as my workstation.  Won’t change it.
  2. I have a Logitech C930e.  Logitech doesn’t publish anything on how to tweak the camera on Linux.  Figure out how to tweak it.
  3. I like (love) working in the dark.  So I never have any lights on.  That has to change.
  4. I have a window behind me in my home office.  Shutting the blinds is not enough.  Repositioning my desk won’t work in the space I’ve got.  Get a blackout curtain.
  5. My webcam is sitting on top of my monitor, dead center.  This makes it really awkward to look directly at.  It’s about 8″ above my eye-line.  I don’t think I’m going to change this. My eyeline has to remain at the center of the monitor or I get neck pain.

Here are the tweaks I made that do seem to have improved things:

  1. dnf install v4l2ucp v4l-utils
  2. v4l2-ctl --set-ctrl zoom_absolute=125 # this helps with the "know your frame" part; this camera has a really wide FoV, so this shrinks it down a bit.
  3. v4l2-ctl --set-ctrl tilt_absolute=-36000 # this helps tilt the center of the camera frame down towards where I'm sitting (8" below camera).
  4. v4l2-ctl --set-ctrl sharpness=150 # This seemed to help the most.  I tried a bunch of values and 150 is best for my office.
  5. Lighting:  Instead of having my desk lamp illuminate my keyboard, turn it 180 degrees to bounce off the white ceiling.  Big improvement.
  6. Lighting:  You can’t work in the dark anymore.
  7. Auto-focus:  I have a TV just to my right.  When whatever’s on changes brightness, it makes the camera autofocus freak out.  I typically keep the TV muted.  Now I’ll pause it while on calls.
  8. Microphone:  I have an external USB mic (Blue Yeti).  Turns out I had it in the wrong position relative to how I’m sitting.  Thanks to a co-worker’s “webcam basics” slides for that tip (confirmed in the Blue Yeti docs).
  9. Despite that, after recording a video meeting of just myself I still didn’t like how the audio turned out.  So I bought an inexpensive lavalier microphone from Amazon, figuring it was worth a try.
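
To avoid retyping the v4l2-ctl tweaks after every unplug/replug, they can be wrapped in a tiny script.  This is a hypothetical helper of my own (the device path and the echo-based dry run are my assumptions, not anything Logitech ships); by default it only prints the commands, and setting DRY_RUN=0 would actually program the camera:

```shell
# Hypothetical wrapper around the v4l2-ctl tweaks above.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 runs it.
DEV="${DEV:-/dev/video0}"
DRY_RUN="${DRY_RUN:-1}"

apply() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "v4l2-ctl -d $DEV --set-ctrl $1"
    else
        v4l2-ctl -d "$DEV" --set-ctrl "$1"
    fi
}

apply zoom_absolute=125     # shrink the very wide FoV
apply tilt_absolute=-36000  # aim the frame center down toward eye level
apply sharpness=150         # the value that worked best in my office
```

A udev rule could call this on hotplug, but running it by hand after a replug works fine too.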

One thing I cannot figure out how to do is bokeh.  I think that remains a gap between what I’ve got now and higher end gear.

Red Hat SRE 2020 Goals and Projects (and Hiring!)

Hey all, happy new year!

Been a quarter or so since my last post :-/ Wanted to share some updated info about the Service Delivery SRE team at Red Hat for 2020!

Some of our top level goals:
  • Improve observability – tracing, log analysis
  • Improve reliability – load shedding, autoscaling
  • Launch a boatload of features
  • Establish mathematically provable release criteria
  • Increase capacity planning and demand forecasting for production services
  • Widen availability to additional regions and cloud providers (have you built and supported production services on GCP?)

We’ve got several openings. They’re all REMOTE-FRIENDLY! Don’t worry about the specific titles – they’re a byproduct of how RH backend systems work.

If you think you check the majority of boxes on the job posting, and ESPECIALLY if you’ve done any of these things in the past…please ping us.  We’re actively hiring into all of these roles ASAP. 

So come have some fun with us.  Build and run cool shit.  Be at the forefront of operationalizing OpenShift 4.  Develop in go.  Release continuously.

Openings as of 19-Jan-2020

China (AWS China knowledge desired)

  • Senior SRE

North America, Israel

  • Principal SRE (OpenShift SRE)
  • Senior SRE
  • Senior SRE (DevOps and CI)
  • Senior Security Software Engineer

Maybe Stop Sending Me Emails about Performance :-)

[I’ve been meaning to write this post for several months]

Earlier this year I changed roles within Red Hat.  My new role is “OpenShift SaaS Architect”, and organizationally is part of Red Hat Service Delivery.

Service Delivery encompasses:

Basically, if you’ve had any interaction with OpenShift 4, you’ve likely consumed those services.

I’d been in my previous role for 7 years, and celebrated my 10th anniversary at Red Hat by being acquired by Big Blue.  My previous team (Red Hat Performance and Scale) afforded me endless technical challenges, opportunities to travel, present, help shape product and build engineering teams from the ground up.  Perhaps most importantly, I had the opportunity to mentor as many Red Hatters as I possibly could.

Red Hat Service Delivery allows me to broaden my technical and architecture skill set to areas outside of performance, scale and optimization, while letting me apply the many hard-fought lessons from prior chapters in my career.

Hopefully $subject makes a bit more sense now.  Onward!

Building Grafana from source on Fedora

Here are the official docs for building Grafana from source.  And below are my notes on how to build Grafana, starting from a clean Fedora 27 Cloud image.

# Install Dependencies
curl > /etc/yum.repos.d/yarn.repo
sudo yum install golang yarn rubygems ruby-devel redhat-rpm-config rpm-build git -y
gem install fpm
sudo yarn install --pure-lockfile
npm install -g yarn && yarn install

Setup the go environment.

# go environment
mkdir ~/go
export GOPATH=~/go
export PATH=$PATH:$(go env GOPATH)/bin

Download the various repositories required to build.  Here you could also clone your fork/branch of Grafana into $GOPATH/src.

# Pull sources required to build
go get 
cd $GOPATH/src/
npm install

Now you can make any sort of local changes, or just build from HEAD.

$ go run build.go setup              # takes 45 seconds
$ time go run build.go build pkg-rpm # takes about 7 minutes

The build will spit out an RPM in a folder called dist:

Created package {:path=>"./dist/grafana-5.0.0-1517715437pre1.x86_64.rpm"}

Building KDAB hotspot ‘perf’ visualization tool on Fedora

As any respectable software performance person knows, perf is your best (only?) friend. For example, perf report -g has shined a light into the deepest, darkest corners of debugging territory.  Since you asked, it can happily run in a container, too (albeit requiring elevated privileges, but we’re debugging here…).

Typically console-formatted output is fine for grokking perf reports, but having recently become addicted to go’s pprof visualization (dot format), handy flame graphs, and on the morbid occasion, VTune, I started looking around for a way to more clearly understand a particular perf recording.

Googling turned up an interesting QT-based tool called hotspot by a company called KDAB.  Screenshots indicate it might be worth kicking the tires.

After some bouncing around figuring out Fedora equivalent package names, I was able to quickly build and run hotspot.  I ran a quick perf record to see if it was going to work at all:

$ sudo perf record --call-graph dwarf sleep 10
$ ./bin/hotspot ./

And voila…


Folks at KDAB even included a built-in flame graph:


The interface is clean, bug-free and useful.  Trying to load a large file was a bit ugly and RAM-intensive; I would likely stick to command-line parsing for those.  Or, as we do in pbench, reduce the collection frequency to 100Hz and take bite-sized samples over the life of the test.
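
That pbench-style approach can be sketched like this (my paraphrase of the idea, not pbench’s actual implementation):

```shell
# Hypothetical sketch: sample system-wide at only 100 Hz, in short chunks,
# so each perf.data file stays small enough for hotspot to load comfortably.
for i in 1 2 3; do
    sudo perf record -F 100 -a -g -o "perf.data.$i" -- sleep 30
done
```

Each resulting perf.data.N can then be opened in hotspot individually instead of fighting one giant recording.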