My father worked with some of the very first computers ever imported to Italy. It was a time when a technician was a temple of excellence built on three pillars: on-the-field experience, a bag of technical manuals, and a fully-stocked toolbox. It was not uncommon for a missing manual or replacement part to turn into a day-long trip from the customer’s site to headquarters and back.
Things are very different now, but not that much. Technical manuals are often embedded and contextual, and we have that little thing called the Internet to answer pretty much any question. But even with all the references easily accessible, experience remains irreplaceable, and that leaves us with the very reason I’m typing these words: the toolbox.
Now, back to my father’s bag of tricks. I still have some pieces from his original set and a few screwdrivers that don’t look special at all. But every time I have to deal with some harder-than-usual screws, I find myself gladly giving up the magnetic tip, rubber grip, and ratcheting mechanism of the newer models and falling back to those old, absolutely normal devices that never fail to deliver.
I mess with Infra all day, and while my toolbox a few years ago was somewhat akin to my father’s, now it looks much more like a Software Engineer’s. I routinely add and remove programs that help me operate with confidence and ease across the very diverse aspects of my job, but I had only marginally considered the implications of those choices. Until today.
“Trust me, I’m an Engineer” – How I broke production twice in the span of an afternoon
Enter Lens. I’ve been using it for a little over a year now to manage a few dozen Kubernetes clusters. I like the bird’s eye view it gives me, and the ability to deep dive fast into a resource’s spec and, at the same time, quickly glance at the status of a resource and move to correlated ones. It’s a tool that really lets you move faster.
Like any other morning, I fire up the Electron-based app to start the day. New update? Cool, apply it. I go through most of my day without a hiccup, but then it happens. I’m the on-call Engineer for the week, and I just got a page for a Pod that’s not processing data. A look at the logs reveals it’s stuck in Full GC; not good. Let me reschedule it (no reason to roll the whole Deployment) while I file a bug with the associated Dev team. Check the Pod name from the page, find it, and click “Delete.”
Except, it’s not the Pod that goes away. It’s the whole Deployment. Cue a brief moment of incredulity, and then:
Let’s hope it’s just a glitch…
…nope. All Pods are gone.
"Dude, I need some backup, I messed up"
With the help of a couple of colleagues, I promptly fired up the correct CD pipeline to re-deploy the missing manifest, while alerting the other on-call engineers and my boss. Following the recovery, I took the unavoidable walk of shame in Slack detailing in full what I did to cause and solve the incident.
After a strong coffee, lots of curses, and a rapid assessment of my finger count and the basic ability to type, I went back to the next page: “Kafka topic has under min in-sync replicas.”
A rapid inspection of the worker node showed me that it wasn’t healthy, so it was just a matter of moving the Pod from that StatefulSet to a new node. Been there, done that. Switch region, switch cluster, enter the StatefulSet, locate the Pod (it’s easy, it’s the one flagged as NotReady with a nice ⚠️), and click “Delete.”
“Let’s not fat-finger this one too; check the name on the confirmation dialog, pod/kafka-9889hnj… OK, all right.”
Except, it’s not the Pod that goes away. It’s the whole StatefulSet.
It can’t be, it’s surely an interface glitch…
…nope. All Pods are gone. Just like before.
HOW IS THAT POSSIBLE?!
Now, it was immediately clear to me that this was a different beast. A Deployment may come and go, and only some applications or customers might even notice. But if a datastore goes away, that’s bound to create ripples in a few places.
"Dude, are you still online? It happened again, this time with Kafka."
Despite the situation definitely being worse, I was mindful of the earlier lesson. In the span of a few seconds (no doubt also thanks to the previous generous caffeine intake), I fired a PR to redeploy the missing manifest. While the PR was being applied, I pinged the other on-call engineers and my boss, and started the procedure to declare an Incident right away.
A few minutes later, Kafka was back healthy and fully in sync among all brokers. While waiting for the applications to converge to a steady state, I started questioning myself. How could I make a rookie mistake twice in the span of an afternoon? Was I really that clumsy?
I didn’t start checking for defects in Lens earlier, partly because of the on-call duties, and partly because I wasn’t looking for a cheap way out by blaming something other than my own inadequacy. I assumed it was human error and went on with my day. But a quick inquiry showed that I had been bitten by a nasty, recently introduced bug in Lens itself.
Then it clicked: this morning’s update! 🤦 I immediately posted a warning in the Incident channel (later propagated to the whole engineering department) about this possible behavior.
When the dust settled, I started questioning what had happened, especially the legitimacy of using Lens in a production environment.
Should I have not used it?
Should I have gotten approval for that?
And if so, what parameters should be evaluated to assess the level of confidence to place in a piece of software that, in order to perform system administration, will have access to the keys of the kingdom?
Do we really need yet another approval process, an explicit and enforced standardization across teams, departments, and organizations?
Stick to the basics
The easiest response would be “stop overcomplicating stuff, and stick to the basic tools.” I have three issues with that approach:
- What’s the definition of “basics?” (complexity)
At first, it might seem an easy question, but things change, and they tend to do so rapidly. Is kubectl a necessary evil, while Octant and every other dashboard are really an over-complication? To put it another way: when does a tool stop being a convenience and become something without which I’m unable to do my job?
Scale and size play a role in this aspect; you can’t manage tens or hundreds of Kubernetes clusters, or enrich your production environment with so many features and additional components, and expect not to use some specific tooling to address them. Flying too low on the tech stack requires a higher degree of “external” abstraction (human-issued or human-scripted) that, like everything, is prone to error. I feel much more confident leaving CustomResource handling to istioctl rather than writing those resources myself.
- What about productivity? (time)
To me, this is the core of the whole ordeal, the big dilemma. How much of my toolbox can I sacrifice while still being able to move fast? As I learned firsthand, there’s a hidden cost that’s hard to factor in. Sometimes, the added complexity is balanced with a substantial added value, but it’s not easy to draw a line.
As much as this is a non-problem when building stuff, it gets trickier during Incident Response, when time is of the essence and a smarter tool can make the difference between a quick recovery and a slow one. Chaining tens of commands to retrieve and interpret meaningful data (or to touch the right knobs) could turn a simple service degradation into a full-blown outage.
- “Basics” can hurt too
No matter how old school you go, there’s always a risk that tooling will bite you. A similar scenario happened a few months back with vanilla kubectl too. Would that have warranted a fallback to plain API calls with cURL? When the various OpenSSH vulnerabilities plagued previous generations of SysAdmins (just kidding, it was still me), nobody questioned SSH, reverted to Telnet, or went looking for an alternative.
At the opposite end of the spectrum, there’s the “trust nothing, build everything yourself” approach.
I don’t think it’s worth it most of the time, because you probably won’t need a better kubectl or a more feature-rich openssl. There is substantial risk associated with overlays that need proper servicing and maintenance (usually piling up in your backlog). Chances are, you’re just wrapping an abstraction layer of business logic around basic tooling to perform more complex tasks. That makes perfect sense, as long as you’re aware of the implications it carries. Let’s be realistic: your homegrown templating engine won’t work better than the battle-tested ones already out there.
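To make that “thin business-logic wrapper around basic tooling” idea concrete, here’s a minimal sketch of a deletion guard, the kind of safety net that would have saved my afternoon. Everything here is hypothetical: the wrapper echoes the kubectl command instead of executing it, so it can be dry-run anywhere.

```shell
# safe_delete: a hypothetical guard around "kubectl delete" that refuses
# anything that is not a single, fully-qualified Pod reference.
# It echoes the command instead of executing it, so it can be dry-run anywhere.
safe_delete() {
  target="$1"
  case "$target" in
    pod/*)
      # A real wrapper would exec: kubectl delete "$target"
      echo "kubectl delete $target"
      ;;
    *)
      echo "refusing to delete '$target': only pod/<name> is allowed" >&2
      return 1
      ;;
  esac
}
```

With this in place, `safe_delete pod/kafka-2` prints the command it would run, while `safe_delete statefulset/kafka` is rejected with a non-zero exit code. Ten lines of shell, yet they encode exactly the invariant the GUI broke: only ever delete a single Pod.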
A better approach
OK, so what? I’m far from having a definitive answer to the many questions I posed, but if I had to plan how to prevent this from happening again (take it as a self-imposed post-mortem retrospective), I’d consider the following.
Standardize the toolbox, but leave the choice
It’s unlikely (and a tad unreasonable) to expect 100% compliance with the “standard” tooling from users. I mean, there’s a reason why protocols are open and standards are a thing, and it’s precisely this.
If you’re not in a regulated industry where compliance is mandatory (and where the above incident would have gotten me written up, or worse), it’s better to mandate a shared toolbox across the minimal functional unit (team or department), lowering the friction of adding new tools as much as possible and automating version upgrades from the toolkit itself. This way, you can be sure that all engineers use a set of tools that is auditable and consistent, without sacrificing effectiveness.
You can leverage some projects to enforce desired versioning:
- At the source (within Git): for example, with conftest to check semantic logic.
- At the destination, with policy engines like OPA or Kyverno (if you scope it to Kubernetes only).
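Alongside those policy engines, even a small CI check can pin toolbox versions. A minimal sketch, assuming a plain-text allowlist and a second file describing what’s installed; a real implementation would query the binaries themselves (e.g. `kubectl version --client`), and every name here is hypothetical:

```shell
# check_toolbox ALLOWLIST INSTALLED
# Both files hold "tool version" pairs, one per line. Reports every tool
# whose installed version differs from the allowlisted one and returns
# non-zero. (A real toolbox would query the binaries instead of a file.)
check_toolbox() {
  status=0
  while read -r tool want; do
    have=$(awk -v t="$tool" '$1 == t {print $2}' "$2")
    if [ "$have" != "$want" ]; then
      echo "MISMATCH: $tool installed=$have allowed=$want"
      status=1
    fi
  done < "$1"
  return $status
}
```

Run in CI, a drifted tool fails the pipeline with a one-line explanation, which is usually enough to keep a shared toolbox honest without adding a heavyweight approval process.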
The bottom line is this: build a shared toolbox, with many utilities, and create a culture of collaboration around it. Don’t dismiss edge cases by allowing only mainstream, well-known instruments; encourage ownership and evangelism instead.
Trust, but verify
The tool choice, albeit frictionless in practice, should follow some clear admission criteria: parameters good enough to pass an audit from a third-party reviewer. A selection should take into account factors like:
- The size and activity level of the community behind a project. An engaged community will be more responsive to issues and quicker to act on them.
- The release frequency and change lead time. A project released too frequently will require many updates in your toolbox, but if the releases are too far apart, a fix (or feature) may be missing for a long time.
- The software supply chain status. Is there a Software Bill of Materials? Are binaries signed? Do they require an obscure install.sh chained from a cURL call to GitHub?
This requires doing a minimum of research before considering a new project a useful addition to the team arsenal.
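Part of that minimum research can be automated. Here’s a minimal sketch of a checksum gate to run before installing a downloaded artifact (`verify_artifact` is a hypothetical helper; signature verification, e.g. with Sigstore’s cosign, would be a stronger follow-up):

```shell
# verify_artifact FILE EXPECTED_SHA256
# Refuses to proceed when the artifact's checksum doesn't match the
# published one. A bare-minimum alternative to "curl | sh" installs.
verify_artifact() {
  file="$1"
  expected="$2"
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file matches its published checksum"
  else
    echo "FAIL: $file checksum mismatch" >&2
    return 1
  fi
}
```

It doesn’t prove the publisher is trustworthy, but it does prove the bytes you downloaded are the bytes they published, which already rules out a whole class of tampered-mirror problems.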
If you hit a narrow, specific case that needs internally-built tooling, start your project with the early goal of making it open-source. There are so many benefits to this approach that it’s hard to sum them up in a few lines, but to cite a few:
- You’ll be pushed to write better code.
- You’ll be required to automate with transparency.
- The lifecycle maintenance will be shared with the community.
The experience of collaborating on a public project is invaluable in terms of upskilling and visibility, but it’s often relegated to a tiny fraction of a person’s free time. Giving engineers time to work on public projects during work hours is a form of investing in (or simply giving back to) Open Source that should become a common occurrence in companies.
What’s next (on embracing failure)
As unpleasant as the experience was, I now see more clearly how tooling affects not just productivity, but also the safety and resiliency of any production Infrastructure I manage. Looking forward, I see a few things happening:
- Reduce the need for tooling (prioritize resiliency)
I wouldn’t have caused these issues if I hadn’t had to intervene manually in the first place. Tooling is needed when Engineering fails. A Deployment should be able to recover from a Pod that’s not healthy, and a node with degraded network or storage performance should be automatically drained, taken out of service, and replaced with a new (hopefully good) one. These are all solved problems, and we should take the time and effort to implement the solutions.
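For the Pod case, the standard building block is a liveness probe, which lets the kubelet restart an unhealthy container without a human in the loop. A minimal, illustrative container-level fragment (the endpoint, port, and thresholds are assumptions, to be tuned per workload):

```yaml
# Illustrative container-level probe; path, port, and thresholds are assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```

Had my Full-GC-stuck Pod failed a probe like this, the kubelet would have restarted it long before a page reached me.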
- Introduce Chaos testing (evaluate the infra-app coupling)
Things will break in the most creative and troubling ways possible. Following the previous point, we should be able to handle the most common scenarios without manual action. That means not just planning for failure, but testing that we can take the hit without consequences. There’s been substantial improvement in the quality of Chaos testing frameworks, and it’s easier than ever to define and run experiments covering a number of plausible scenarios.
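As an illustration, assuming a framework like Chaos Mesh is installed, an experiment that kills a single Kafka Pod in a non-production namespace looks roughly like this (names and labels are hypothetical):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-kafka-pod   # hypothetical experiment name
  namespace: staging         # non-production on purpose
spec:
  action: pod-kill
  mode: one                  # kill a single matching Pod
  selector:
    labelSelectors:
      app: kafka             # hypothetical label
```

If the brokers re-sync and applications barely blink, you’ve proven the resiliency work from the previous point actually holds; if not, you found out in staging instead of on-call.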
- Drill tests (improve the reaction)
Despite all of this, incidents will still happen. Perform drill tests where engineers have to react to real (non-prod 😄) outages and restore functionality. Define and measure MTTR, maintain up-to-date documentation, and optimize runbooks for quick and effective execution. Review your Incident Management policy. And finally, the difficult one: spread a culture of blameless response, where everyone knows it’s safe to admit mistakes, ask for help, and be open about their limits.