<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[TOPOLOGY]]></title><description><![CDATA[Think deeper. Build better. Lead clearly.]]></description><link>https://sidh4u.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!jPAX!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e2c2450-3770-4bd7-839b-93d57409b94f_512x512.png</url><title>TOPOLOGY</title><link>https://sidh4u.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 09 May 2026 21:24:32 GMT</lastBuildDate><atom:link href="https://sidh4u.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Sidhartha Mandal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[sidh4u@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[sidh4u@substack.com]]></itunes:email><itunes:name><![CDATA[Sidhartha Mandal (sidh4u)]]></itunes:name></itunes:owner><itunes:author><![CDATA[Sidhartha Mandal (sidh4u)]]></itunes:author><googleplay:owner><![CDATA[sidh4u@substack.com]]></googleplay:owner><googleplay:email><![CDATA[sidh4u@substack.com]]></googleplay:email><googleplay:author><![CDATA[Sidhartha Mandal (sidh4u)]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Four People Who Built What We Call AI]]></title><description><![CDATA[It started with a Name. 
It ended in 1948.]]></description><link>https://sidh4u.substack.com/p/the-four-people-who-built-ai</link><guid isPermaLink="false">https://sidh4u.substack.com/p/the-four-people-who-built-ai</guid><dc:creator><![CDATA[Sidhartha Mandal (sidh4u)]]></dc:creator><pubDate>Thu, 07 May 2026 08:04:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IG0k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IG0k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IG0k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png 424w, https://substackcdn.com/image/fetch/$s_!IG0k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png 848w, https://substackcdn.com/image/fetch/$s_!IG0k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png 1272w, https://substackcdn.com/image/fetch/$s_!IG0k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!IG0k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png" width="1200" height="679" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:679,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126337,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sidh4u.substack.com/i/196748467?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IG0k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png 424w, https://substackcdn.com/image/fetch/$s_!IG0k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png 848w, https://substackcdn.com/image/fetch/$s_!IG0k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png 1272w, https://substackcdn.com/image/fetch/$s_!IG0k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd90bb8e-0b59-4b90-b3b3-3baaa7f41c7c_1200x679.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I use Claude every day. It has quietly become part of how I think - a place to stress-test ideas, draft things, ask questions I&#8217;d be embarrassed to Google. After a while you stop noticing the tool and just notice the thinking.</p><p>Then one afternoon I asked it something I hadn&#8217;t thought to ask before: Why are you called Claude?</p><p>The answer - and the rabbit hole that followed - turned into this article. Because the name points to a person. 
And that person points to a moment in the 1940s and 50s when three or four human beings, working within a few years of each other, laid down the mathematics and the questions and the vocabulary that everything we call AI is built on.</p><p>One of those people is still alive. He just won a Nobel Prize. And he&#8217;s worried.</p><p>Let&#8217;s start with the name.</p><h3>Why Claude? And Why Anthropic?</h3><p>Anthropic - the company that built Claude - takes its name from the Greek word anthropos, meaning human. It&#8217;s a values statement embedded in the company name: we are building AI aligned with human interests. That&#8217;s the mission, right there in the etymology.</p><p>The product name is a different kind of tribute. <strong>Claude</strong> is almost certainly named after <strong>Claude Shannon</strong> - not officially confirmed, but widely acknowledged inside Anthropic and obvious once you understand what Shannon actually did. The AI built on the mathematics of information is named after the man who invented <a href="https://en.wikipedia.org/wiki/Information_theory">information theory</a> itself.</p><p>Which raises the question: What did Claude Shannon actually do?</p><h3>Claude Shannon - The Man Who Invented Information</h3><p><strong>Bell Labs, 1948</strong></p><p>In 1948, a mathematician at Bell Labs named Claude Shannon published a 77-page paper called <a href="https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf">A Mathematical Theory of Communication</a>. It contains, tucked inside the mathematics, one of the most consequential inventions of the twentieth century.</p><p style="text-align: center;"><strong><a href="https://en.wikipedia.org/wiki/Bit">The bit</a>.</strong></p><p>Not the bit as a vague concept. Shannon defined it formally: <em>the fundamental unit of information</em>, the answer to a single yes-or-no question, represented as 0 or 1. 
Before Shannon, information was fuzzy - you could say a message was long or complex but you couldn&#8217;t measure it. Shannon gave information a unit, the way temperature has degrees and distance has metres.</p><p>His deeper insight was that any information - text, sound, images, anything - could be broken down into a sequence of these binary choices. It didn&#8217;t matter what the content meant. What mattered was how many yes/no decisions were needed to encode it.</p><p>Every hard drive measured in terabytes, every internet connection in megabits per second, every compressed image, every encrypted message - all of it is Shannon&#8217;s mathematics, running silently underneath.</p><p>But there&#8217;s something else in that 1948 paper that most people miss. Shannon described how systems that generate sequences probabilistically - choosing the next symbol based on the current state - could produce outputs that resemble language. He was describing, in 1948, the conceptual skeleton of what we now call a large language model (LLM). He didn&#8217;t have computers powerful enough to build one. He had the mathematics.</p><p>Shannon was also, by all accounts, a wonderfully strange person - and the strangeness is worth mentioning because it shaped how he thought. He rode a unicycle through the Bell Labs corridors. He juggled. He built a machine called <a href="https://mitmuseum.mit.edu/collections/object/2007.030.001">Theseus</a> - a mechanical mouse that could navigate a maze, remember the solution, and find its way back on subsequent attempts. In 1951, he calculated the information content of English and concluded that roughly half of every sentence is redundant - which is why you can <em>rd ths sntnc wth hlf th ltrs mssng</em> and still understand it perfectly. That wasn&#8217;t a party trick. It was the information theory, applied.</p><p>That <em>Theseus machine</em> is worth pausing on. 
A system that attempts a problem, remembers what worked, and performs better on the next attempt - that is not a historical curiosity. That is the core loop of modern machine learning. The maze is different. The scale is unimaginable by 1950 standards. But the logic - <em>try</em>, <em>remember</em>, <em>improve</em> - is the same logic that runs inside every AI model being trained today. Shannon built it out of copper and relays in a Bell Labs workshop. We now run it on hundreds of thousands of GPUs simultaneously. The idea didn&#8217;t change. Only the hardware did.</p><p style="text-align: center;"><em>He published the rulebook for the digital age in 1948 - before most computers existed. We are still playing by his rules.</em></p><h3>Alan Turing - The Man Who Asked the Question</h3><p><strong>Manchester, 1950</strong></p><p>Two years after Shannon&#8217;s paper, a British mathematician named Alan Turing published one of his own. It opens with a sentence that changed the direction of science:</p><p style="text-align: center;"><em>&#8220;<a href="https://courses.cs.umbc.edu/471/papers/turing.pdf">I propose to consider the question: Can machines think?</a>&#8221;</em></p><p>Turing had already spent the previous decade doing work that most people still don&#8217;t know about. In 1936, at 24 years old, he had published his theory of the <a href="https://en.wikipedia.org/wiki/Turing_machine">Turing Machine</a> - an abstract model of computation that described, for the first time, exactly what any computer could and could not do. Not a specific machine. Any machine. The theoretical limits of the entire category, defined fourteen years before most computers existed.</p><p>Then came the war. Turing worked at Bletchley Park and was instrumental in breaking the Nazi Enigma cipher. Historians estimate this shortened the war by two to four years - a number that translates into millions of lives.</p><p>Back to 1950. 
Turing knew the question &#8220;Can machines think?&#8221; was unanswerable as stated, so he replaced it with something testable: the Imitation Game, which we now call the <a href="https://en.wikipedia.org/wiki/Turing_test">Turing Test</a>. Put a human judge in text conversation with both a human and a machine. If the judge cannot reliably tell which is which, the machine has passed. It is still debated endlessly. But Turing&#8217;s real contribution wasn&#8217;t the test - it was the insistence that the question deserved to be asked seriously. That machine intelligence was a scientific problem, not a philosophical fantasy.</p><p>In 1952, the British government prosecuted him for homosexuality. He was subjected to chemical castration. He died in 1954, aged 41. <em>He received a royal pardon in 2013.</em></p><p>The <a href="https://en.wikipedia.org/wiki/Turing_Award">Turing Award</a> - computing&#8217;s Nobel Prize - has been given in his name every year since 1966. One of its recipients, decades later, would finally build what Turing had only imagined.</p><h3>John McCarthy - The Man Who Named It</h3><p><strong>Dartmouth College, 1956</strong></p><p>If Shannon built the mathematics and Turing asked the question, John McCarthy gave the whole enterprise its name - and with it, an identity.</p><p>In the summer of 1956, McCarthy organised a two-month workshop at Dartmouth College in New Hampshire. In his proposal for that workshop, he used a phrase that had never appeared before in print: <em><strong>Artificial Intelligence (AI)</strong></em>.</p><p>Before Dartmouth 1956, there was no field called AI. There were researchers thinking about thinking machines, about logic and computation and automata. McCarthy put them under one roof and gave them a common name. Naming a thing is more powerful than it sounds - by calling it artificial intelligence, he was claiming that the goal was not just automation or calculation, but something that deserved the word intelligence. 
That framing shaped research agendas, funding decisions, and public perception for the next seventy years.</p><p>McCarthy was famously optimistic about the timeline. He reportedly believed the core problems of machine intelligence could be solved in that two-month summer. It took seventy years. But the name he chose in that proposal is the one we still use - in every headline, every job title, every government policy document, every conversation about what is happening to the world right now.</p><p style="text-align: center;"><em>He thought it would take a summer. It took seventy years. The name, at least, was right.</em></p><h3>Geoffrey Hinton - The Man Who Kept the Faith</h3><p><strong>Toronto &#8594; Google &#8594; everywhere, 1986&#8211;2024</strong></p><p>Before we get into what Hinton did - a small note on what he is called. The press, the research community, and now the Nobel Committee all use the same phrase when they talk about him: <em><strong>The Godfather of AI</strong></em>. It is one of those titles that sounds like hyperbole until you understand the actual history. By the time you finish this section, you will understand why it fits.</p><p><em>The first three people built the theory. Geoffrey Hinton built the thing.</em></p><p>By the 1970s and 80s, AI research had hit a wall. The ideas were there - <em><strong>neural networks</strong></em>, <em><strong>machine learning</strong></em>, <em><strong>pattern recognition</strong></em> - but the computers weren&#8217;t powerful enough to make them work at scale. Funding dried up. Researchers moved on. The field went through what historians call the AI winters - long periods of scepticism and silence when almost everyone concluded that machine intelligence was further away than anyone had thought.</p><p>Almost everyone. 
Geoffrey Hinton kept going.</p><p>Hinton, a British-Canadian cognitive psychologist and computer scientist, spent decades working on <em><strong>neural networks</strong></em> - systems modelled loosely on the structure of the brain, where layers of simple units learn to recognise patterns through repeated exposure to data. The idea was not new. Frank Rosenblatt had proposed the Perceptron in 1958. But a famous 1969 book by Marvin Minsky and Seymour Papert, <em>Perceptrons</em>, had identified what seemed like fundamental limitations of neural nets, and most of the field abandoned them.</p><p>Hinton didn&#8217;t. Through the 80s and 90s, largely unfashionable and working with limited resources, he kept developing the algorithms - particularly <em><strong>backpropagation</strong></em>, the method by which neural networks learn from their mistakes - that would eventually make <em><strong>deep learning</strong></em> possible.</p><p>Then came the compute. GPUs, originally built for video games, turned out to be extraordinarily well-suited to the parallel calculations that neural networks require. When cheap, powerful GPU computing arrived, Hinton&#8217;s decades of theoretical work suddenly had an engine.</p><p>In 2012, his lab at the University of Toronto entered a neural network called AlexNet into the ImageNet competition - a benchmark for image recognition involving one million images across one thousand categories. AlexNet didn&#8217;t just win. It won by roughly ten percentage points. The margin was so large that it ended the debate. Deep learning was no longer a fringe idea kept alive by one stubborn researcher. 
It was the only idea that mattered.</p><p>Everything since - every large language model, every image generator, every voice assistant, every recommendation algorithm - descends directly from the architecture that Hinton spent thirty years keeping alive.</p><p>In 2018, Hinton shared the Turing Award with Yann LeCun and Yoshua Bengio - the three researchers who together built the deep learning foundations of modern AI. In 2024, he was awarded the Nobel Prize in Physics.</p><p>And then he left Google and started talking publicly about what he had helped create.</p><p style="text-align: center;">"I console myself with the normal excuse: if I hadn't done it, someone else would have. But I'm not sure that's true." - Geoffrey Hinton, 2023</p><p>The man who spent thirty years in the wilderness to prove that neural networks could work is now one of the most prominent voices warning about what happens when they work too well. That is not an irony. That is a conscience.</p><h3>The Thread</h3><p>What I find remarkable is how tight the timeline is. Shannon&#8217;s paper: 1948. Turing&#8217;s paper: 1950. McCarthy&#8217;s Dartmouth workshop: 1956. Hinton&#8217;s backpropagation work: 1986. AlexNet: 2012. ChatGPT: 2022.</p><p>The foundational ideas were all in place by 1956 - within eight years, by three people who knew each other, cited each other, built on each other. Shannon was at Dartmouth. Turing and Shannon had met. McCarthy built on both.</p><p>What took the next sixty-six years was not new ideas. It was compute catching up to the ambition of the mathematics. The ideas were always there. The machines were not yet powerful enough to run them.</p><p>Hinton provided the bridge - the decades of work that kept the neural network approach alive until the hardware finally arrived. Without him, the gap between McCarthy&#8217;s 1956 workshop and GPT-4 might have been even longer.</p><p style="text-align: center;"><em>The mathematics: Shannon. The question: Turing. 
The name: McCarthy. The proof: Hinton. Everything else is built on top of those four contributions.</em></p><p>I use Claude every day. I now know, in a way I didn&#8217;t before I went down this rabbit hole, exactly whose work I am sitting on top of. That feels like something worth knowing.</p><p>Subscribe for free to receive new posts and support my work - or go paid for early access, deeper dives, and the occasional piece that never goes public.</p>]]></content:encoded></item><item><title><![CDATA[The Upgrade That Almost Wasn't]]></title><description><![CDATA[A production EKS upgrade gone wrong taught me everything this post covers. Control plane sequencing, pre-flight checks, environment progression, Terraform targeting, and why the rollback question is the wrong question. 
The playbook for zero-downtime Kubernetes upgrades on AWS.]]></description><link>https://sidh4u.substack.com/p/zero-downtime-eks-upgrades</link><guid isPermaLink="false">https://sidh4u.substack.com/p/zero-downtime-eks-upgrades</guid><dc:creator><![CDATA[Sidhartha Mandal (sidh4u)]]></dc:creator><pubDate>Sat, 02 May 2026 11:03:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-5rB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-5rB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-5rB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png 424w, https://substackcdn.com/image/fetch/$s_!-5rB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png 848w, https://substackcdn.com/image/fetch/$s_!-5rB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png 1272w, https://substackcdn.com/image/fetch/$s_!-5rB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-5rB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png" width="1200" height="590" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:128063,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sidh4u.substack.com/i/196202402?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-5rB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png 424w, https://substackcdn.com/image/fetch/$s_!-5rB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png 848w, https://substackcdn.com/image/fetch/$s_!-5rB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png 1272w, https://substackcdn.com/image/fetch/$s_!-5rB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18e5d858-4bf5-4292-af6a-90c3aa0d92e7_1200x590.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>It&#8217;s 4:47pm. The control plane upgrade finished twenty minutes ago and everything looks fine. The API server is responding. System pods are healthy. So you move to the node groups - because the runbook says so, and the runbook has always been right.</p><p>Then a Slack message from on-call: &#8220;ALB stopped routing to the new nodes. Users are hitting 502s.&#8221;</p><p>You stare at the screen. The nodes are Ready. The pods are Running. The target group shows healthy. 
And yet.<br><br>Forty minutes later, you find it: a node affinity rule, written eight months ago by someone who has since left the team, that was silently excluding the new node group from a critical deployment. The pods were running - just not the right pods, on the right nodes, behind the right load balancer.<br><br>No one wrote it down. No runbook covered it. And now you&#8217;re explaining a 40-minute degradation to your VP of Engineering at midnight.</p><p style="text-align: center;"><em>I&#8217;ve been in that room. This article is everything I wish I&#8217;d known before I was.</em></p><h3>The Rule Nobody Actually Follows</h3><p style="text-align: center;"><em><strong>Control plane first. Node groups second. Never both simultaneously.</strong></em></p><p>Say it out loud. It sounds obvious. And yet I have watched teams - smart teams, teams with good intentions - violate this rule constantly. Not out of carelessness, but because their automation was doing something they didn&#8217;t fully understand, or because the pressure to &#8220;just get it done&#8221; overcame the discipline to do it right.</p><p>Here&#8217;s why the rule exists. AWS guarantees that your kubelets can run up to three minor versions behind the control plane - if your nodes are on 1.25 or newer. A control plane on 1.32 can manage nodes on 1.29. That gap is not a bug - it is a deliberately engineered operating window. It exists to give you time to validate that the control plane upgrade is stable before you touch a single node. One important caveat: if your nodes are still on kubelet 1.24 or older, the older two-version skew applies. If you're running anything current, you're in the three-version world.</p><p><strong>Use that window. 
</strong>Don&#8217;t race through it because you want to finish before midnight.</p><p>The teams I&#8217;ve seen upgrade with the least drama are the ones who treat the gap between control plane and node upgrade as intentional breathing room - time for their monitoring to settle, for their operators to reconnect, for them to look at the cluster and ask: does this feel right?</p><h3>Before You Touch Anything: The Pre-Flight</h3><p>The most expensive mistakes I&#8217;ve witnessed in EKS upgrades didn&#8217;t happen during the upgrade. They happened in the days before it, when teams skipped checks they assumed wouldn&#8217;t matter.</p><h4>Check your add-on compatibility matrix</h4><p>AWS managed add-ons - CoreDNS, kube-proxy, VPC CNI - have specific version compatibility matrices for each Kubernetes minor version. These are not suggestions. If your VPC CNI version is incompatible with the target Kubernetes version, new pods will fail to get network interfaces. New pods means no new capacity. No new capacity during a rolling node upgrade means workloads pile up on draining nodes and your cluster grinds to a halt.</p><p>Check this first. Every time.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;98692ef5-7709-415d-9f59-9eaaf3015492&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">aws eks describe-addon --cluster-name &lt;cluster&gt; --addon-name coredns
aws eks describe-addon --cluster-name &lt;cluster&gt; --addon-name kube-proxy
aws eks describe-addon --cluster-name &lt;cluster&gt; --addon-name vpc-cni</code></pre></div><h4>Hunt your pod disruption budgets</h4><p>I once spent 90 minutes debugging a node drain that wasn&#8217;t moving. The node was cordoned. The drain command was running. Nothing was happening.</p><p>A PDB with maxUnavailable: 0 was doing exactly what it was designed to do: refusing to allow any disruption. The PDB was correct for its original purpose. But its original purpose was three months and two team members ago.</p><p>Find every PDB in your cluster before upgrade day. Review each one. Ask whether the constraint is still appropriate. Don&#8217;t find out mid-drain.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;25dfbc22-6206-4078-aa39-34a8f45adc50&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">kubectl get pdb -A</code></pre></div><h4>Scan for deprecated API versions</h4><p>Kubernetes removes APIs that were deprecated two or three versions earlier. You will not get a warning at runtime. Your workloads will simply stop deploying after the upgrade - because the API version they reference no longer exists.</p><p>Tools like <strong>Pluto</strong> or <strong>kubent</strong> will scan your cluster and flag deprecated API usage before it becomes a 2am problem. Run them. Fix what they find. Then run them again after your fixes to confirm.</p><h4>Check your headroom</h4><p>During a rolling node replacement, pods from draining nodes need somewhere to land. If your cluster is running at 85% utilisation with no autoscaling headroom, they have nowhere to go. The upgrade stalls. Nodes queue behind each other waiting for capacity that isn&#8217;t coming.</p><p>Temporarily bump your node group min size before the upgrade. Or confirm your cluster autoscaler has room to expand. 
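</p><p>One way to create that headroom is to raise the node group&#8217;s minimum size just before the upgrade. A minimal sketch with the AWS CLI - the cluster name, node group name, and sizes below are placeholders, not values from this post:</p>

```shell
# Hypothetical names and sizes: substitute your own cluster and node group.
# Raise the floor so pods evicted from draining nodes have somewhere to land.
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name app-nodes \
  --scaling-config minSize=5,maxSize=12,desiredSize=6

# Confirm the change took effect before starting the node upgrade.
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name app-nodes \
  --query 'nodegroup.scalingConfig'
```

<p>Scale the minimum back down once the new nodes are stable - the extra headroom is a temporary upgrade aid, not a new baseline.</p><p>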
Either way - check before you start, not after you&#8217;re stuck.</p><h3>The Environment Progression: Why You Should Never Upgrade Production First</h3><p>This is the one practice that has saved me more times than any technical trick.</p><p>Never upgrade production first. Always follow the progression:</p><p style="text-align: center;"><em>POC &#8594; DEV &#8594; STAGE &#8594; PROD</em></p><p>Each environment in this chain serves a distinct purpose - and the discipline breaks down the moment you treat any of them as optional.</p><p><strong>POC </strong>is a temporary environment, spun up specifically to validate the upgrade path. It doesn&#8217;t need to mirror production perfectly. What it needs to produce is a <strong>runbook</strong>. Every decision, every issue, every resolution, written down as you go. The runbook is the primary output of your POC upgrade - not the working cluster.</p><p>But there&#8217;s a second thing POC must do that most teams skip: validate every unmanaged component against the target version. This is where the real surprises live.</p><p>AWS managed add-ons - CoreDNS, kube-proxy, VPC CNI - have compatibility matrices and AWS handles their upgrade path. Your external operators and controllers have no such safety net. The AWS Load Balancer Controller, Karpenter, Cluster Autoscaler, cert-manager, external-dns - each has its own Kubernetes version compatibility matrix, maintained separately, with its own release cadence. None of them will warn you at runtime if they fall out of compatibility. They&#8217;ll just start behaving incorrectly, or stop working entirely, in ways that may not be immediately obvious.</p><p>POC is where you find this out cheaply. Install every external operator and controller your production cluster runs. Upgrade the cluster. Watch what breaks. 
Specifically:</p><ul><li><p>Does the AWS Load Balancer Controller still reconcile ingress resources correctly after the upgrade?</p></li><li><p>Does Cluster Autoscaler still provision and register new nodes?</p></li><li><p>Does cert-manager still issue and renew certificates?</p></li><li><p>Does external-dns still sync records?</p></li><li><p>Do any of your custom operators - the ones your own team wrote - handle the new API versions correctly?</p></li></ul><p>If any of these fail in POC, you have found the issue at the cheapest possible moment - in a temporary cluster, with no users, and no pressure. Document the fix. Pin the compatible version. Add it to the runbook. By the time you reach PROD, you will have validated this component three times across three environments, and its behaviour will be a known quantity.</p><p><strong>DEV </strong>is your first real-world test. Real engineers use it. Real workload patterns emerge. I&#8217;ve seen compatibility issues surface in DEV that never appeared in POC - because real scheduling behaviour, real resource constraints, real network policy interactions only show up with real workloads. Fix the issues. Update the runbook.</p><p><strong>STAGE </strong>is where you should have no surprises. By this point you&#8217;ve done the upgrade twice. The runbook is battle-tested. If something new surfaces in STAGE, that&#8217;s a signal - your DEV environment doesn&#8217;t match STAGE closely enough. Fix the parity, not just the symptom.</p><p><strong>PROD </strong>is now a known quantity. The runbook is proven. The team has done this three times already. Muscle memory has replaced anxiety. The outcome is predictable - because you made it predictable.</p><p>This progression also solves a problem most teams don&#8217;t notice until it bites them: version drift between environments. 
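One way to keep that drift honest is to measure it. A hedged sketch - the version strings below are hypothetical stand-ins for what kubectl version reports against each environment:

```shell
# Minor version of a string like "v1.31.4" or "v1.31.4-eks-abc123"
minor_of() { echo "${1#v}" | cut -d. -f2; }

# How many minor versions separate two clusters (e.g. DEV vs PROD)?
drift() { echo $(( $(minor_of "$2") - $(minor_of "$1") )); }

drift v1.30.9 v1.32.1   # prints 2 - two upgrade cycles apart
drift v1.32.1 v1.32.1   # prints 0 - environments in lockstep
```

Anything above zero means your non-production upgrades are rehearsing a different jump than the one production will actually make.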
When DEV is one minor version behind PROD, and STAGE is somewhere in between, your non-production upgrades tell you very little about your production upgrade. The progression - treated as a pipeline, not a one-off event - keeps everything current.</p><h3>IaC Makes Upgrades Boring (That&#8217;s the Point)</h3><p>The single biggest change in how I think about EKS upgrades came when I started managing clusters entirely through Terraform.</p><p>Not because Terraform is magic. Because IaC (Infrastructure as Code) forces you to be explicit about sequencing - and sequencing is where most upgrade failures live.</p><p>The terraform-aws-modules/eks module structures your cluster configuration cleanly, but it does not automatically sequence the control plane and node group upgrades for you. A plain terraform apply after bumping cluster_version will kick off both the control plane and node group upgrades in the same operation - which violates the fundamental rule. The sequencing discipline still sits with you.</p><p>The pattern that gives you back control is targeted applies. You upgrade the control plane first, validate it completely, then explicitly upgrade node groups as a separate step:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;3045062e-4137-4071-8844-c04a7da1e7c0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># Step 1: Bump cluster_version in your config, then:
terraform apply -target=module.eks.aws_eks_cluster.this[0]

# Validate - kubectl get nodes, check system pods, confirm stable

# Step 2: Only then, upgrade node groups
terraform apply -target='module.eks.module.eks_managed_node_group["your-node-group-name"]'</code></pre></div><p>This gives you the validation window the fundamental rule demands. The config change is a single version bump - clean, auditable, reviewable in a PR:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;73eb29a2-2c1b-4df4-b915-d025b7e2828e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">module &quot;eks&quot; {
  source  = &quot;terraform-aws-modules/eks/aws&quot;
  version = &quot;~&gt; 21.0&quot;

  cluster_version = &quot;1.32&quot; # was 1.31

  cluster_addons = {
    coredns = { most_recent = true }
    kube-proxy = { most_recent = true }
    vpc-cni = { most_recent = true }
  }
}</code></pre></div><p>The apply is where you exercise the discipline - not the config. One change, two targeted applies, one validation gate in between.</p><p>One important nuance: most_recent = true is convenient for initial setup, but production clusters benefit from pinned add-on versions that you&#8217;ve explicitly validated. Use most_recent to discover the compatible version, then pin it:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;45b872b7-9863-41fe-8f31-1741309e9c0b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">coredns = {
  addon_version = &quot;v1.11.1-eksbuild.4&quot;
}</code></pre></div><p>This prevents an add-on version from changing unexpectedly on a subsequent apply. It gives you explicit control over what changes during an upgrade - which means you can explain every change in your post-upgrade review, because you chose every change.</p><p>One thing Terraform does not handle for you: external operators. The AWS Load Balancer Controller, Cluster Autoscaler, cert-manager, external-dns, observability - these live outside the EKS managed add-on umbrella. They need to be upgraded in the same cycle, managed through their Helm chart versions in Terraform or your GitOps tooling. An EKS upgrade is not complete until every component in the cluster - managed and unmanaged - has been validated at the new version.</p><h3>In-Place vs A/B: Choose Deliberately</h3><p>I&#8217;ve done both. Neither is universally correct. What matters is choosing deliberately and understanding the trade-offs before you&#8217;re mid-upgrade.</p><h3>In-place rolling upgrades</h3><p>AWS handles the cordon, drain, replacement, and rejoin sequence within the existing node group. Simpler to orchestrate. Lower temporary cost. No DNS complexity. The trade-off: less control over the exact replacement sequence, and no parallel environment to fall back to if something goes wrong mid-upgrade.</p><p>In-place works well when your workloads are well-understood, your PDBs are accurate, and you&#8217;ve done the pre-flight thoroughly. It&#8217;s the approach most teams should start with.</p><h3>A/B switching</h3><p>You provision a fully operational target environment alongside the current one, validate completely, then cut over. The rollback is clean: if something is wrong, traffic stays on the original.</p><p>At the cluster level, the critical complexity is networking. Your ALB DNS names, ingress endpoints, and load balancer IPs are tied to the original cluster. A new cluster means new load balancers with new DNS names. 
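When the time comes to move traffic, the DNS update itself can be staged ahead as a change batch. Everything below - the zone ID, record name, and ALB DNS name - is a hypothetical placeholder:

```shell
# Hypothetical change batch pointing the app record at the NEW cluster's ALB
cat > cutover.json <<'EOF'
{
  "Comment": "A/B cutover to upgraded cluster",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "k8s-newcluster-0000.us-east-1.elb.amazonaws.com" }]
    }
  }]
}
EOF

# Apply only after end-to-end validation passes (requires AWS credentials):
# aws route53 change-resource-record-sets \
#   --hosted-zone-id Z0000000EXAMPLE --change-batch file://cutover.json
```

The low TTL is deliberate: if validation missed something, pointing the record back is a 60-second operation, not an hour-long cache wait.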
The cutover sequence is non-negotiable:</p><ul><li><p>Create all resources in the new cluster</p></li><li><p>Validate end-to-end - every service, every endpoint, every health check</p></li><li><p>Update DNS records to point to the new endpoints</p></li><li><p>Only then decommission the old cluster</p></li></ul><p><strong>Cutting DNS before validation </strong>is the most common A/B failure mode. I&#8217;ve seen it happen. Don&#8217;t let it happen to you.</p><h4>A/B at the node group level</h4><p>This is often the best of both worlds. Provision a new node group at the target version alongside the existing one. Cordon the old nodes, migrate workloads, validate, remove the old group. You get the rollback safety of A/B without the external DNS orchestration of a full cluster switch.</p><p>The networking consideration here moves inward. During the transition, both old and new nodes are active cluster members, and services will route to pods on either. If your workloads use node selectors, affinity rules, or topology spread constraints tied to node group labels - the midnight 502 scenario from the opening - review and update them before migration. Confirm pods are fully healthy on new nodes before removing the old group.</p><h3>The Upgrade Itself</h3><h4>Upgrading the control plane</h4><p>AWS manages the underlying upgrade - etcd, API server, controller manager, scheduler. You initiate it; they handle the mechanics. Expect 15&#8211;25 minutes.</p><p>During this window, the API server will be briefly unavailable as it cycles. Any process making continuous Kubernetes API calls - CI/CD pipelines, operators, monitoring agents - will see transient errors. Well-written operators handle this with exponential backoff and recover automatically. Know which of your operators are well-written before the upgrade, not after.</p><p>After the control plane finishes: stop. Validate. Check that all system pods are healthy, all nodes show Ready, nothing is stuck in a restart loop. 
No node work begins until this check passes.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;cedb23b7-d79f-4949-9783-f04beb0b3f32&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed</code></pre></div><h4>Draining nodes</h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;cbc1a61d-d729-445d-93b1-8ccf47c90bf4&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">kubectl cordon &lt;node-name&gt;
kubectl drain &lt;node-name&gt; --ignore-daemonsets --delete-emptydir-data --timeout=300s</code></pre></div><p>The --timeout flag is not optional. Without it, a single stuck pod can block the drain indefinitely - and you won&#8217;t know until you&#8217;ve been waiting long enough to start questioning reality. Set a timeout that&#8217;s long enough for graceful shutdown but short enough to surface genuinely stuck pods.</p><h4>Upgrading add-ons - order matters</h4><p>Add-on ordering has a wrinkle that catches teams out. VPC CNI is the exception to &#8220;add-ons after node groups&#8221; - if your current VPC CNI version doesn&#8217;t support the target Kubernetes API, new pods will fail to get network interfaces before the control plane upgrade even completes. Check your VPC CNI compatibility first. If it needs updating, upgrade VPC CNI before the control plane, not after.</p><p>For everything else, the ordering after node groups is:</p><ul><li><p>kube-proxy first - it manages service routing rules on each node</p></li><li><p>CoreDNS last - DNS failures are visible but degrade more gracefully than networking failures</p></li></ul><p>The complete sequence, when VPC CNI needs a pre-upgrade bump, looks like this: <strong>VPC CNI &#8594; control plane &#8594; node groups &#8594; kube-proxy &#8594; CoreDNS &#8594; external operators</strong></p><p>VPC CNI handles pod networking. An incompatible version means new pods fail to get network interfaces. That&#8217;s upstream of everything else. kube-proxy manages service routing rules on each node. CoreDNS failures are visible and disruptive, but DNS can degrade gracefully in ways that networking cannot - hence the ordering.</p><p>Then: external operators. Every one of them, in the same planned window. Not as an afterthought. Not next week. 
An EKS upgrade is not complete until every component has been validated at the new version.</p><h3>The Rollback Question (And Why It&#8217;s the Wrong Question)</h3><p>Teams ask me: what&#8217;s the rollback plan for the control plane?</p><p>There isn&#8217;t one. In-place EKS upgrades cannot be rolled back. Once the control plane is upgraded, it stays upgraded. This is not a limitation to work around - it is the most important thing to communicate to your team before you begin.</p><p>The question to ask instead: what is our strategy for ensuring the upgrade succeeds?</p><p>That strategy is everything in this article. The pre-flight validation. The environment progression. The deliberate ordering. The willingness to stop and investigate rather than push through when something looks wrong.</p><p><em>Add-ons can be rolled back to previous versions even when the cluster version cannot. If a managed add-on update causes issues, roll it back while you investigate. The cluster version is independent.</em></p><p>I&#8217;ve seen teams push through warning signs because they were &#8220;almost done&#8221; and didn&#8217;t want to restart the window. That instinct - understandable, human, wrong - is what turns planned maintenance into incidents.</p><p>Stop when something looks wrong. Investigate. Decide deliberately. The cluster will wait.</p><h3>What Separates Teams That Upgrade Confidently</h3><p>After years of this, I&#8217;ve noticed the same patterns in teams that upgrade without drama - and in teams that dread every upgrade cycle.</p><p><strong>Automation with gates. </strong>Not &#8220;run everything and hope&#8221; - but scripted upgrade sequences with explicit validation checks between each step. The gate is the discipline. Without it, automation just makes mistakes faster.</p><p><strong>Upgrade frequency. </strong>Clusters upgraded every minor version are dramatically easier to manage than clusters that skip versions. The diff is smaller. 
The compatibility surface is narrower. The team stays familiar with the process. EKS supports each minor version for 14 months of standard support, followed by 12 months of extended support - 26 months total if you opt in. A new minor version arrives roughly every four months. That math still produces three or four upgrade cycles per year, but teams on extended support may feel less urgency than they should. Extended support is a cost - both financially and operationally. The longer you defer, the larger the diff, the wider the compatibility surface, and the harder the eventual upgrade. Treat 14 months as your real target window, not 26.</p><p><strong>Pre-production parity. </strong>If your staging cluster doesn&#8217;t resemble production in workload type, node configuration, and add-on versions - your staging upgrade tells you almost nothing about your production upgrade. Parity is the whole point of the progression.</p><p><strong>Living runbooks. </strong>A written runbook for your specific cluster configuration - updated after every upgrade cycle - is worth more than any generic guide, including this one. It&#8217;s what you reach for when something unexpected happens at 11pm. It&#8217;s what turns &#8220;we&#8217;ve done this before&#8221; from a feeling into a fact.</p><h3>Boring Upgrades Are Good Upgrades</h3><p>The opening story - the 40-minute degradation, the midnight Slack message, the forgotten node affinity rule - didn&#8217;t have to happen. It happened because the upgrade process had never been fully documented, because the environment progression was skipped to save time, because the pre-flight checks didn&#8217;t include workload-level configuration review.</p><p>None of those failures were dramatic. None of them were unavoidable. They were the accumulated cost of treating upgrades as events rather than operations.</p><p>The teams I respect most in this space don&#8217;t talk about their EKS upgrades. 
Not because the upgrades are secret - because they&#8217;re unremarkable. Planned maintenance window. Environment progression. Proven runbook. Boring outcome.</p><p><em>That&#8217;s the goal. Not heroic upgrades. Boring ones.</em></p><p>If this helped you think through your next upgrade, share it with whoever runs your cluster. And if you have a war story of your own - a PDB that blocked you, a node affinity rule that bit you, an add-on that fell out of compatibility mid-window - I&#8217;d genuinely like to hear it.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://sidh4u.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading TOPOLOGY! </p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><br>Subscribe for free to receive new posts and support my work - or go paid for early access, deeper dives, and the occasional piece that never goes public.</p><p></p>]]></content:encoded></item><item><title><![CDATA[The .pem File in the Slack Channel]]></title><description><![CDATA[Every team starts the same way &#8212; a .pem file dropped in Slack, quietly forgotten, never revoked. This is the SSH Security Maturity Model: three levels from hardened VPS to signed certificates with short-lived access. Know which level you're at. 
Know why.]]></description><link>https://sidh4u.substack.com/p/ssh-security-maturity-model</link><guid isPermaLink="false">https://sidh4u.substack.com/p/ssh-security-maturity-model</guid><dc:creator><![CDATA[Sidhartha Mandal (sidh4u)]]></dc:creator><pubDate>Thu, 16 Apr 2026 16:39:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ACMB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ACMB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ACMB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png 424w, https://substackcdn.com/image/fetch/$s_!ACMB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png 848w, https://substackcdn.com/image/fetch/$s_!ACMB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png 1272w, https://substackcdn.com/image/fetch/$s_!ACMB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ACMB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png" width="1200" height="619" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:619,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86735,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sidh4u.substack.com/i/194417667?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ACMB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png 424w, https://substackcdn.com/image/fetch/$s_!ACMB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png 848w, https://substackcdn.com/image/fetch/$s_!ACMB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png 1272w, https://substackcdn.com/image/fetch/$s_!ACMB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F097a3929-dc0c-4d91-9118-73276c322dc1_1200x619.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I have seen this exact sequence play out at three different companies.</p><p>A new engineer joins. On their first day, someone drops a .pem file into Slack. &#8220;<em>Here&#8217;s the key to the dev server. Don&#8217;t share it.</em>&#8221; Everyone laughs a little because everyone already has the same key. The engineer downloads it, tucks it into their ~/Downloads folder, and promptly forgets about it.</p><p>Six months later, that engineer moves on. The offboarding checklist says &#8220;revoke access&#8221; - but no one quite knows what that means for a shared key. 
The ticket stays open. The .pem file stays valid. Somewhere on a MacBook that&#8217;s been passed to someone else, it still exists.</p><p>Two years later, a security audit asks: who currently has access to your production instances? The honest answer is: everyone who has ever worked here, and possibly their personal laptops.</p><p><em>SSH security is not a configuration problem. It is an architecture problem. And most teams are solving it at the wrong level.</em></p><p>This post is a maturity model - three layers of SSH security, each appropriate for a different scale and risk profile. You do not have to reach Level 3 to be secure. But you should consciously know which level you are at, and why.</p><h3>First: Know What You Are Protecting</h3><p>Before choosing your approach, be honest about your context. The threat model changes completely depending on where your servers live.</p><p><strong>A VPS with a public IP </strong>- DigitalOcean, Linode, a single EC2 instance with port 22 open to the internet. The attack surface is the server itself. Harden it.</p><p><strong>A VPC with private subnets </strong>- AWS, GCP, Azure. Your instances are not directly reachable from the internet. You need a controlled entry point. This is the bastion host model.</p><p><strong>A VPC at scale with many engineers and many instances </strong>- shared .pem files become operationally unmanageable and auditably indefensible. You need SSH certificates with short-lived, signed access. This is zero-trust SSH.</p><p>Most teams operate at Level 1 regardless of which context they are actually in. Let us fix that.</p><h3>Level 1 - Hardening a Public-Facing Server</h3><p>If you have a VPS or a single EC2 instance with SSH exposed to the internet, start here. The goal is to reduce the attack surface of the server itself, layer by layer.</p><h4>a. Kill Root Login. Create a Named User.</h4><p>There is no legitimate reason for direct root SSH access in 2026. 
Every action taken as root is unattributable and unauditable. Create a named user, give it sudo privileges, then lock root out entirely.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;2447a4b4-4661-4374-9f9e-9a6bc86b45ba&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Create the user
adduser &lt;username&gt;

# Grant sudo privileges
usermod -aG sudo &lt;username&gt;

# Then in /etc/ssh/sshd_config:
PermitRootLogin no
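
# Validate the syntax before restarting - a broken sshd_config can
# lock you out of the server entirely:
sshd -t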

# Restart SSH after every sshd_config change
systemctl restart sshd</code></pre></div><p>The -aG flag is important - &#8216;a&#8217; appends the user to the group rather than replacing their existing group memberships. Without it, you can silently strip a user of other group access.</p><h4>b. Keys Over Passwords - and Choose Your Algorithm</h4><p>Passwords are brute-forceable. SSH keys are not. But the algorithm you choose matters.</p><p><strong>Ed25519 is the right choice today.</strong> It is faster, shorter, and cryptographically stronger than RSA. Unless you have a specific compatibility constraint with legacy systems, use it:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;e30e6cb5-033a-4877-81a2-1cf709dbd536&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ssh-keygen -t ed25519 -C &quot;user@domain.com&quot;</code></pre></div><p>If you are on a system that does not support Ed25519, RSA with 4096 bits is the fallback. Avoid DSA entirely - it is limited to 1024-bit keys by the standard, which is considered broken. Avoid ECDSA unless you fully understand the curve parameters and trust them.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;d0ac87ae-f14c-4e20-8f31-ed672d8abf72&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Fallback only - prefer Ed25519
ssh-keygen -t rsa -b 4096 -C &quot;user@domain.com&quot;

# Copy the public key to the server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@remote-host
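
# Prove key-based login works BEFORE disabling passwords - test from
# a second terminal while keeping your current session open:
ssh -o PasswordAuthentication=no user@remote-host 'echo key auth OK'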

# Then disable password auth in /etc/ssh/sshd_config
PasswordAuthentication no
GSSAPIAuthentication no</code></pre></div><h4>c. File Permissions - SSH Is Strict and Rightfully So</h4><p>SSH will silently refuse to use your keys if permissions are wrong. This catches more people out than it should.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;e293472a-864b-40db-935f-75ac7f7a86b6&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># On your workstation
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_ed25519        # private key - only you
chmod 644 ~/.ssh/id_ed25519.pub    # public key - readable

# On the server
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys</code></pre></div><p>The private key at 600 is non-negotiable. If it is world-readable, SSH refuses to use it. The error message - &#8220;WARNING: UNPROTECTED PRIVATE KEY FILE&#8221; - is SSH doing you a favour.</p><h4>d. Rate-Limit the Knock with fail2ban</h4><p>Port 22 on a public IP will be knocked on constantly - automated scanners, credential-stuffing bots, slow-and-low probes. fail2ban watches the auth logs and bans IPs that exceed a failure threshold. The defaults are too lenient. Tune them:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;toml&quot;,&quot;nodeId&quot;:&quot;91806956-b9d7-4378-bf34-a4c85ab9a10c&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-toml"># /etc/fail2ban/jail.local
[sshd]
enabled = true
banaction = iptables-multiport
maxretry = 3
findtime = 1d
bantime = 4w
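# Optional: never ban your own trusted ranges - a few typos from the
# office should not lock the team out (203.0.113.0/24 is a placeholder)
ignoreip = 127.0.0.1/8 203.0.113.0/24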

# Enable at boot and start the service
systemctl enable fail2ban &amp;&amp; systemctl start fail2ban

# Check who is currently banned
fail2ban-client status sshd</code></pre></div><p>Three failed attempts in a day earns a four-week ban. That stops brute-force and slow-and-low attacks alike. The key insight: an attacker who gets banned on the third attempt will move on. There are easier targets.</p><h4>e. Two-Factor Auth - Because Keys Can Be Stolen Too</h4><p>A compromised laptop means a compromised private key. Adding a second factor means an attacker needs your private key AND your phone. For any server with a public IP, this friction is worth it.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;61eb6231-4e89-415e-b086-3be9c9600d36&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Install the package
apt install libpam-google-authenticator

# Run it
google-authenticator

# Answer: y y y n y to the prompts
# Scan the QR code with your authenticator app

# Add to the TOP of /etc/pam.d/sshd:
auth required pam_google_authenticator.so

# In /etc/ssh/sshd_config:
ChallengeResponseAuthentication yes
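# (on OpenSSH 8.7+, ChallengeResponseAuthentication is named KbdInteractiveAuthentication)
# PAM must be enabled for the authenticator module to run - usually already the default
UsePAM yes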
AuthenticationMethods publickey,keyboard-interactive</code></pre></div><p>After this, login requires your private key first, then the OTP. Both must succeed. An attacker with your key but not your phone goes nowhere.</p><p>The principle underlying all of Level 1 is layered defence. Each measure is independent - a failure in one does not collapse the others. Root login disabled, key-only auth, rate limiting, and 2FA are four separate gates. An attacker has to defeat all of them, not just one.</p><h3>Level 2 - Bastion Host Architecture</h3><p>The day you move your application servers into a VPC with private subnets, the threat model shifts. Your instances should not be reachable from the internet at all - not even on port 22. The bastion host becomes the single controlled entry point into the private network.</p><p style="text-align: center;"><em>[ Internet ] &#8594; [ Bastion - public subnet ] &#8594; [ Private instances - private subnet ]</em></p><p>The bastion itself should be hardened with everything in Level 1. The key architectural decisions beyond that are about how keys flow - or more precisely, about ensuring they don&#8217;t flow to the wrong place.</p><h4>Never Store the Private Key on the Bastion</h4><p>This is the most common Level 2 mistake I see. The bastion is a proxy, not a key store. If it is compromised and your .pem is sitting on it, every instance it can reach is now compromised too.</p><p>Use SSH agent forwarding instead. Your workstation holds the key. The agent handles authentication end-to-end. The private key never leaves your machine/laptop - it is used to sign the authentication challenge locally, and only the signature travels over the wire.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;95f1e631-7aa0-457b-8fc2-6616f41dd605&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># ~/.ssh/config

# Bastion entry point
Host bastion-vpc1
  Hostname &lt;bastion-public-ip&gt;
  User ec2-user
  IdentityFile ~/.ssh/id_ed25519
  ForwardAgent yes

# Route any private IP in VPC-1 through the bastion transparently
Host 10.1.*.*
  User ec2-user
  IdentityFile ~/.ssh/id_ed25519
  ProxyCommand ssh bastion-vpc1 -W %h:%p
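  # (on OpenSSH 7.3+ the same hop can be written more simply as: ProxyJump bastion-vpc1)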

# Sensible defaults - keep this at the END of ~/.ssh/config
Host *
  ServerAliveInterval 30
  ServerAliveCountMax 2
  StrictHostKeyChecking accept-new</code></pre></div><p>With this config, <em><strong>ssh 10.1.1.45</strong></em> routes through the bastion automatically. Engineers type a single command. The hop is invisible to them.</p><p><strong>One note on StrictHostKeyChecking: </strong>the original version of this article used StrictHostKeyChecking no paired with UserKnownHostsFile /dev/null. Do not do this. Those two settings together disable host key verification entirely and discard all verification state - which opens the door to man-in-the-middle attacks. accept-new is the safe default: it automatically trusts new hosts on first connection, but rejects any subsequent change to a known host&#8217;s key. That is the behaviour you actually want.</p><p>The gap Level 2 does not close: the shared key problem. Everyone on the team uses the same key-pair. When someone leaves, you either rotate across every instance or accept the risk that they still technically have access. Most teams accept the risk. Most teams should not.</p><h3>Level 3 - Signed SSH Certificates: Zero-Trust at Scale</h3><p>I want you to think about what trust actually means in the key-based model. You add a public key to authorized_keys on a server. That key now has access to that server indefinitely - until someone manually removes it. There is no expiry. No central revocation. No audit trail of who used it when.</p><p>Now multiply that across fifty engineers and two hundred instances. You have a trust graph that nobody fully understands, that grows with every hire and never fully shrinks with every departure.</p><p><em>SSH certificates solve this by introducing a Certificate Authority. Instead of distributing keys to servers, you issue short-lived signed certificates. Access expires automatically. 
Trust is centralized.</em></p><p>The architecture has three roles: a CA server that holds the signing keys and issues certificates, target servers that trust the CA rather than individual keys, and engineers who present a signed certificate to authenticate.</p><h4>Setting Up the CA</h4><p>On your CA server, generate two signing key-pairs - one for hosts, one for users. Keep these keys offline or in a secure secrets manager. They are the root of trust for your entire fleet.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;61091285-411a-46b4-a685-5f23711ba878&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Host CA - proves the server is legitimate (prevents MITM)
ssh-keygen -t ed25519 -N '' -C HOST_CA -f /etc/ssh/ca/host_ca

# User CA - proves the engineer is legitimate
ssh-keygen -t ed25519 -N '' -C USER_CA -f /etc/ssh/ca/user_ca</code></pre></div><h4>Signing a Host Certificate</h4><p>Each server gets a signed host certificate. This is the part most teams skip - and it is important. Without it, engineers are still vulnerable to MITM attacks on their first connection to a new host. With it, the SSH client can cryptographically verify the server&#8217;s identity without relying on the trust-on-first-use model.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;127aed99-6c10-492d-b23d-bcb63c0d63fe&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Create Host Certificate
ssh-keygen -s /etc/ssh/ca/host_ca \
-I host_server01 \
-h \
-n server01.internal \
-V +52w \
/etc/ssh/ssh_host_ed25519_key.pub
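
# On each engineer's machine, trust the host CA for the whole internal domain
# by adding one line to ~/.ssh/known_hosts (paste the contents of host_ca.pub):
#   @cert-authority *.internal &lt;contents of host_ca.pub&gt;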

# Reference in /etc/ssh/sshd_config on the server:
HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub</code></pre></div><h4>Signing a User Certificate</h4><p>An engineer submits their public key. The CA signs it with a short validity window and an explicit list of allowed usernames - called principals. The certificate expires. The engineer has to come back for a new one.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;3c312d06-8e46-4461-b48d-33525205d90b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Create User Certificate
ssh-keygen -s /etc/ssh/ca/user_ca \
-I &lt;username&gt;_laptop \
-n ec2-user,ubuntu \
-V +16h \
~/.ssh/id_ed25519.pub

# Produces: id_ed25519-cert.pub
# Engineer copies it to ~/.ssh/ - SSH picks it up automatically</code></pre></div><p><strong>Note on validity: </strong>the right TTL for user certificates in a zero-trust model is hours, not weeks. +16h to +24h is the production standard - long enough for a working day, short enough that a stolen certificate has a very narrow window. A 5-week certificate is not short-lived; it is just a key with a distant expiry date.</p><p>On every server, configure sshd to trust the user CA instead of managing authorized_keys:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;bf9e89d8-6d66-4c06-9fea-3c071b8a6640&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># /etc/ssh/sshd_config
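# Reject any certificate whose serial appears in the revocation list (the KRL
# discussed below - path shown here is an example)
RevokedKeys /etc/ssh/revoked_keys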
TrustedUserCAKeys /etc/ssh/ca/user_ca.pub</code></pre></div><p>You can inspect any certificate to verify its principals and expiry:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;f898b0be-ddc6-4e25-b7fb-2ba2a4c64ae9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ssh-keygen -L -f ~/.ssh/id_ed25519-cert.pub
# Output shows:
# Valid: from 2025-05-06T08:00:00 to 2025-05-07T00:00:00
# Principals: ec2-user, ubuntu
# Key ID: &#8220;alice_laptop&#8221;</code></pre></div><h4>What This Changes Operationally</h4><p><strong>Onboarding: </strong>engineer submits public key &#8594; CA signs it &#8594; done. No touching authorized_keys on any server. No Slack messages with .pem files.</p><p><strong>Offboarding: </strong>the certificate expires on its own - within 24 hours if you are using short TTLs. For immediate revocation, add the certificate serial to a Key Revocation List (KRL) and distribute it. One operation, fleet-wide effect.</p><p><strong>Audit: </strong>every certificate carries an identity (the key ID) and a serial number. Your access logs now show not just which IP connected, but which engineer and from which machine.</p><p><strong>Rotation: </strong>when you rotate the CA key, you issue new certificates to all hosts and users. One change, everything re-issues naturally. Compare this to rotating a shared .pem across two hundred instances.</p><h4>The Automation Gap</h4><p>Signing certificates manually works for a team of five. It does not work for a team of fifty. HashiCorp Vault&#8217;s SSH secrets engine handles this natively - engineers request a certificate via Vault, it is signed and returned with a configured TTL, and Vault maintains the full audit log. Engineers never see the CA private key. That is the production-grade implementation of this model, and the right destination for any team with compliance requirements.</p><h3>Choosing Your Level</h3><p><strong>Single VPS with public IP</strong> &#8594; Level 1. Harden the server. Layers of independent controls.</p><p><strong>Small team, VPC, stable headcount</strong> &#8594; Level 2. Bastion host, SSH config, agent forwarding. No keys on the bastion.</p><p><strong>Growing team, frequent joiners and leavers</strong> &#8594; Level 3. Signed certificates with short TTLs. Centralized trust.</p><p><strong>Compliance requirements (SOC2, ISO 27001)</strong> &#8594; Level 3, non-negotiable. 
The audit trail is the requirement.</p><p><strong>Multi-cloud with many instances</strong> &#8594; Level 3. Shared keys do not scale operationally or auditably.</p><p>The levels are cumulative, not alternatives. Level 3 still uses a hardened bastion. Level 2 still applies the server hardening from Level 1. Each layer builds on the one below.</p><h3>The .pem File Is Still Out There</h3><p>The Slack message from the opening of this post - &#8220;Here&#8217;s the key, don&#8217;t share it&#8221; - is not a security failure by any individual. It is a systems failure. When your process makes the insecure thing the easy thing, the insecure thing is what happens.</p><p>SSH certificates make the secure thing the easy thing. Onboarding is a single signing operation. Offboarding is automatic. The audit trail is built in. The blast radius of a compromised credential is bounded by its TTL.</p><p>Most teams wait for a security incident, an audit finding, or an ex-employee&#8217;s name appearing in an access log before they rethink their SSH model. The better time is before any of those things happen.</p><p><em>The .pem file in the Slack channel is not a starting point. It is a liability with a countdown.</em></p><p>If this helped you think through where your team sits on this model, share it with whoever owns your security posture. And if you have already made the journey to signed certificates - I&#8217;d genuinely like to hear what your team&#8217;s implementation looks like.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://sidh4u.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading TOPOLOGY! 
</p></div></div></div><p><br>Subscribe for free to receive new posts and support my work - or go paid for early access, deeper dives, and the occasional piece that never goes public.</p><p></p>]]></content:encoded></item></channel></rss>