Ship It Weekly - DevOps, SRE, Platform and Cloud Engineering News

McKinsey AI Flaw, Kafka Goes Diskless, Google Buys Wiz, AWS Copilot Ends, and AI Gateway on Kubernetes

#27

Yesterday at 9:18 PM

This week on Ship It Weekly, Brian looks at what happens when new interfaces create old responsibilities.

McKinsey patched a vulnerability in its internal AI tool Lilli, Kafka contributors are pushing a diskless-topics model that rethinks durability and replication in cloud environments, and Google officially closed its acquisition of Wiz in one of the biggest cloud-security moves to date. Plus: AWS is sunsetting Copilot CLI, and Kubernetes has launched an AI Gateway Working Group.

Links

McKinsey statement on Lilli

https://www.mckinsey.com/about-us/media/statement-on-strengthening-safeguards-within-the-lilli-tool

Kafka diskless topics proposal

https://cwiki.apache...

Meta Buys Moltbook, Block AI Layoffs Get Messier, Atlassian Cuts Jobs, and GitHub Explains the Outages

#26

03/13/2026

This week on Ship It Weekly, Brian covers five “AI meets reality” stories that every DevOps, SRE, security, and platform team can learn from.

Block’s AI layoff story is getting messier as follow-up reporting pushes back on the original framing, Meta bought Moltbook and brought more attention to the trust and security problems already showing up around AI-agent platforms, and Atlassian cut about 10% of its workforce while saying AI is changing the skills and roles it needs. Plus: GitHub gives one of the more honest outage breakdowns we’ve seen lately, Anthropic and Mozilla show a more gro...

Ship It Conversations: Yvonne Young on Linux Foundations, Mentorship, and Getting Job Ready in Cloud

#25

03/09/2026

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

In this Ship It: Conversations episode I talk with Yvonne Young, a cloud and Linux mentor active in the CloudWhistler community. We talk about the real path into cloud and DevOps, why Linux still matters as a foundation, what “job ready” actually means, and why focus, consistency, and business thinking matter more than chasing every new tool.

Highlights

• Linux fundamentals still matter because so much of cloud and infra work sits on top of Linux

• What “job ready” really means: p...

AWS Bahrain/UAE Data Center Issues Amid Iran Strikes, ArgoCD vs Flux GitOps Failures, GitHub Actions Hackerbot-Claw Attacks (Trivy), RoguePilot Codespaces Prompt Injection, Block “AI Remake” Layoffs, Claude Code Security

#24

03/07/2026

This week on Ship It Weekly, Brian looks at how the boundary of ops keeps expanding.

We cover AWS flagging issues in Bahrain/UAE amid Iran strikes, ArgoCD vs Flux and why ArgoCD can get stuck in failed sync states, GitHub Actions being exploited at scale (plus Trivy’s incident), RoguePilot prompt injection meeting real credentials in Codespaces, Block’s “AI remake” layoffs, and Anthropic’s Claude Code Security for defenders.

Lightning round: DeepSeek model access geopolitics, Vercel’s agentic security boundaries, a KEV CVE to patch, an MCP-atlassian SSRF-to-RCE chain, and Claude Cowork scheduled tasks.

Cloudflare BYOIP BGP Withdrawals, Clerk’s Postgres Query-Plan Flip Outage, and AWS Kiro Permissions Lessons (Grafana Privesc + runc CVEs)

#23

02/27/2026

Ship It Conversations: Mike Lady on Day Two Readiness + Guardrails in the AI Era

#22

02/24/2026

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

In this Ship It: Conversations episode I talk with Mike Lady (Senior DevOps Engineer, distributed systems) from Enterprise Vibe Code on YouTube. We talk about day two readiness, guardrails/quality gates, and why shipping safely matters even more now that AI can generate code fast.

Highlights

• Day 0 vs Day 1 vs Day 2 (launching vs operating and evolving safely)

• What teams look like without guardrails (“hope is not a strategy”)

• Why guardrails speed you up long-term (less firefighting, more predictable delivery)

• Day...

Ship It Weekly – DevOps and SRE News for Engineers Who Run Production

02/22/2026

Ship It Weekly is a DevOps and SRE news podcast for engineers who run real systems.

Every week I break down what actually matters in cloud, Kubernetes, CI/CD, infrastructure as code, and production reliability. No hype. No vendor spin. Just practical analysis from someone who’s been on call and shipped systems at scale.

This isn’t a tutorial show. It’s a signal filter.

I cover major industry shifts, security incidents, cloud provider changes, and tooling updates, then explain what they mean for platform teams and engineers operating in production.

If y...

GitHub Agentic Workflows, Gentoo Leaves GitHub, Argo CD 3.3 Upgrade Gotcha, AWS Config Scope Creep

#21

02/20/2026

This week on Ship It Weekly, Brian hits five stories where the “defaults” are shifting under ops teams.

GitHub is bringing Agentic Workflows into Actions, Gentoo is migrating off GitHub to Codeberg, Argo CD upgrades are forcing Server-Side Apply in some paths, AWS Config quietly expanded coverage again, and EC2 nested virtualization is now possible on virtual instances.

Links

YouTube episodes https://www.youtube.com/watch?v=tuuLlo2rbI0&list=PLYLi5KINFnO7dVMbhsJQTKRFXfSSwPmuL&pp=sAgC

OnCallBrief https://oncallbrief.com

Teller’s Tech Substack https://tellerstech.substack.com/

GitHub...

Special: OpenClaw Security Timeline and Fallout: CVE-2026-25253 One-Click Token Leak, Malicious ClawHub Skills, Exposed Agent Control Panels, and Why Local AI Agents Are a New DevOps/SRE Control Plane (OpenAI Hires Founder)

#20

02/17/2026

In this Ship It Weekly special, Brian breaks down the OpenClaw situation and why it’s bigger than “another CVE.”

OpenClaw is a preview of what platform teams are about to deal with: autonomous agents running locally, wired into real tools, real APIs, and real credentials. When the trust model breaks, it’s not just data exposure. It’s an operator compromise.

We walk through the recent timeline: mass internet exposure of OpenClaw control panels, CVE-2026-25253 (a one-click token leak that can turn your browser into the bridge to your local gateway), a skills marketplac...

When guardrails break prod: GitHub “Too Many Requests” from legacy defenses, Kubernetes nodes/proxy GET RCE, HCP Vault resilience in an AWS regional outage, and PCI DSS scope creep

#19

02/13/2026

This week on Ship It Weekly, Brian hits four stories where the guardrails become the incident.

GitHub had “Too Many Requests” errors caused by legacy abuse protections that outlived their usefulness. Takeaway: controls need owners, visibility, and a retirement plan.

Kubernetes has a nasty edge case where nodes/proxy GET can turn into command execution via WebSocket behavior. If you’ve ever handed out “telemetry” RBAC broadly, go audit it.

HashiCorp shared how HCP Vault handled a real AWS regional disruption: the control plane wobbled, but Dedicated data planes kept serving. Control plane vs data plane separation...

Azure VM Control Plane Outage, GitHub Agent HQ (Claude + Codex), Claude Opus 4.6, Gemini CLI, MCP

#18

02/06/2026

This week on Ship It Weekly, Brian hits four “control plane + trust boundary” stories where the glue layer becomes the incident.

Azure had a platform incident that impacted VM management operations across multiple regions. Your app can be up, but ops is degraded.

GitHub is pushing Agent HQ (Claude + Codex in the repo/CI flow), and Actions added a case() function so workflow logic is less brittle.

MCP is becoming platform plumbing: Miro launched an MCP server and Kong launched an MCP Registry.

Links

Azure status incident (VM service mana...

CodeBreach in AWS CodeBuild, Bazel TLS Certificate Expiry Breaks Builds, Helm Charts Reliability Audit, and New n8n Sandbox Escape RCE

#17

01/30/2026

This week on Ship It Weekly, Brian looks at four “glue failures” that can turn into real outages and real security risk.

We start with CodeBreach: AWS disclosed a CodeBuild webhook filter misconfig in a small set of AWS-managed repos. The takeaway is simple: CI trigger logic is part of your security boundary now.

Next is the Bazel TLS cert expiry incident. Cert failures are a binary cliff, and “auto renew” is only one link in the chain.

Third is Helm chart reliability. Prequel reviewed 105 charts and found a lot of demo-friendly defaults that don...

Ship It Conversations: AI Automation for SMBs: What to Automate (And What Not To) (with Austin Reed)

#16

01/27/2026

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

In this Ship It: Conversations episode I talk with Austin Reed from horizon.dev about AI and automation for small and mid-sized businesses, and what actually works once you leave the demo world.

We get into the most common automation wins he sees (sales and customer service), why a lot of projects fail because of poor communication and unclear specs more than the tech, and the trap of thinking “AI makes it cheap.” Austin shares how they push teams toward quic...

curl Shuts Down Bug Bounties Due to AI Slop, AWS RDS Blue/Green Cuts Switchover Downtime to ~5 Seconds, and Amazon ECR Adds Cross-Repository Layer Sharing

#15

01/24/2026

This week on Ship It Weekly, Brian looks at three different versions of the same problem: systems are getting faster, but human attention is still the bottleneck.

We start with curl shutting down their bug bounty program after getting flooded with low-quality “AI slop” reports. It’s not a “security vs maintainers” story; it’s an incentives and signal-to-noise story. When the cost to generate reports goes to zero, you basically DoS the people doing triage.

Next, AWS improved RDS Blue/Green Deployments to cut writer switchover downtime to typically ~5 seconds or less (single-region). That’s a big deal...

n8n Auth RCE (CVE-2026-21877), GitHub Artifact Permissions, and AWS DevOps Agent Lessons

#14

01/16/2026

This week on Ship It Weekly, the theme is simple: the automation layer has become a control plane, and that changes how you should think about risk.

We start with n8n’s latest critical vulnerability, CVE-2026-21877. This one is different from the unauth “Ni8mare” issue we covered in Episode 12. It’s authenticated RCE, which means the real question isn’t only “is it internet exposed,” it’s who can log in, who can create or modify workflows, and what those workflows can reach. Takeaway: treat workflow automation tools like CI systems. They run code, they hold creden...

Ship It Conversations: Human-in-the-Loop Fixer Bots and AI Guardrails in CI/CD (with Gracious James)

#13

01/12/2026

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

In this Ship It: Conversations episode I talk with Gracious James Eluvathingal about TARS, his “human-in-the-loop” fixer bot wired into CI/CD.

We get into why he built it in the first place, how he stitches together n8n, GitHub, SSH, and guardrailed commands, and what it actually looks like when an AI agent helps with incident response without being allowed to nuke prod. We also dig into rollback phases, where humans stay in the loop, and why validating ever...

n8n Critical CVE (CVE-2026-21858), AWS GPU Capacity Blocks Price Hike, Netflix Temporal

#12

01/09/2026

This week on Ship It Weekly, Brian’s theme is basically: the “automation layer” is not a side tool anymore. It’s part of your perimeter, part of your reliability story, and sometimes part of your budget problem too.

We start with the n8n security issue. A lot of teams use n8n as glue for ops workflows, which means it tends to collect credentials and touch real systems. When something like this drops, the right move is to treat it like production-adjacent infra: patch fast, restrict exposure, and assume anything stored in the tool is high val...

Ship It Conversations: Backstage vs Internal IDPs, and Why DevEx Muscle Matters (with Danny Teller)

#11

01/06/2026

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

I sat down with Danny Teller, a DevOps Architect and Tech Lead Manager at Tipalti, to talk about internal developer platforms and the reality behind “just set up a developer portal.” We get into Backstage versus internal IDPs, why adoption is the real battle, and why platform/DevEx maturity matters more than whatever tool you pick.

What we covered

Backstage vs internal IDPs

Backstage is a solid starting point for a developer portal, but it doesn’t magica...

Fail Small, IaC Control Planes, and Automated RCA

#10

01/03/2026

This week on Ship It Weekly, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.

We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.

Next is Pulumi’s push to be the control plane for all your IaC...

Ship It Conversations: From Full-Stack to Cloud/DevOps, One Project at a Time (with Eric Paatey)

#9

12/30/2025

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

I sat down with Eric Paatey, a Cloud & DevOps Engineer who’s been transitioning from full-stack web development into cloud/devops, and building real skills through hands-on projects instead of just collecting tools and buzzwords.

We talk about what that transition actually feels like, what’s helped most, and why you don’t need a rack of servers to learn DevOps.

What we covered

Eric’s path into DevOps

How he moved from building web apps to caring a...

Cloudflare’s Workers Scheduler, AWS DBs on Vercel, and JIT Admin Access

#8

12/27/2025

This week on Ship It Weekly, Brian looks at real platform engineering in the wild.

We start with Cloudflare’s write-up on building an internal maintenance scheduler on Workers. It’s not marketing fluff. It’s “we hit memory limits, changed the model, and stopped pulling giant datasets into the runtime.”

Next up: AWS databases are now available inside the Vercel Marketplace. This is a quiet shift with loud consequences. Devs can spin up real AWS databases with a click from the same place they deploy apps, and platform teams still own the guardrails: account sprawl, billing/tagging, audit trails, re...

Ship It Conversations: The WHY Behind DevOps, Upskilling, and Agentic AI (with Maz Islam)

#7

12/21/2025

This is a Ship It Weekly conversation episode. The weekly news recaps are still weekly. These interviews drop in between when I find someone worth talking to and the convo feels useful.

In this episode I’m joined by Mazharul “Maz” Islam (DevOps with Maz). Maz is a UK-based DevOps Engineer who shares practical, real-world DevOps content on YouTube and LinkedIn. We talk about the stuff that actually matters when you’re building systems, running infrastructure, owning reliability, and living in on-call.

We hit three big things: the importance of understanding the WHY behind DevOps (not just...

GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI

#6

12/20/2025

This week on Ship It Weekly, Brian looks at how the “platform tax” is showing up everywhere: pricing model shifts, CI dependencies, and new security boundaries thanks to AI agents.

We start with GitHub Actions. GitHub announced a new “cloud platform” charge for self-hosted runners in private/internal repos… then hit pause after backlash. Hosted runner price reductions for 2026 are still planned. We also got the perfect timing joke: a GitHub incident the same week.

Next up is HashiCorp. Legacy HCP Terraform (Terraform Cloud) Free is reaching end-of-life in 2026, with orgs moving to the newer Free tier...

IBM Buys Confluent, React2Shell, and Netflix on Aurora

#5

12/12/2025

In this episode of Ship It Weekly, Brian powers through a cold and digs into a very “infra grown-up” week in DevOps.

First up, IBM is buying Confluent for $11B. We talk about what that means if you’re on Confluent Cloud today, still running your own Kafka, or trying to choose between Confluent, MSK, and DIY. It’s part of a bigger pattern after IBM’s HashiCorp deal, and it has real implications for vendor concentration and “plan B” strategies.

Then we shift to React2Shell, a CVSS 10.0 RCE in React Server Components that’s already being ex...

AWS re:Invent for Platform Teams, GKE at 130k Nodes, and Killing Staging

#4

12/04/2025

In this episode of Ship It Weekly, Brian looks at re:Invent through a platform/SRE lens and pulls out the updates that actually change how you design and run systems.

We talk about regional NAT Gateways and Route 53 Global Resolver on the networking side, ECS Express Mode and EKS Capabilities as new paved roads for app teams, S3 Vectors GA and 50 TB S3 objects for AI and data lakes, Aurora PostgreSQL dynamic data masking, CodeCommit’s return to full GA, and IAM Policy Autopilot for AI-assisted IAM policies. This was recorded mid–re:Invent, so consider it a...

Kubernetes Config Reality Check, EKS Control Planes, and GitHub Guardrails

#3

11/26/2025

In this episode of Ship It Weekly, Brian digs into what’s new for people actually running infra: Kubernetes config, EKS control planes and networking, and GitHub’s latest CI/CD and Copilot updates.

We start with Kubernetes’ new configuration good practices post and how to turn it into a checklist to clean up Helm/Kustomize and kill off “hotfix from my laptop” manifests.

Then we hit AWS: EKS Provisioned Control Plane to size control plane capacity for big or noisy clusters, plus new network observability so you can see who’s talking to what across clust...

Kubernetes Shake-ups, Platform Reality, and AI-Native SRE

#2

11/21/2025

In this episode of Ship It Weekly, Brian digs into three big themes for anyone running Kubernetes or building internal platforms.

First, Kubernetes is officially retiring Ingress NGINX and moving it into best-effort maintenance until March 2026. We talk about what that actually means if you’re still using it and how to think about choosing and rolling out a replacement ingress.

Second, we look at how CNCF is defining platform engineering and what “platform as a product” looks like in practice, plus some hard-earned lessons from running Kubernetes in production.

Third, we talk about AI as...

Special: When the Cloud Has a Bad Day: Cloudflare, AWS us-east-1 & GitHub Outages

#1

11/20/2025

In this special kickoff episode of Ship It Weekly, Brian walks through three major outages from the last few weeks and what they actually mean for DevOps, SRE, and platform teams.

Instead of just reading status pages, we look at how each incident exposes assumptions in our own architectures and runbooks.

Topics in this episode:

• Cloudflare’s global outage and what happens when your CDN/WAF becomes a single point of failure

• The AWS us-east-1 incident and why “multi-AZ in one region” isn’t a full disaster recovery strategy

• GitHub’s Git o...