# Surviving a 100% CPU Database Meltdown in Open WebUI - Fixing a Hidden Full Table Scan

2026-07-06 4 min read

We host a beta version of Open WebUI, an open-source tool, at enterprise scale. We have more than 1,000 daily users and 500+ concurrent users at peak times.

SRE

# Automating Kubernetes Observability: Scaling Your Metrics with Dynamic Discovery

2026-06-15 3 min read

Let’s say you have a Kubernetes cluster with Prometheus and multiple workloads running on it. You want to monitor the health of both the cluster and the workloads.

SRE, Infrastructure, Observability

# Anatomy of a 3-Hour Outage: How a Single Redis Config DDoS’d Our Own Production

2026-06-05 4 min read

Picture this: I’m sitting in a room packed with the infrastructure team, the vendor, and our developers. Tension is high. We had just gone through a platform bridge change that caused IPs to cycle.…

SRE, Infrastructure

# Why AI Doesn't Make Software Delivery Faster: The Order-Taker Trap in Non-Tech Companies

2026-05-22 3 min read

There is a stark difference between working at a company that builds technology and a company that merely uses it. In organizations where technology is not the core business, IT and engineering…

Tech Debt, Engineering Culture, AI

# Escaping the Dashboard Trap: Reverse-Engineering Superset to Build a Golden Path

2026-05-20 4 min read

Once upon a time, my engineering team was stuck in a vicious cycle.

Software Architecture, Platform Engineering

# The Hidden Cost of "Cheap" Architecture: Why an OutSystems Hack Caused a Week-Long Outage

2026-05-20 5 min read

I recently dealt with an incident that took a development environment down for an entire week.

Tech Debt, Engineering Culture
