Black Friday last year. 9:47am. Traffic starts climbing. By 10:15am it's 12x our normal baseline. At 10:31am my phone buzzes: "P1 Alert: database connection pool exhaustion detected. Threshold exceeded for 45 seconds."

By 10:35am we'd identified the cause, pushed a configuration change, and the metrics started recovering. Total user impact: elevated latency for about 600 users over four minutes. Nobody got an error page. Support got zero tickets about it.

Here's the setup that made that possible.

The Stack

We run Prometheus for metrics collection, Grafana for dashboards and alerting, Loki for log aggregation, and PagerDuty for on-call escalation. All of this runs in our Kubernetes cluster on AWS EKS. For someone new to this stack, expect roughly two days to get the basics working and another week to get the alerts tuned properly.
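Wiring the pieces together is mostly provisioning config. As one example, a Grafana contact point that pages PagerDuty can be provisioned from a file; here's a minimal sketch, where the integration key is a placeholder and the exact schema depends on your Grafana version, so check the provisioning docs rather than copying this verbatim:

    # provisioning/alerting/pagerduty.yaml (sketch, not our actual config)
    apiVersion: 1
    contactPoints:
      - orgId: 1
        name: pagerduty-oncall
        receivers:
          - uid: pd-oncall                                  # any stable uid
            type: pagerduty
            settings:
              integrationKey: "<pagerduty-events-v2-key>"   # placeholder
              severity: critical

Alert rules then reach this contact point through a notification policy, which can be provisioned from the same directory.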

The Alerts That Actually Matter

After two years of tuning, these are the alerts we actually care about (a sketch of how they look as alerting rules follows the list):

  • Database connection pool utilisation > 70% for 60 seconds — This caught the Black Friday issue. By the time you hit 100%, the database is rejecting connections and users are seeing errors. 70% is the warning point where you have time to act.
  • p99 API response time > 2 seconds for 90 seconds — Not p50 or p95. p99 catches the tail latency issues that affect your worst-case users and often indicate something deeper than just a slow query.
  • Pod OOMKilled events in the last 5 minutes — Kubernetes kills containers that exceed their memory limit, and it does so silently unless you're watching for it. These often precede more visible failures.
  • Node disk usage > 80% — Log storage on Kubernetes nodes fills up faster than you'd expect. Full disks cause some of the strangest, hardest-to-diagnose production failures.
  • Certificate expiry < 14 days — I've seen production outages from expired TLS certs. This alert costs nothing and prevents one of the most embarrassing failure modes.
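Whether you define these as Prometheus rule files or as Grafana-managed alert rules on a Prometheus data source, the expressions are the same PromQL. Here's a sketch in rule-file form; the connection pool and request-duration metric names are placeholders for whatever your application exports, while the certificate metric is the standard blackbox_exporter one:

    groups:
      - name: core-alerts
        rules:
          - alert: DbConnectionPoolHigh
            # app_db_pool_* are hypothetical names; use your app's own pool gauges
            expr: app_db_pool_active_connections / app_db_pool_max_connections > 0.70
            for: 60s
            labels:
              severity: page

          - alert: ApiP99LatencyHigh
            # assumes a standard request-duration histogram
            expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
            for: 90s
            labels:
              severity: page

          - alert: CertificateExpiringSoon
            # probe_ssl_earliest_cert_expiry comes from blackbox_exporter HTTPS probes
            expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
            for: 1h
            labels:
              severity: page

The OOMKilled and disk alerts follow the same pattern using kube-state-metrics (kube_pod_container_status_last_terminated_reason) and node_exporter (node_filesystem_avail_bytes and node_filesystem_size_bytes).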

The Dashboard That Runs During Incidents

We have a single "war room" Grafana dashboard that shows the six metrics our on-call engineer needs first during any incident: request rate, error rate (5xx percentage), p99 latency, database connection pool, active pod count, and CPU utilisation across nodes. This dashboard is pinned in our #ops Slack channel.
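The panels themselves are one-line PromQL queries. One option is to precompute them as recording rules so the dashboard stays fast even when Prometheus is under load during an incident; this is a sketch with assumed metric names (the pool gauges are whatever your application exports, the rest are standard kube-state-metrics and node_exporter series):

    groups:
      - name: warroom
        rules:
          - record: warroom:request_rate:rate5m
            expr: sum(rate(http_requests_total[5m]))
          - record: warroom:error_ratio:rate5m
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
          - record: warroom:latency_p99:rate5m
            expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          - record: warroom:db_pool_utilisation
            expr: app_db_pool_active_connections / app_db_pool_max_connections
          - record: warroom:active_pods
            expr: count(kube_pod_status_phase{phase="Running"} == 1)
          - record: warroom:node_cpu_utilisation:rate5m
            expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))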

Having a pre-built incident dashboard sounds obvious. Before we built it, every incident started with five minutes of figuring out which dashboard to look at. Those five minutes matter when you're trying to resolve a P1.

What Made the Black Friday Incident Quick to Resolve

The alert fired before user impact reached critical levels. The war room dashboard immediately showed which metric was the problem. We'd seen connection pool exhaustion before (in staging, during load tests) and had a runbook for it. The runbook said: increase max_connections in the application configuration and restart the affected pods. We did that. The metrics recovered.
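Concretely, a fix like that usually amounts to a config change plus a rolling restart, assuming the database itself has headroom for the extra connections. A sketch, where the ConfigMap name, namespace, key, and value are all placeholders rather than our actual config:

    # Placeholder names: the pool size lives in a ConfigMap the API pods read as env vars
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: api-config
      namespace: production
    data:
      DB_POOL_MAX_CONNECTIONS: "200"   # placeholder value, raised from the previous limit

    # Pods only pick up the new value on restart:
    #   kubectl -n production rollout restart deployment/api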

Good monitoring doesn't prevent incidents. It makes incidents shorter, less severe, and less stressful. That's the real value.