I spent three weeks building my first "real" machine learning model. A classifier for customer churn: take customer activity data, predict whether they'd cancel their subscription in the next 30 days. I tried five different algorithms, tuned hyperparameters, optimised the feature set. When I got the accuracy to 99.2%, I felt like I'd cracked something.

Then I deployed it to score our actual customer base. It predicted that virtually nobody would churn. Over the next month, our actual churn rate was completely normal. The model was useless.

What Went Wrong: Part One (Class Imbalance)

Our churn rate was about 3%. In a dataset with 3% positive examples, a model that predicts "not churned" for literally every row will achieve 97% accuracy. My 99.2% model was barely better than this trivially bad baseline.

Accuracy is the wrong metric for imbalanced classification problems. The useful metrics are precision, recall, and AUC-ROC: precision and recall measure how well you find the minority class, which is the thing you actually care about, and AUC-ROC isn't inflated just because the majority class is easy to get right. I was optimising for the wrong number entirely.
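
To see how hollow that accuracy number is, here's a quick sketch on synthetic labels with a 3% positive rate (nothing from our real data): a classifier that flags nobody scores about 97% accuracy while catching exactly zero churners.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Synthetic labels at a ~3% churn rate, mirroring our class balance.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.03).astype(int)

# The trivially bad baseline: predict that nobody churns.
y_pred = np.zeros_like(y_true)

print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.97 -- looks impressive
print(f"recall:   {recall_score(y_true, y_pred):.3f}")    # 0.0  -- catches no churners at all
print(f"AUC-ROC:  {roc_auc_score(y_true, y_pred):.3f}")   # 0.5  -- no better than random
```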

What Went Wrong: Part Two (Data Leakage)

After fixing the class imbalance problem (oversampling the minority class, switching to AUC-ROC as my optimisation metric), my model still performed unusually well. Too well. Which is usually a sign something is wrong.
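
For the curious, here's roughly the shape of that fix. The data, column-free feature matrix, and model below are stand-ins rather than our actual pipeline; the detail that matters is that you oversample only the training split, so the test set keeps its real 3% base rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Stand-in data: ~3% positive class, like our churn labels.
X, y = make_classification(n_samples=20_000, weights=[0.97], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority class on the *training* split only;
# the test set keeps its natural imbalance.
X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
X_maj, y_maj = X_train[y_train == 0], y_train[y_train == 0]
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=42
)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])

model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Evaluate with AUC-ROC on the untouched, imbalanced test set.
print("AUC-ROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```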

After a code review with a senior data scientist, we found data leakage: I had inadvertently included a feature derived from future information. Specifically, I was using each customer's support ticket count over their entire history, including tickets filed after the date I was predicting at. The model had learned that customers who churn file more support tickets around and after the churn itself. Of course it could predict churn perfectly: it was cheating.
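
The repaired feature looked something like this in spirit: count only the tickets filed before the date you're predicting at. The table and column names here (tickets, created_at, prediction_date) are placeholders, not our schema.

```python
import pandas as pd

# Toy data: one row per support ticket, plus the date we're predicting at.
tickets = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "created_at": pd.to_datetime(
        ["2023-01-05", "2023-02-20", "2023-04-02", "2023-01-15", "2023-03-30"]
    ),
})
cutoffs = pd.DataFrame({
    "customer_id": [1, 2],
    "prediction_date": pd.to_datetime(["2023-03-01", "2023-03-01"]),
})

# Leaky version: ticket count over the customer's entire history.
leaky = tickets.groupby("customer_id").size().rename("tickets_all_time")

# Point-in-time version: only tickets filed before the prediction date.
merged = tickets.merge(cutoffs, on="customer_id")
clean = (
    merged[merged["created_at"] < merged["prediction_date"]]
    .groupby("customer_id")
    .size()
    .rename("tickets_before_cutoff")
)

print(pd.concat([leaky, clean], axis=1))
```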

Data leakage is one of the most common errors in ML projects and one of the hardest to catch, because the model gives you exactly what you want (impressive performance numbers) and you don't find out they're wrong until reality proves otherwise.

The Lessons, Stated Plainly

First: for imbalanced classification, forget accuracy. Use precision, recall, F1, and AUC-ROC. Understanding what these metrics measure and when to use which is foundational, not advanced.
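
If a concrete example helps, here's a tiny sketch of what precision, recall, and F1 actually tell you on a 3%-churn population. The numbers are invented purely for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented numbers: 1,000 customers, 30 of whom actually churn (3%).
# Suppose the model flags 50 customers, and 20 of those really do churn.
y_true = [1] * 30 + [0] * 970
y_pred = [1] * 20 + [0] * 10 + [1] * 30 + [0] * 940

print("accuracy: ", accuracy_score(y_true, y_pred))   # 0.96 -- still looks great
print("precision:", precision_score(y_true, y_pred))  # 20/50 = 0.40: of flagged customers, 40% churn
print("recall:   ", recall_score(y_true, y_pred))     # 20/30 ≈ 0.67: we catch two thirds of churners
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall = 0.50
# AUC-ROC needs predicted probabilities rather than hard labels, so it isn't shown here.
```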

Second: always think carefully about temporal integrity in your features. If you're predicting something at time T, your features should only use information available at time T. Anything that "knows the future" will leak into your model and produce falsely optimistic metrics.
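
One cheap habit that would have caught my leak months earlier: a guard in the feature pipeline that refuses to proceed if any input event post-dates its prediction date. This is a minimal sketch with a hypothetical helper and placeholder column names (customer_id, event_time, prediction_date).

```python
import pandas as pd

def assert_no_future_events(events: pd.DataFrame, cutoffs: pd.DataFrame) -> None:
    """Fail loudly if any event used to build features happened after the prediction date."""
    merged = events.merge(cutoffs, on="customer_id", how="left")
    leaked = merged[merged["event_time"] > merged["prediction_date"]]
    if not leaked.empty:
        raise ValueError(
            f"{len(leaked)} feature events post-date their prediction date; "
            "training on them would leak future information."
        )

# Toy usage: this raises, because the second event was filed after the cutoff.
events = pd.DataFrame({
    "customer_id": [1, 1],
    "event_time": pd.to_datetime(["2023-02-01", "2023-04-01"]),
})
cutoffs = pd.DataFrame({
    "customer_id": [1],
    "prediction_date": pd.to_datetime(["2023-03-01"]),
})
assert_no_future_events(events, cutoffs)
```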

Third: be suspicious of excellent results. A model that performs dramatically better than the domain baseline is usually either finding a real signal (rare) or exploiting a problem in your setup (common). Interrogate good results with the same rigour as bad ones.

The Model We Actually Shipped

Three months after the first failure, we shipped a churn model with an AUC of 0.74. Not 99% accuracy. Not even close to that. But it correctly identified churners at a rate that was meaningfully better than random, allowed us to target retention outreach, and reduced churn by 8% in the first quarter. That's a useful model. The 99% accuracy one was a fiction.