The decision seemed straightforward. We were on a managed VPS provider paying about $1,200/month. Moving to AWS would give us better scaling and more services, and would probably save money once we optimised. "It's just cloud infrastructure," I told our investors. "We'll be done in six weeks."
It took six months. We had seven unplanned downtime incidents over that period. One incident resulted in read-only database access for four hours during our peak usage time. I did not give another infrastructure update in an investor call for the rest of the year.
What We Got Wrong Before We Started
We treated the migration as a technical project rather than a product risk. We scheduled it during our busiest quarter because "we needed the new infrastructure for the upcoming scale." We didn't run any production load on the new environment before cutting over. We had a rollback plan on paper that we'd never practiced.
Every one of these decisions is something experienced engineers would have caught in a planning review. We didn't have that review.
The Database Incident
We migrated our PostgreSQL database from a managed provider to RDS. The migration itself was clean. What we missed: our application had hardcoded connection pool settings tuned for the old server's connection limits. RDS has different defaults, and under load, we exhausted the connection pool. The application began failing with connection timeout errors, which our monitoring logged — but we hadn't set up alerts on that specific error type yet.
It took two hours to diagnose because the errors looked superficially like a different database issue we'd seen before. By the time we found and fixed the connection pool configuration, we'd had four hours of degraded service.
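To make the failure mode concrete, here's a minimal sketch of sizing a connection pool against the target database's limits rather than the old server's. The SQLAlchemy engine and the specific numbers are illustrative assumptions, not our production configuration:

```python
import os

from sqlalchemy import create_engine

# The connection ceiling belongs to the *target* environment, not the old
# server. On RDS it depends on instance class and parameter group, so look
# it up rather than carrying over the old provider's number.
DB_MAX_CONNECTIONS = int(os.environ.get("DB_MAX_CONNECTIONS", "100"))
APP_INSTANCES = int(os.environ.get("APP_INSTANCES", "4"))

# Leave headroom for migrations, admin sessions, and monitoring tools.
per_instance_pool = max(1, (DB_MAX_CONNECTIONS - 10) // APP_INSTANCES)

engine = create_engine(
    os.environ["DATABASE_URL"],
    pool_size=per_instance_pool,  # steady-state connections per app instance
    max_overflow=0,               # don't burst past the budget under load
    pool_timeout=5,               # fail fast with a clear error instead of hanging
    pool_pre_ping=True,           # drop stale connections after failovers
)
```

The exact library doesn't matter. What matters is that the ceiling comes from the environment you're moving to, and that the pool fails fast and loudly when it's exhausted instead of quietly timing out.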
What Should Have Happened
The practices that would have prevented most of our incidents are not exotic. They're standard things experienced infrastructure engineers do:
- Parallel running: Stand up the new environment and run it alongside the old one before cutting over. This surfaces configuration issues without user impact.
- Load testing before cutover: The connection pool issue would have appeared immediately under synthetic load (see the load-test sketch after this list).
- Practiced rollback: Having a documented, tested rollback procedure means you can abort quickly instead of scrambling to improvise.
- Feature flags or gradual traffic shifting: Start with 5% of traffic on the new environment, so problems affect 5% of users instead of 100% (see the traffic-split sketch below).
- Alerting before you need it: Set up monitoring for the new environment before cutting over, not after the first incident (see the alarm sketch below).
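To make the load-testing point concrete: a small script for a tool like Locust would have exposed the pool exhaustion well before cutover. Locust is just one option, and the endpoints below are placeholders, not our actual routes:

```python
from locust import HttpUser, task, between

class MigrationLoadTest(HttpUser):
    # Simulated think time between requests for each virtual user.
    wait_time = between(1, 3)

    @task(3)
    def read_dashboard(self):
        # Hypothetical read-heavy endpoint; substitute real routes.
        self.client.get("/dashboard")

    @task(1)
    def write_record(self):
        # Hypothetical write path that exercises the connection pool.
        self.client.post("/records", json={"note": "synthetic load"})
```

Run it against the new environment with a realistic user count (something like `locust -f loadtest.py --host https://new-env.example.com --users 500 --spawn-rate 50`) and watch the database's connection metrics while it runs.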
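Gradual traffic shifting doesn't need a feature-flag platform, either. One common approach on AWS is Route 53 weighted records, sketched below; the hosted zone, hostnames, and weights are placeholders:

```python
import boto3

route53 = boto3.client("route53")

def set_traffic_split(new_env_weight: int) -> None:
    """Route new_env_weight% of traffic to the new environment, the rest to the old one."""
    changes = []
    for set_id, target, weight in [
        ("old-env", "old.example.com", 100 - new_env_weight),
        ("new-env", "new.example.com", new_env_weight),
    ]:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",   # placeholder record name
                "Type": "CNAME",
                "SetIdentifier": set_id,     # required for weighted routing
                "Weight": weight,
                "TTL": 60,                   # short TTL so shifts take effect quickly
                "ResourceRecords": [{"Value": target}],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",      # placeholder hosted zone
        ChangeBatch={"Changes": changes},
    )

# Start small and widen as the new environment proves itself:
# set_traffic_split(5), then 25, then 100.
```

Any load balancer or DNS provider with weighted routing gives you the same lever; the point is that the percentage is a dial you turn, not a switch you flip.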
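And "alerting before you need it" can be as small as one alarm on the metric that actually bit us. `AWS/RDS` and `DatabaseConnections` are the real CloudWatch names; the instance identifier, threshold, and SNS topic below are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="rds-connection-saturation",
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "app-db-prod"}],
    Statistic="Average",
    Period=60,                     # one-minute datapoints
    EvaluationPeriods=3,           # sustained for three minutes before alarming
    Threshold=80,                  # well below the instance's max_connections
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
    TreatMissingData="breaching",  # no data from a brand-new instance is itself a problem
)
```

One alarm like this, created before cutover, turns "two hours of confused log reading" into a page that names the problem.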
The Actual Outcome
We're fully on AWS now and it was worth doing. The new architecture scales better, costs slightly less, and we have capabilities we didn't have before. But I'd have made very different choices about the process if I could do it again.
Infrastructure migrations are product risks, not just technical tasks. They deserve the same planning, testing, and gradual rollout you'd give to a major feature launch.