What's your worst production migration story? I'll share mine so you feel better about yours.
Two years ago, I ran a migration that added a NOT NULL column with a default value to a table with 80 million rows. On a Friday afternoon. Without testing it against production-sized data first.
For those who haven't experienced this particular joy: in older Postgres versions (pre-11), adding a NOT NULL column with a default required rewriting the entire table. My "simple" schema change locked the table for 47 minutes. During peak traffic. The on-call Slack channel still haunts my dreams.
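For context, the offending migration looked roughly like this (table and column names changed, obviously):

```sql
-- Roughly what the migration did (hypothetical names).
-- On Postgres < 11 this rewrites every one of the 80 million rows
-- while holding an ACCESS EXCLUSIVE lock on the table.
ALTER TABLE orders
    ADD COLUMN status text NOT NULL DEFAULT 'pending';
```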
Lessons learned the hard way:
Always test migrations against production-volume data. Your dev database with 500 rows will lie to you about how long things take. We now maintain a staging environment with anonymised production data specifically for this.
Know your Postgres version's behaviour. Postgres 11+ can add columns with defaults without a table rewrite in most cases. But if you're on an older version, or your default is a volatile function, you're still in rewrite territory.
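A rough illustration of the difference, with hypothetical names; worth checking against the docs for whatever version you're actually running:

```sql
-- Postgres 11+: a constant default is stored in the catalog,
-- so this no longer rewrites the table.
ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'pending';

-- Still a full rewrite on any version: gen_random_uuid() is volatile,
-- so every existing row needs its own value computed and written.
ALTER TABLE orders ADD COLUMN external_id uuid NOT NULL DEFAULT gen_random_uuid();
```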
Have a rollback plan before you start. "We'll figure it out if something goes wrong" is not a plan. I now write the rollback migration before I write the forward migration.
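In practice that means the down migration is sitting in the repo, reviewed, before the up migration ever runs. A minimal sketch with hypothetical names:

```sql
-- Forward migration
ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'pending';

-- Rollback migration, written first
ALTER TABLE orders DROP COLUMN status;
```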
Separate deployments from migrations. We moved to a model where code deploys and schema changes are independent events. New code handles both the old and new schema, migration runs whenever it's safe, then we clean up the compatibility code later.
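In migration terms, that usually looks like an expand/contract sequence rather than one big change. Here's a sketch of how I'd split the original column addition today (hypothetical names; batch sizes and timing depend entirely on your workload):

```sql
-- Step 1 (expand): add the column as nullable with no default.
-- Metadata-only change, no rewrite, no long lock.
ALTER TABLE orders ADD COLUMN status text;

-- Step 2: deploy code that writes status on new rows
-- but tolerates NULL on old ones.

-- Step 3: backfill existing rows in small batches, off-peak,
-- instead of one giant UPDATE.
UPDATE orders
SET status = 'pending'
WHERE status IS NULL
  AND id BETWEEN 1 AND 100000;  -- repeat per batch

-- Step 4 (contract): once the backfill is done, enforce the constraint.
-- Note this still scans the table to validate; on newer versions a
-- CHECK ... NOT VALID plus VALIDATE CONSTRAINT avoids the long lock.
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
```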
Friday deployments are not a thing. This one should've been obvious, but apparently I needed to learn it experientially.
I've heard even worse stories—dropped columns that were still being referenced, foreign key additions that locked multiple tables in a cascade, character encoding changes that corrupted data. Makes me feel marginally better about my 47 minutes of downtime.
What's your migration war story? And more importantly, what did you change about your process afterwards?