MySQL 8 DDL Without Downtime: Running ALTER TABLE on Large Production Tables

#ALTER TABLE MySQL 8
Sandor Farkas - Founder & Lead Developer at Wolf-Tech

Sandor Farkas

Founder & Lead Developer

Expert in software development and legacy code optimization

The outage almost always starts the same way. A migration that ran in two seconds against the staging database, where the table held forty thousand rows, is deployed against production, where the same table holds ninety million. A few seconds later the application starts timing out, the connection pool saturates, and the on-call engineer is staring at a query that says ALTER TABLE and a metadata lock that will not let go. Running ALTER TABLE on MySQL 8 against a large, busy production table is one of the few operations that can take a healthy system down in under a minute, and it does so quietly, because nothing is technically broken. The schema change is simply holding a lock that every other query is now queued behind.

The good news is that this is a solved problem. MySQL 8 ships with real online DDL, and where the built-in support runs out, mature external tools fill the gap. What separates teams that change schemas calmly from teams that schedule maintenance windows and pray is not luck. It is knowing exactly which operations are safe to run in place, which ones lock, and which ones need a copy-and-swap tool. This post walks through all three.

Why ALTER TABLE on MySQL 8 Locks When You Least Expect It

Before MySQL 5.6, almost every ALTER TABLE rebuilt the entire table and held an exclusive lock for the duration. MySQL 8 is far better, but the improvement is uneven, and that unevenness is the trap. Some operations run instantly as a metadata-only change. Some run online, copying data in the background while reads and writes continue. And some still rebuild the table under a lock that blocks everything.

MySQL classifies each DDL operation by two properties that decide your fate. The first is the algorithm: INSTANT, INPLACE, or COPY. The second is whether the operation permits concurrent DML while it runs. An INSTANT change, such as adding a column at the end of the table in recent MySQL 8 releases, only touches metadata and finishes in milliseconds regardless of table size. An INPLACE change, such as adding a secondary index, rebuilds index structures but generally lets reads and writes continue. A COPY change rebuilds the whole table and blocks writes for the entire copy, which on a ninety-million-row table can mean minutes of effective downtime.

The dangerous part is that MySQL will silently fall back to a more expensive algorithm if the one you assumed is not supported for that specific change. You think you are running an instant column add; MySQL decides it needs a full copy because of a column position or a row format constraint, and you find out in production. The defense is to never guess. State the algorithm and lock level explicitly, and let MySQL reject the statement if it cannot honor them:

ALTER TABLE orders
  ADD COLUMN fulfilled_at DATETIME NULL,
  ALGORITHM=INSTANT;

If MySQL cannot perform this as an instant change, the statement fails immediately with an error instead of quietly rebuilding the table. That failure is a gift. It happens in code review or in a migration dry run, not at 3pm under peak load. Make ALGORITHM=INSTANT or ALGORITHM=INPLACE, LOCK=NONE an explicit part of every migration, and treat any statement that will not accept those clauses as a statement that needs a different strategy.

The Operations That Are Genuinely Safe in Place

A useful mental model is to sort every schema change into three buckets before it ever reaches production. The first bucket is instant or near-instant operations that you can run any time, on any table, under load. In MySQL 8 this includes adding a nullable column, adding a column with a default in recent versions, renaming a column, dropping a column in 8.0.29 and later via the instant algorithm, setting or dropping a column default, and renaming an index. These touch metadata or run in place and do not meaningfully threaten availability.

The second bucket is online but expensive operations. Adding or dropping a secondary index, changing a column to a compatible type, and adding a foreign key generally run with ALGORITHM=INPLACE and LOCK=NONE, so concurrent traffic continues, but they still rebuild substantial structures and consume IO and CPU. On a large table these can run for many minutes. They are safe in the sense that they do not block, but they are not free. Run them during lower-traffic periods, watch replication lag, and be ready for elevated latency while they proceed.

The third bucket is the operations that still require a full table copy under MySQL 8: changing a column to an incompatible type, changing the primary key, converting the character set, and certain row format changes. These are the ones that lock, and these are the ones you must not run with a plain ALTER TABLE on a large production table. They need an external online schema change tool.

The discipline that prevents incidents is simply running every migration through this triage first. A short audit of your pending migrations against these buckets catches the dangerous ones before they ship, and it is exactly the kind of review we fold into our code quality consulting work when a team's deployment pipeline keeps producing surprises.

When You Need pt-online-schema-change or gh-ost

For the third bucket, the underlying technique is always the same. You cannot alter the table in place without locking, so you build a new table with the desired schema, copy the existing rows into it in small batches, capture any changes that happen during the copy, and then swap the new table in for the old one in a single atomic rename. Two tools have made this technique boring and reliable: pt-online-schema-change from Percona, and gh-ost from GitHub.

pt-online-schema-change, usually shortened to pt-osc, uses triggers. When it starts, it creates the new table and installs AFTER INSERT, AFTER UPDATE, and AFTER DELETE triggers on the original table so that every live write is mirrored into the copy while the tool batches through the existing rows. A typical invocation looks like this:

pt-online-schema-change \
  --alter "ADD COLUMN fulfilled_at DATETIME NULL" \
  --execute \
  D=shop,t=orders

The trigger-based approach is robust and battle-tested, but the triggers add write overhead to every transaction touching the table while the migration runs, and they interact poorly with tables that already have triggers. On a very write-heavy table that overhead matters.

gh-ost takes a different path. Instead of triggers, it reads the MySQL binary log and replays the table's changes onto the copy from outside the database. Because it does not install triggers, it adds no synchronous write overhead to the original table, and crucially it is pausable and throttleable on the fly. You can slow it down or stop it entirely if it starts to affect production, then resume:

gh-ost \
  --database="shop" \
  --table="orders" \
  --alter="ADD COLUMN fulfilled_at DATETIME NULL" \
  --max-load=Threads_running=40 \
  --critical-load=Threads_running=80 \
  --execute

The practical choice between them usually comes down to your replication topology and write volume. gh-ost shines on write-heavy primaries and when you want fine-grained, observable throttling against replication lag. pt-osc is the safer default when you have foreign keys to deal with, since gh-ost's handling of foreign keys is deliberately conservative. Both tools do the same atomic cut-over at the end, and both turn a locking operation into a long-running but non-blocking one.

The Locking Pitfalls That Still Bite

Even with the right tool, a handful of pitfalls cause most of the remaining incidents. The first is metadata locks. Any ALTER TABLE, even an instant one, needs a brief metadata lock, and it cannot acquire that lock while a long-running transaction is holding the table open. A reporting query or a forgotten open transaction can block your migration, and worse, your migration now sits in the queue holding its own pending lock, behind which every new query stacks up. Always set a short lock-wait timeout so a blocked DDL fails fast instead of building a queue: SET SESSION lock_wait_timeout = 5; before the statement turns a silent pile-up into a clean, retryable error.

The second pitfall is the atomic cut-over itself. Both pt-osc and gh-ost finish with a rename that needs a momentary exclusive lock. It is brief, usually milliseconds, but if a long transaction is holding the table, even that brief lock can stall. Run the cut-over when you can confirm there are no long transactions in flight, and use the tools' built-in cut-over controls rather than forcing it.

The third pitfall is replication lag and disk. A copy-and-swap migration duplicates the entire table on disk before it drops the original, so a table that is half your free space cannot be migrated this way until you add capacity. And the row-by-row copy generates a large volume of replicated writes; on a constrained replica, that alone can push lag past acceptable thresholds and break read-after-write expectations downstream. Throttle on replication lag, not just on primary load, and confirm you have headroom on every replica before you start.

The fourth is foreign keys, which complicate every copy-based approach because the constraints reference the old table name and must be rebuilt against the new one. This is the single most common reason a schema change that looked routine turns into a careful, tool-specific procedure. If your schema leans heavily on foreign keys and large tables at the same time, that combination deserves a deliberate review rather than an ad-hoc migration.

Building This Into How You Ship

The teams that change schemas without drama are not braver. They have simply moved the decision earlier. Every migration is triaged into instant, online, or copy-and-swap before review. Every in-place ALTER TABLE on MySQL 8 carries an explicit ALGORITHM and LOCK clause so the database refuses anything unexpected. Every copy-and-swap goes through gh-ost or pt-osc with throttling tied to replication lag. And every one of these runs first against a production-sized dataset, never against a tiny staging table that hides the cost.

If your migrations keep producing surprises, or you are carrying a legacy MySQL schema that makes every change feel risky, that is usually a sign the data model and the deployment process need attention together rather than separately. We help teams untangle exactly this through our legacy code optimization and custom software development work, turning schema changes from a quarterly maintenance window into a routine deployment.

If you want a second set of eyes on a migration strategy before it ships, write to us at hello@wolf-tech.io or read more about how we work at wolf-tech.io. A short conversation before the next big ALTER TABLE is a great deal cheaper than the postmortem after it.