Percona Series / Break online schema change

PumpkinSeed
4 min read · Nov 3, 2020


If you are not familiar with Percona’s online schema change, check the previous post about it: it explains in detail what the tool is and how it works, with some real-life scenarios.
What it still doesn’t answer is what happens if something goes wrong. As you have seen, a schema change takes 20–40 minutes, and in that window almost anything can happen: a power outage, another service eating up the resources the tool needs, or you simply changing your mind and pressing Ctrl+C. In this post I want to dig into the details of what happens when a failure takes over.

SIGINT

Pressing Ctrl+C sends a SIGINT (program interrupt), which asks the program to shut down gracefully. That means the program has time to clean up the mess it has made (?!), or at least tell you how to do it yourself…

^C# Exiting on SIGINT.
Not dropping triggers because the tool was interrupted. To drop the triggers, execute:
DROP TRIGGER IF EXISTS `fluidpay`.`pt_osc_fluidpay_tx_del`
DROP TRIGGER IF EXISTS `fluidpay`.`pt_osc_fluidpay_tx_upd`
DROP TRIGGER IF EXISTS `fluidpay`.`pt_osc_fluidpay_tx_ins`
Not dropping the new table `fluidpay`.`__tx_new` because the tool was interrupted. To drop the new table, execute:
DROP TABLE IF EXISTS `fluidpay`.`__tx_new`;
`fluidpay`.`tx` was not altered.

I pressed Ctrl+C in the middle of the copy and that was the message. It’s not a huge problem, because the triggers keep loading data into both tables. If I run a COUNT(*) on tx I get 10,477,088, while previously it was 10,476,887, which proves there was no data loss. The only issue is that __tx_new keeps growing as well, so if we don’t apply the suggested DROPs the database will end up surprisingly big.
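
As a quick sanity check you can compare the two tables and then run the cleanup the tool printed (a sketch using the table and trigger names from the log above):

-- the original table is intact, the half-copied one is disposable
SELECT COUNT(*) FROM `fluidpay`.`tx`;
SELECT COUNT(*) FROM `fluidpay`.`__tx_new`;
-- cleanup, exactly as pt-online-schema-change suggested
DROP TRIGGER IF EXISTS `fluidpay`.`pt_osc_fluidpay_tx_del`;
DROP TRIGGER IF EXISTS `fluidpay`.`pt_osc_fluidpay_tx_upd`;
DROP TRIGGER IF EXISTS `fluidpay`.`pt_osc_fluidpay_tx_ins`;
DROP TABLE IF EXISTS `fluidpay`.`__tx_new`;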

SIGKILL

If you run the same command right after the SIGINT, it will complain about the existing triggers and suggest the --preserve-triggers flag. If you didn’t read the original documentation (like me), you will end up with the Trigger already exists error: this flag only works on MySQL 5.7.2 or above, because that is the first version that supports multiple triggers on a table.
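
For reference, the retry with the suggested flag would look roughly like this (the base command is the one visible in the ps output below; this assumes MySQL 5.7.2 or newer):

pt-online-schema-change \
  --alter "ADD COLUMN data_field4 INT" \
  --preserve-triggers \
  --execute \
  --critical-load Threads_running=200 \
  --ask-pass \
  D=fluidpay,t=tx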

After I cleaned up the existing triggers and dropped the new temporary table, I started the command again.

$ ps aux | grep pt-online-schema-change
root      106813  1.4  0.4 104724 33380 pts/2    S+   14:13   0:00 /usr/bin/perl /bin/pt-online-schema-change --alter ADD COLUMN data_field4 INT --execute --critical-load Threads_running=200 --ask-pass D=fluidpay,t=tx
$ kill -9 106813

So what happened to the command? The same as in the SIGINT case, just without the cleanup instructions, because SIGKILL gives the tool no chance to print anything. But as the previous section showed, we lost nothing.

mysql> SHOW TRIGGERS;
...
3 rows in set (0.01 sec)

So we are left with the triggers, which we have to clean up manually once again.
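
Beyond the triggers, the half-copied table is left behind too, so after a hard kill it is worth looking for leftovers before retrying (a sketch based on the naming pattern in the logs above):

SHOW TRIGGERS FROM `fluidpay` LIKE 'tx';
SHOW TABLES FROM `fluidpay` LIKE '%tx_new';
-- then the same DROP TRIGGER / DROP TABLE statements as in the SIGINT section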

Kill the Percona node

Currently I’m running the mysql nodes in a tmux session, so systemctl can’t stop them. I started the schema change again, and then I manually killed the mysql instance.

Copying `fluidpay`.`tx`:  11% 11:39 remain
2020-11-03T11:10:37 Dropping triggers...
2020-11-03T11:10:37 Error dropping trigger: DBI connect('fluidpay;;mysql_read_default_group=client','',...) failed: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) at /bin/pt-online-schema-change line 2345.
2020-11-03T11:10:37 Error dropping trigger: DBI connect('fluidpay;;mysql_read_default_group=client','',...) failed: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) at /bin/pt-online-schema-change line 2345.
2020-11-03T11:10:37 Error dropping trigger: DBI connect('fluidpay;;mysql_read_default_group=client','',...) failed: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) at /bin/pt-online-schema-change line 2345.
2020-11-03T11:10:37 To try dropping the triggers again, execute:
DROP TRIGGER IF EXISTS `fluidpay`.`pt_osc_fluidpay_tx_del`
DROP TRIGGER IF EXISTS `fluidpay`.`pt_osc_fluidpay_tx_upd`
DROP TRIGGER IF EXISTS `fluidpay`.`pt_osc_fluidpay_tx_ins`
`fluidpay`.`tx` was not altered.
(in cleanup) 2020-11-03T11:10:37 Error copying rows from `fluidpay`.`tx` to `fluidpay`.`_tx_new`: DBI connect('fluidpay;;mysql_read_default_group=client','',...) failed: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) at /bin/pt-online-schema-change line 2345.
2020-11-03T11:10:37 Error copying rows from `fluidpay`.`tx` to `fluidpay`.`_tx_new`: DBI connect('fluidpay;;mysql_read_default_group=client','',...) failed: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) at /bin/pt-online-schema-change line 2345.

I tried to separate the output based on the main issues. If we analyse what we see here, the schema change notices that the underlying mysql instance isn’t available anymore, so it initiates a cleanup. Since the cleanup tries to remove the mess on the instance itself, we get three connection errors for the three trigger drops. The last error is the actual cause: the tool couldn’t finish copying the rows.

After that I started the instance again, and everything went back to normal. The cluster size is 2 again, and the synchronisation works properly as well. The killed instance was node2, while the data loader tool used node1, so there was no data loss at all.
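
A quick way to confirm the node rejoined (a sketch assuming a Percona XtraDB Cluster / Galera setup, which is where the cluster size figure comes from) is to check the wsrep status variables:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';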

In this case it makes sense to keep a dedicated instance just for the schema change. If the tool kills that node for whatever reason, it won’t affect the other parts of the system.

Conclusion

It’s really hard to break a working cluster with the schema change tool. Every part of the tool was built with safety in mind, so in case of a failure the schema change immediately stops execution and tells us how to clean up.

The chance of data loss is minimal, because the only action that touches the original table is the RENAME TABLE swap at the very end. It takes about a second, and since it’s an atomic operation in the database, I don’t actually think it can fail with data corruption.
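
For context, the final swap is a single atomic statement, roughly like the one below (a sketch using the table names from this post; the exact _old suffix the tool generates may differ):

RENAME TABLE `fluidpay`.`tx` TO `fluidpay`.`_tx_old`,
             `fluidpay`.`_tx_new` TO `fluidpay`.`tx`;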
