Wednesday, November 23, 2016

To skip or not to skip replication errors

Replication in MySQL is the standard way high availablity is achieved. Ineveitable, replication will break because of data inconsistency problems between the master and replicas.

Deciding what to do or what not to do when these errors happen is typically the DBA job. Sometimes, the error is not important to the application or the data and can be skipped. I onced worked with an application that would failover to a secondary master when it reached maxed connections and this would always result in the data inconsisties. The primary and secondary master were using master-master replication but the application would only write to one master at a time. The old primary master would not get removed from the cluster when the switchover happened and there would always be data differnces because of the way the application wrote data. For that particular application, I wrote a script that was monitoring replication and would skip the error and restart. It would do this all day long until the work day was over and we could schedule a planned swithcover and clean up the data.


Instead of writing a custom script to do this, MySQL has this option built in. There is an entry in the my.cnf called "slave-skip-errors" which can be used to automatically skip specific errors. For example, if you have data on the master that for whatever reason doesn't exist on the slave or vice versa, such as an insert that succeeded on the master but failed on the slave because that row already existed, or a DELETE on the master that succeeded but fails on the slave, those error numbers can be automatically skipped.

There is some risk involved here because the real problem that needs to be addressed is why is there missing data between master and slave? The root problem of the data inconsistiences really ought to be addressed.

I have one client that uses this in all of their my.cnf configs:

slave-skip-errors = 1032,1062

Some servers, this client also skips error 1452 and 1594.

I've also sometimes used this error number because it appears so many times:
1114

This site has a some great information on troubleshooting replication errors:

http://www.fromdual.com/replication-troubleshooting-classic-vs-gtid

Here is a related question and answer on stackexchange:

http://dba.stackexchange.com/questions/130366/which-error-codes-are-safe-for-slave-skip-errors-with-mysql-innodb

No comments:

Post a Comment