Monday, October 23, 2017

Slave skip counter terminates the entire transaction

If you have been an admin for MySQL with replication, at some point you are going to need to skip statements on a replica that passed on the master but are failing on the slave. There are too many reason this can happen and depending on the situation it might be safe to skip or might lead to data inconsistencies on the replica.

When you skip the error, it skips the entire transaction. Even if the first part of the transaction would have been successful because there are multiple statements in the transaction and one of them fails, then all of them are skipped on the slave.

This has been accurately described by Jervin Real in this blog post:

https://www.percona.com/blog/2013/07/23/another-reason-why-sql_slave_skip_counter-is-bad-in-mysql/

Something to note is that skipping errors on Aurora is different than on normal MySQL server:

On a normal MySQL slave you would run this;

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

show slave status;

If you have thousands of errors and want to skip them all then you can put a large number in the SQL_SLAVE_SKIP_COUNTER. If you have that many error then you probably have larger issues and might want to consider rebuilding the slave to prevent data inconsistency problems.

However on Aurora you can only skip one error at a time:

CALL mysql.rds_skip_repl_error;
show slave status;
 
You don't have to stop and start replication but do need to use the stored procedure provided by the Aurora folks.

I have a client that uses Aurora as  a disaster recovery "site". The client has thousands of databases that are being replicated from their local data center into Aurora. If their local data center were to go down, they have all their data in Aurora. However, periodically some statement breaks replication and I have to skip replication if it is safe to do so.

One of the semi frustrating things I have to deal with is the widespread use of the BLACKHOLE storage engine in Aurora by my client. When replicating from the local data center to Aurora, the client does not always want to replicate every single database into Aurora. There might be hundreds of databases on a single server and the client only want to replicate one of them into Aurora. What I do is export and import all the tables for all the database into Aurora. Then I change all the tables for the databases I don't want to replicate to BLACKHOLE. However, eventually people start creating new tables, modifying existing tables on the master server which is located in the local data center and those DDL changes are replicated to the database for which I have set all the tables to BLACKHOLE. Eventually this causes replication to break. Because I don't care about the data in the BLACKHOLE tables I can skip all the errors but sometimes I have to skip hundreds of errors.

Thursday, October 19, 2017

How to break replication to Aurora instance slaves

I have a client which is migrating database to the AWS cloud. We are setting up Aurora slaves from the physical data to replicate all data. Today I was updating my permissions for my own user and ran a query like this on all the physical machines:

GRANT ALL PRIVILEGES ON *.* TO 'me'@'%' IDENTIFIED BY PASSWORD '*PASSWORDHASH' WITH GRANT OPTION;

That query ran fine on all the physical machine in the data center but it broke replication on every single Aurora instance that was replicating. This is because there are many permissions which are not allowed in Aurora because only the "SUPER" user can have them and that user is controlled by Amazon. When granting permissions or changing password hash, we have to be careful to only run grant statements which will not break replication on Aurora.

Here is an interesting list of other things you can learn from doing migrations to amazon rds:

https://www.percona.com/blog/2014/07/28/what-i-learned-while-migrating-a-customer-mysql-installation-to-amazon-rds/

Tuesday, October 10, 2017

if else / case statements in MySQL queries

Being able to change what is displayed in the query or to display the results of a different column based on criteria in a WHERE clause is super helpful. Because this is so useful and something I use so frequently I writing a blog post on it.

Here is am example query I wrote. I want to display the Aurora "EndPoint" alias when I provide an IP Address. I've created two tables, one which has RDS/Aurora Cluster details, and one with RDS/Aurora Instance details. I've pulled this information from the AWS API and stored it locally to easily query it in a relational database. If the IP Address turns out to be the writer end point then I want the c.EndPoint column value to be displayed. If the IP Address is for reader end point then I want the c.ReaderEndpoint to be displayed. If there is only one instance in the cluster then this query will always return c.EndPoint (writer end point).


    SELECT
    IF(i.IsClusterWriter = 1,  c.EndPoint,  c.ReaderEndpoint ) AS alias
    FROM RDSCluster c
    INNER JOIN RDSInstance i ON (i.DBClusterIdentifier = c.DBClusterIdentifier)
    WHERE i.IPAddress = 'xx.xx.xxx.xx';

With a case statement, I could also write it like this:

    SELECT
    (case when (i.IsClusterWriter = 1)
 THEN
      c.EndPoint
 ELSE
      c.ReaderEndpoint
 END)
 as alias
     FROM RDSCluster c
    INNER JOIN RDSInstance i ON (i.DBClusterIdentifier = c.DBClusterIdentifier)
    WHERE i.IPAddress = 'xx.xx.xxx.xx'

Other examples:

https://stackoverflow.com/questions/8763310/how-do-write-if-else-statement-in-a-mysql-query

Tuesday, October 3, 2017

Clearing the buffer cache before starting MySQL - Flush out the file system cache

Here was as good question I found on stackexchange:

How do you empty the buffers and cache on a Linux system?

http://unix.stackexchange.com/questions/87908/how-do-you-empty-the-buffers-and-cache-on-a-linux-system

If I don't flush out the file system cache when restarting MySQL on prod severs, I frequently get timeouts when restarting. The servers I typically work with have about 250GB of RAM, 40 CPU. Something like this will happen where I have to kill the start command or it just doesn't start.

# service mysql status
MySQL (Percona Server) running (182654)                    [  OK  ]
# service mysql restart
Shutting down MySQL (Percona Server).......................[  OK  ]......................................
Starting MySQL (Percona Server).......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................^C

However, if I run this:

sync && echo 3 > /proc/sys/vm/drop_caches

MySQL starts up quick.

Starting MySQL (Percona Server)............................[  OK  ]....................