Tuesday, June 23, 2015

pt-upgrade

I love the idea of pt-upgrade in the Percona Toolkit but have struggled to effectively use the tool.

I've taken a backup of all the databases on a database server and restored that backup onto two different hosts, one running MySQL 5.1 and one running MySQL 5.6. I want to find queries that work on MySQL 5.1 but fail on MySQL 5.6.

For initial testing, I just want to find SELECT statements that return different results or fail.

I've read over the documentation and this is how I'm executing the tool:

pt-upgrade h=SERVER1 -uUSERNAME -pPASSWORD h=SERVER2 -uUSERNAME -pPASSWORD --type='genlog' --max-class-size=2 --max-examples=2 --run-time=10m 'general_log_mysqld.log' 1>report.txt 2>err.txt &

I've chosen to use the general log in this example instead of the slow query log, but I've tested with both. The slow query log is more bloated with useless info, while the general log gives more bang for your buck in terms of size. The big problem I have with this tool is that I have hundreds of databases on my server. Inside the general log there are lots of "use <database>" statements before each query. However, when I run the tool it complains about not being able to run the queries because the tables are missing:


On both hosts:

DBD::mysql::st execute failed: No database selected [for Statement "....

I can't just select one database when I run the command; the tool needs to read the USE <database> statements from the log file.

I tried again and found a database server with only a few databases. Sometimes the tool worked and sometimes it failed to use the correct database. As a workaround, I created the same tables inside each database. This only worked because all the table names were unique, so it didn't matter which database the tool ran a query against. At this point, at least I got a report with useful results.
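
For anyone curious, the workaround was roughly along these lines. This is just a sketch and not my exact commands: it grabs schema-only CREATE TABLE statements from every database and then loads the combined file into every database, so each schema ends up with every table. The --skip-add-drop-table flag keeps it from dropping anything that already exists, --force lets mysql skip the "table already exists" errors, and you would run it against each test host (SERVER2 here is a placeholder).

DBS=$(mysql -h SERVER2 -uUSERNAME -pPASSWORD -Bse "SHOW DATABASES" | grep -Ev '^(information_schema|performance_schema|mysql)$')

# schema-only dump of every database, no triggers and no DROP TABLE statements
for db in $DBS; do
    mysqldump -h SERVER2 -uUSERNAME -pPASSWORD --no-data --skip-triggers --skip-add-drop-table "$db" >> all_tables.sql
done

# load the combined schema into every database, skipping tables that already exist
for db in $DBS; do
    mysql -h SERVER2 -uUSERNAME -pPASSWORD --force "$db" < all_tables.sql
done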

It isn't always feasible to do what I did by putting the same tables in every database...what am I doing wrong with this tool?

Issues I've had with pt-upgrade:
  1. Queries pulled from production can be too intensive for a test machine with limited resources. The queries I was testing with came from systems with 30+ cores and 64 GB of RAM; on my test hosts they would time out after a few minutes and the tool would stop working, so queries taken from prod should really be run on prod-like database servers. Also, queries like "SELECT SQL_NO_CACHE * FROM table_name" that came from a backup job seemed to break the tool (see the filter sketch right after this list).
  2. Some queries broke the tool. This included not only the backup queries, but also weird custom SQL written internally by developers.
  3. The same query would appear so frequently in the logs (hundreds of thousands of times) that the reports became useless.
  4. There were a lot of failed SELECT statements because of missing tables. These were not true temporary tables, but tables created for a short time and then dropped. They were not part of my dump and restore because I didn't know the code was creating these "transitory" tables.
  5. The tool would not switch database context for me. I don't know why, but when the log issued a "use database_name" the tool would ignore it and try to run a query that was meant for a different database, then report a huge number of failed queries because of this. My workaround was to create the same tables in every schema so I would stop getting those failures.
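
For the backup queries specifically, a pt-query-digest filter can screen them out before the log ever reaches pt-upgrade. This is only a sketch; the SQL_NO_CACHE pattern and the file names are placeholders for whatever is actually polluting your log:

pt-query-digest --filter '$event->{arg} !~ m/SQL_NO_CACHE/i' --no-report --output slowlog my_slow.log > my_slow_no_backup_queries.log
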
After engaging Percona consulting, they told me to use only the SLOW LOG files. For my system, they said I would need to massage the files with pt-query-digest before running pt-upgrade, like this:

Step 1: Massage a slow query log for SELECT statements using pt-query-digest
pt-query-digest --filter '$event->{arg} =~ m/^select/i' --sample 5 --no-report --output slowlog my_slow.log > my_slow_massaged_for_select.log
Step 2: Next run pt-upgrade
pt-upgrade h=my_source_host.com -uUSER -pPASSWORD h=my_target_host.com -uUSER -pPASSWORD --type=slowlog --max-class-size=1 --max-examples=1 --run-time=1m 'my_slow_massaged_for_select.log' 1> report_1.txt 2> error_1.txt &
The above massaging worked well for SELECT statements, but when testing DDL/DML I ran into more problems. I would still see a lot of errors for tables that don't exist because they are transitory tables that come and go during a session.

Step 1: Massage a slow query log for SELECT/DDL/DML statements using pt-query-digest
pt-query-digest --filter '$event->{arg} =~ m/^(select|alter|create|drop|insert|replace|update|delete)/i' --sample 5 --no-report --output slowlog alpha_slow.log > my_slow_massaged_for_dml_ddl.log
Step 2: Clean up the log file to remove LOCK statements
Because my slow log file had a number of LOCK statements, I used sed to remove all rows with any reference to locks.
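
Something like this is what I mean by that (a sketch, not the exact command I ran; the pattern and file name would depend on what's actually in your log):

sed -i '/LOCK TABLES/d' my_slow_massaged_for_dml_ddl.log
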
Step 3: Next run pt-upgrade
pt-upgrade h=my_source_host.com -uUSER -pPASSWORD h=my_target_host.com -uUSER -pPASSWORD --type=slowlog --max-class-size=1 --max-examples=1 --run-time=1m --no-read-only 'my_slow_massaged_for_dml_ddl.log' 1> report_2.txt 2> error_2.txt &
Even after running all the DDL, I would still see a lot of errors for tables that don't exist because they are transitory tables that come and go during a session. Still, at this point I had more confidence that the upgrade was going to work.


Friday, June 19, 2015

Why is the triggers table in the information_schema so slow?

We have a sharded MySQL infrastructure at work where we sometimes create new shards from a .sql file. Each shard has all the same tables/triggers/functions, etc., but the data is unique to the customer to which that shard is assigned. The file is created from an alpha environment which has different users than production. This started to result in triggers/functions on production whose definers reference users that do not exist there. I wrote a script to send alerts for these, but manually fixing them was getting annoying. The permanent fix so this doesn't happen anymore is in the works, but the bureaucracy at work is taking too long, so I wrote a bash script to fix them automatically on production. A portion of my bash script was inspired by this site:

http://codersresource.com/news/dzone-snippets/change-ownership-of-definer-and-triggers

Here is the script from the above site:

#!/bin/bash 
host='localhost' 
user='root' 
port='3306' 
# following should be the root@localhost password 
password='root@123' 

# triggers backup 
mysqldump -h$host -u$user -p$password -P$port --all-databases -d --no-create-info > triggers.sql 
if [[ $? -ne 0 ]]; then exit 81; fi 

# stored procedure backup 
mysqldump -h$host -u$user -p$password -P$port --all-databases --no-create-info --no-data -R --skip-triggers > procedures.sql 
if [[ $? -ne 0 ]]; then exit 91; fi 

# triggers backup 
mysqldump -h$host -u$user -p$password -P$port --all-databases -d --no-create-info | sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' > triggers_backup.sql 
if [[ $? -ne 0 ]]; then exit 101; fi 

# drop current triggers 
mysql -h$host -u$user -p$password -P$port -Bse"select CONCAT('drop trigger ', TRIGGER_SCHEMA, '.', TRIGGER_NAME, ';') from information_schema.triggers" | mysql -h$host -u$user -p$password -P$port 
if [[ $? -ne 0 ]]; then exit 111; fi 

# Restore from file, use root@localhost credentials 
mysql -h$host -u$user -p$password -P$port < triggers_backup.sql 
if [[ $? -ne 0 ]]; then exit 121; fi 

# change all the definers of stored procedures to root@localhost 
mysqldump -h$host -u$user -p$password -P$port --all-databases --no-create-info --no-data -R --skip-triggers | sed -e 's/DEFINER=[^*]*\*/\*/' | mysql -h$host -u$user -p$password -P$port 
if [[ $? -ne 0 ]]; then exit 131; fi 

My script was different but the idea was the same. However, it blew up, and this is a bad idea. I had tested it several times on a non-prod server and it was working pretty well, but that testing environment only had about a dozen databases. The particular MySQL server I ran it against has over 1,500 databases, and for whatever reason that I haven't been able to pinpoint, running a query on the triggers table there takes 20 minutes. That little portion in the above script that builds a DROP TRIGGER statement per trigger with SELECT CONCAT against information_schema.triggers and then pipes the results back into another mysql session is not a good idea! It will cause queries to get locked up and hold up replication. It created slave lag in our clustered environment, which got really far behind. There are plenty of other problems with our MySQL infrastructure, out of my control, which also contributed to this situation.
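
To give an idea of the kind of statement that crawls on that box, even a simple orphaned-definer check takes forever there because it hits the same information_schema.triggers table. This is only a rough sketch of the sort of query my alerting script runs, not the exact one:

mysql -h$host -u$user -p$password -P$port -e "SELECT TRIGGER_SCHEMA, TRIGGER_NAME, DEFINER FROM information_schema.TRIGGERS WHERE DEFINER NOT IN (SELECT CONCAT(User, '@', Host) FROM mysql.user);"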

Long story short, I'll be re-writing my version of the script to use SHOW TRIGGERS / SHOW FUNCTION STATUS / SHOW PROCEDURE STATUS because those commands run much faster. However, I'll have to loop over every single database and limit each query to only that database.
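
Something along these lines is what I have in mind. It's only a sketch (credentials are placeholders, and it just lists the objects per database; the definer-fixing part still has to be bolted on):

#!/bin/bash
host='localhost'
user='root'
port='3306'
password='root@123'

for db in $(mysql -h$host -u$user -p$password -P$port -Bse "SHOW DATABASES" | grep -Ev '^(information_schema|performance_schema|mysql)$'); do
    # SHOW TRIGGERS is scoped to a single database, so it comes back fast
    mysql -h$host -u$user -p$password -P$port -Bse "SHOW TRIGGERS FROM \`$db\`"
    mysql -h$host -u$user -p$password -P$port -Bse "SHOW FUNCTION STATUS WHERE Db = '$db'"
    mysql -h$host -u$user -p$password -P$port -Bse "SHOW PROCEDURE STATUS WHERE Db = '$db'"
done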

Tuesday, June 16, 2015

/bin/rm: Argument list too long

I was doing some replication testing with master-master on Percona 5.6 today and I kept having problems with my test instances not starting or shutting down after I changed the my.cnf file to use new relay log files. I had previously used these boxes for master-slave testing and had left them unattended for weeks; replication had broken and relay log files were building up like crazy.

I ran this:
STOP SLAVE;
RESET SLAVE;

cd into this directory:


/var/log/mysql/

Lots of relay files like this:

relay.137105

I thought that RESET SLAVE was supposed to delete those. Maybe it is and it's just taking a long time.

So I try to delete them:

rm /var/log/mysql/relay.*
-bash: /bin/rm: Argument list too long

rm /var/log/mysql/relay.1*

bash: /bin/rm: Argument list too long

I was able to run this:

rm /var/log/mysql/relay.12*
rm /var/log/mysql/relay.13*
rm /var/log/mysql/relay.14*
and on and on

I didn't want to spend all night doing this.

I tried this:

cd /var/log/mysql/

find . -name "relay.*" -print | xargs rm
Copied from here: http://itigloo.com/how-do-i/binrm-argument-list-too-long-error/

Still a no go. It just hangs. Seems to be too much for my underpowered VM to handle.

I tried it again with only one file:

find . -name "relay.097436" -print | xargs rm

It worked. 

How about a few more files:

find . -name "relay.3*" -print | xargs rm

That worked.

I tried this again:


find . -name "relay.*" -print | xargs rm

At last it worked!!
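
For what it's worth, GNU find can also delete the files itself, which skips spawning rm entirely. I haven't gone back and timed it on that underpowered VM, so consider it an untested alternative:

find /var/log/mysql -name "relay.*" -delete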

Tuesday, June 2, 2015

SSL error mysql dump

I was trying to do a database dump (non-locking) from a server that was running MySQL 5.5, and I kept getting this annoying SSL error when running the dump command from some of my Linux boxes. The command looked like this:

mysqldump -h'servername.com' -uroot -p --routines --lock-tables=false --quick  --databases database_name   > dump.sql

The error looked like this:
ERROR 2026 (HY000): SSL connection error: error:00000001:lib(0):func(0):reason(1)

I found this post which helped me finally get a dump file:
http://stackoverflow.com/questions/31413031/mysql-error-2026-hy000-ssl-connection-error-error00000001lib0func0re

I just had to add --skip-ssl to get my data:

mysqldump -h'servername.com' -uroot -p --routines --lock-tables=false --quick --skip-ssl  --databases database_name   > dump.sql
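
If the same client box hits this on every dump, it might be less annoying to put the option in the client config instead of remembering the flag each time. I haven't tested this myself, but the [client] group in ~/.my.cnf applies to mysqldump as well:

[client]
skip-ssl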