Percona Live was amazing this year. As a MySQL database guy who lives and breathes this stuff for 10 hours a day, I really got a lot out of the conference. I came away with a lot of takeaways: ways to improve my MySQL performance monitoring, a better process for Linux systems troubleshooting, and ideas for improvement in other areas. I met a number of people from around the world. I also got to meet Bill Nye the Science Guy.
The food was good (except on the last day, when lunch was a boring box lunch). Thursday night there was a really fun game night with Segways, virtual reality, bowling, trampolines, darts, CO2 guns, video games and much more.
This is a summary based on the sessions I was able to attend. There were 100+ sessions, so I only made it to about 20 over the four days.
Facebook is replacing all of their InnoDB installations with MyRocks (a MySQL storage engine they created). I anticipate that Percona and MariaDB will include the MyRocks storage engine as part of their distributions at some point in the coming years. https://www.percona.com/live/data-performance-conference-2016/sessions/myrocks-deep-dive-flash-optimized-lsm-database-mysql-and-its-use-case-facebook
Prometheus and Grafana could replace older monitoring tools like Nagios and Ganglia:
https://www.percona.com/blog/2016/01/14/prometheus-as-an-engine-for-mysql-monitoring/
https://github.com/prometheus/mysqld_exporter
http://docs.grafana.org/datasources/prometheus/
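To get a feel for what the setup involves, here is a minimal sketch of hooking mysqld_exporter up to a MySQL instance (the GRANT follows the mysqld_exporter README; the 'exporter' user, password and localhost address are placeholders, and Prometheus then scrapes the exporter on its default port 9104):
# create a limited MySQL user for the exporter to connect with
mysql -u root -p -e "GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost' IDENTIFIED BY 'xxxxxxx';"
# start the exporter with the connection details in DATA_SOURCE_NAME
export DATA_SOURCE_NAME='exporter:xxxxxxx@(localhost:3306)/'
./mysqld_exporter &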
Google's Vitess looks pretty awesome and may be a better clustering option than Percona Cluster:
https://github.com/youtube/vitess
Orchestrator, a MySQL replication topology manager, looks very useful (I haven't used MHA, but I imagine this could replace it). It will be added to Vitess in a few months.
https://github.com/outbrain/orchestrator
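I haven't run it yet, but from the session the command-line mode looked roughly like this (a sketch from my notes, so treat the hostnames as placeholders and check the README for the exact syntax):
orchestrator -c discover -i db1.example.com:3306
orchestrator -c topology -i db1.example.com:3306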
Implementing GTIDs can make the binary logs identical on every server: https://www.percona.com/live/data-performance-conference-2016/sessions/mysql-gtid-implementation-maintenance-and-best-practices
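For reference, turning GTIDs on comes down to a few settings in my.cnf on every server (this is just the bare minimum from the 5.6/5.7 docs; migrating a live replication topology needs the step-by-step process covered in the session above):
# my.cnf on every server in the replication topology
gtid_mode = ON
enforce_gtid_consistency = ON
log_bin = mysql-bin
log_slave_updates = ON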
Dropbox has created a tool to visualize snapshots of the performance schema, which they are going to open source. I had been thinking about creating something similar. https://www.percona.com/live/data-performance-conference-2016/sessions/troubleshooting-scenarios-performance-schema
Vault is looking very promising as a way to manage users in MySQL: https://www.percona.com/live/data-performance-conference-2016/sessions/using-vault-decouple-secrets-applications
HAProxy appears to be the load balancer of choice. At one of the birds-of-a-feather sessions, almost everyone I spoke with said they install HAProxy on the application servers. It can handle 30K connections per instance and barely adds any load to the application server.
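The pattern people described is a small TCP listener on each application server sitting in front of the database pool. A minimal sketch of the idea (server names and IPs are placeholders, and these are plain TCP health checks rather than real MySQL-level checks) would look something like:
listen mysql-pool
    bind 127.0.0.1:3306
    mode tcp
    balance roundrobin
    option tcpka
    server db1 10.0.0.11:3306 check
    server db2 10.0.0.12:3306 check backup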
Percona XtraDB Cluster (Galera) still suffers from slave lag, and its synchronous replication is not fully synchronous. Many people mentioned issues they still had with the "synchronous replication" not really being synchronous enough in practice. In addition to the regular deadlocks you get with InnoDB, Galera introduces cluster-wide deadlocks.
Systems Tuning...
http://www.brendangregg.com/USEmethod/use-linux.html
http://www.slideshare.net/brendangregg
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
http://www.brendangregg.com/linuxperf.html
I learned about flame graphs during one of the systems tuning sessions...don't they look awesome?
http://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs
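If you want to try them yourself, the basic recipe (assuming perf is installed and you have cloned Brendan Gregg's FlameGraph repo and are running from inside it) is roughly:
# sample all CPUs at 99 Hz with stack traces for 30 seconds
perf record -F 99 -a -g -- sleep 30
# fold the stacks and render the SVG
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu_flamegraph.svg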
Free tools to try out:
DBSeer - https://github.com/barzan/dbseer (and https://github.com/dongyoungy/dbseer_middleware)
http://www.agildata.com/gibbs-mysql-scalability-advisor/
I met a guy from Salesforce who had also written a database schema compare tool: https://github.com/michaelxwang/mysql. He said he was equally frustrated by the lack of quality open source tools for schema compares.
I asked a MySQL consultant who does a lot of work on database schema management, and he said the solution he has implemented most widely is Flyway (https://flywaydb.org/). It uses migrations and isn't a "state"-based tool.
Also, SolarWinds finally made a 12-step tuning infographic for MySQL (they already had one for Oracle and SQL Server):
http://www.solarwinds.com/assets/infographics/dpa-mysql-12-steps.aspx
I asked several people what they set their wait_timeout value to, and they all used less than 10 seconds.
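For comparison, the MySQL default is 28800 seconds (8 hours). Checking and lowering it is simple, though a value this low assumes your applications pool and reuse connections sensibly, and SET GLOBAL only affects new connections:
SHOW GLOBAL VARIABLES LIKE 'wait_timeout';
SET GLOBAL wait_timeout = 10;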
Friday, April 15, 2016
Clearing out log files
I frequently need to truncate log files. Instead of deleting the file, re-creating it and then re-setting permissions, here are two ways I've learned to leave the file "as is" but remove all of its data.
Method 1: Truncate the size to zero
sudo -u mysql truncate -s 0 file_name
Method 2: Write "nothing" to the file
sudo -u mysql sh -c 'cat /dev/null > file_name'
Method 1 is generally preferred because it requires fewer permissions. (Note that in method 2 the redirection has to happen inside the sh -c; a plain sudo -u mysql cat /dev/null > file_name would perform the redirect as your own user, not as mysql.)
Dirty reads - non-locking SELECTs and mysqldump
There is a MySQL instance at work which is the bane of my work-life existence. We put all of our new clients on it (an architectural decision made long before I joined the company), so it has ended up with thousands of tiny sharded databases. Each of those databases has hundreds of tables, even though they are small. It is running an old version of MySQL (we are in the process of getting ready to upgrade). I have a number of scripts that collect information from our databases and also pull data from the information schema. Whenever my scripts or monitoring tools hit this particular instance, things get problematic. Queries against the information schema become extremely slow, some taking 20-30 minutes whereas they run very fast on other servers. Monitoring tools start to trip over themselves. Soon Nagios starts alerting about slave lag.
The moment I run a query against the information schema on this particular MySQL instance with 2,000 databases (around 1 million tables on the server), the monitoring team immediately complains that slave lag alerts are showing up on their dashboard. My scripts don't need locking, but the SELECT statements implicitly take locks. To stop this I've tried using dirty reads in the scripts, but it still doesn't help. I'm guessing they don't work on the information_schema because those tables are not InnoDB, and the transaction isolation levels only affect InnoDB tables.
"Dirty reads" is what I call non-locking SELECTs. Sometimes it is acceptable to SELECT "old" data that is in the process of being changed by another session whose updates have not yet been committed. The data pages that have not been committed yet are called "dirty pages". Performing non-locking SELECTs is sometimes called doing "dirty reads", and it has the advantage of being faster because it ignores the locking being performed by other DML (deletes, updates, inserts) and SELECTs.
The disadvantage is that you don't get the most recent data, and might only get partially written data, violating the "consistency" of the data. Dirty reads can be done at the session level or the transaction level. This is how it is done:
Session Level
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;
-- run your queries
SELECT column_1,column_2 FROM TABLE_NAME ;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
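A quick way to double-check that the change took effect (on the 5.6/5.7 versions I work with the variable is called tx_isolation; newer releases rename it to transaction_isolation):
SELECT @@SESSION.tx_isolation ;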
Transaction Level
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;
-- run your queries
SELECT column_1,column_2 FROM TABLE_NAME ;
COMMIT ;
MySQL dump
mysqldump -u'USERNAME' -p -h'db1' --databases information_schema --skip-lock-tables > information_schema_test.sql
mysqldump -h'db1' -u'USERNAME' -p --databases database1 database2 --lock-tables=false > dump_data.sql
Wednesday, April 6, 2016
pt-stalk
Another extremely useful tool for troubleshooting and gathering forensic data on MySQL database servers is pt-stalk from the Percona Toolkit. It can run in the background, wait for a certain threshold to be crossed, and then collect useful information about what was going on with the MySQL instance and the Linux server at that moment. This helps to find sick MySQL servers.
Box.com has made a GUI tool based on pt-stalk called "RainGauge", which I also like (it hasn't seen much love in a couple of years though).
Here are a couple of examples of running pt-stalk using different variables to monitor:
pt-stalk --notify-by-email=me@me.com --function status --variable Threads_connected --threshold 300 --sleep=60 --daemonize --dest=/tmp/ptstalk/ --user=USER --password=xxxxxxx
pt-stalk --notify-by-email=me@me.com --function status --variable Threads_running --threshold 50 --sleep=60 --daemonize --dest=/tmp/ptstalk/ --user=USER --password=xxxxxxx
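pt-stalk can also skip the stalking entirely and just collect a sample right now, which is handy when the server is already misbehaving. Something along these lines should work (options are from the pt-stalk documentation; the destination path and credentials are placeholders):
pt-stalk --no-stalk --iterations=1 --dest=/tmp/ptstalk/ --user=USER --password=xxxxxxx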
Monday, April 4, 2016
InnoDB Lock wait timeout
Today I was troubleshooting an application that runs DELETEs on a table frequently. The table is used to schedule work: there are entries scheduled for the future, and when the time comes the application performs the work that was scheduled. The table was previously MyISAM, but I had it converted to InnoDB to improve throughput because the locking had become too excessive. After it was converted to InnoDB we started to get these types of errors:
MYSQL ERROR MESSAGE:
------------------
Query execution was interrupted
BAD QUERY:
------------------
db: database_server_1 -- DELETE FROM my_table
WHERE id = '200'
AND col1 = 'xxx'
AND col2 = '222'
AND col3 = 0
AND col4 = 0
MYSQL ERROR MESSAGE:
------------------
SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
BAD QUERY:
------------------
db: database_server_1 -- DELETE FROM my_table
WHERE id = '100'
AND col1 = 'xxx'
AND col2 = '123'
AND col3 = 0
AND col4 = 0
I did some profiling and watched the queries come in: the DELETEs were taking over 50 seconds. These are pretty simple DELETEs and should finish much faster. I did some searching and came upon this link on Stack Overflow:
http://stackoverflow.com/questions/6000336/how-to-debug-lock-wait-timeout-exceeded
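Before changing anything, it helps to see who is actually blocking whom. Assuming a reasonably recent MySQL with InnoDB (5.5+), this standard query against the InnoDB transaction tables in information_schema shows the waiting and blocking statements side by side:
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id ;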
I checked innodb_lock_wait_timeout on my server; I hadn't changed it, so it was still at the default of 50 seconds. Instead of increasing innodb_lock_wait_timeout to something larger, I took a look at the table. It is heavily indexed and had over 6 million rows. We had kept several years' worth of historical data in the table without ever cleaning it up, and we only needed 7 days' worth! I got permission to clean it up. The table was too large to reasonably DELETE the old records, and because of the DML queries constantly coming in, I couldn't do anything without introducing deadlocks. I got a short downtime scheduled, ran these commands, and had it cleaned up in less than 10 seconds.
CREATE TABLE IF NOT EXISTS my_table_new LIKE my_table;
LOCK TABLE my_table WRITE, my_table_new WRITE;
INSERT INTO my_table_new
SELECT * FROM my_table
WHERE `date_created` BETWEEN '2016-03-27 23:59:59' AND '2030-03-27 23:59:59';
DROP TABLE IF EXISTS my_table;
UNLOCK TABLES;
RENAME TABLE `my_table_new` TO `my_table`;
The new row count was less than 200,000. After decreasing the table row count, the DELETE queries were sub-second and finished too quickly to even show up in SHOW PROCESSLIST output.
The moral of the story is that one way to improve query performance is simply to work with smaller data sets. Archive data, shard data, delete data, or partition data so that your queries work with smaller data sets and your DML commands update smaller indexes.
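To keep a table like this from growing back to millions of rows, the follow-up is an ongoing purge job. A simple sketch, assuming the same hypothetical my_table/date_created columns and a 7-day retention window, is to delete in small batches (run from cron in a loop until zero rows are affected, so no single DELETE holds locks long enough to hit the 50-second timeout again):
DELETE FROM my_table
WHERE date_created < NOW() - INTERVAL 7 DAY
LIMIT 5000;
pt-archiver from the Percona Toolkit does essentially the same job with more safeguards if you want something off the shelf.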