Thursday, July 30, 2015

sub-select to find max values within a group

Several times recently I've needed to get the max or min value within a group of values. For example, say I want the max id for each user each day, or the max id for each user who has several entries in some kind of history table.

At first I struggled with how to get that data, but the solution turned out to be fairly simple: a sub-select with GROUP BY. Since realizing how easy this is, I've used the same approach at work quite often. It's a nice reminder of how useful GROUP BY can be. Below is a query similar to one I wrote recently using a GROUP BY in a sub-select.

This trick only works when the table has an auto-increment id. You also have to assume that, within each group, the row with the max auto-increment id is the row you are looking for.

It might result in full index scans, so it may not be the most optimal solution, but it worked for what I needed:
    
    SET @day_number = 1;
    -- latest (max id) row per user, restricted to rows created in the last @day_number days
    SELECT id
    FROM table_a
    WHERE date_created BETWEEN DATE_SUB(NOW(), INTERVAL @day_number DAY) AND NOW()
    AND value1 < (@day_number * 86400)
    AND id IN (SELECT MAX(id) FROM table_a GROUP BY user_id);

Friday, July 24, 2015

Adding new MySQL users to an AWS MySQL instance and using an SSH tunnel

This may sound really stupid, but I was creating users today on a MySQL instance hosted in AWS and couldn't figure out what hostname value to use for the new users to get them to work.

Whoever at my work set up the instance completely forgot the root password. They were managing it themselves, and eventually it got thrown over the fence to my team.

This site has a great little tutorial on resetting the root password for MySQL when you have shell access to the box:
http://www.rackspace.com/knowledge_center/article/mysql-resetting-a-lost-mysql-root-password

Normally I work with MySQL instances in our own local data center. This MySQL instance was in AWS and couldn't be accessed without tunneling in via SSH. I wasn't sure what the host should be for the MySQL users. At first I was limiting them to the IP range of my office, but after several attempts to log on to the MySQL instance over the SSH tunnel it still wasn't working. Then it dawned on me: after connecting to the system via SSH, you are localhost. So the MySQL users have to be user_name@localhost. It worked fine after I realized that!
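
For reference, here's a minimal sketch of the whole flow; the host, database, and user names are hypothetical:

    # SSH to the AWS box (hypothetical host)
    ssh ec2-user@mysql-box.example.com

    # on the box you are localhost, so connect with the local client
    mysql -u root -p

    -- create the user with host 'localhost' to match how you connect
    CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'some_password';
    GRANT ALL PRIVILEGES ON my_db.* TO 'user_name'@'localhost';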

Tuesday, July 21, 2015

12 Steps to MySQL Query Tuning

I've been a fan of SolarWinds' 12 Steps to Query Performance Tuning infographic, which they published for Oracle and SQL Server. However, they didn't publish one for MySQL. Using the same 12-step model as their infographic, I've written my own:

12 Steps to MySQL Query Tuning


Step 1 - Examine the Execution Plan

Make sure you are working with tables, not views
Make certain you know the row count
Review the access predicates (the key_len and ref columns in the EXPLAIN output)
- Is all or part of a composite index being used?
Are there full table scans? Are there full index scans?
Review the Extra column
- Do you see "Using temporary;" or "Using filesort;" (BAD)
- Do you see "Using Index" (GOOD)
Is the Possible keys column NULL?
Is the value in the Rows column high? (BAD)
Does the type column show RANGE?
- Is it having traverse large number of rows in the Index leaf nodes?
- You want to keep the scanned index range as small as possible
Examine EXPLAIN EXTENDED
SHOW WARNINGS <- run this right after EXPLAIN EXTENDED to review the SQL as it will actually be executed
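
For example, against a hypothetical table_a (EXPLAIN EXTENDED is the MySQL 5.x syntax):

    EXPLAIN EXTENDED
    SELECT id FROM table_a WHERE user_id = 42;

    -- run immediately afterwards to see the query as the optimizer rewrote it
    SHOW WARNINGS;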

Step 2 - Review the filters: JOINs, WHERE predicates, ORDER BY, GROUP BY, HAVING, LIMIT

What is the filtered row count?
If there is a LIMIT clause, is it applied before or after the entire table is accessed?
Are there functions in the ORDER BY?
Review the data types to avoid implicit conversion
- truthy/falsey filtering in the Predicates (converting columns to Boolean)
- WHERE !TIMESTAMP (ignores index)
Don't wrap indexed columns in expressions in the WHERE predicates (see the sketch after this list)
- WHERE DATE(TIMESTAMP) = '2010-10-10' (ignores index)
Should the query be using an INNER JOIN vs. LEFT JOIN?
Is the query unnecessarily filtering out duplicates? (UNION vs. UNION ALL)
Are there bind variables? Using prepared statements?
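
A sketch of the wrapped-index pitfall from above, assuming an index exists on date_created:

    -- ignores the index: the function hides the column from the optimizer
    SELECT id FROM table_a WHERE DATE(date_created) = '2010-10-10';

    -- uses the index: a plain range scan over the raw column
    SELECT id
    FROM table_a
    WHERE date_created >= '2010-10-10'
      AND date_created < '2010-10-11';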

Step 3 - Know the Selectivity of the table data

Is the data skewed resulting in queries that cannot effectively use existing indexes?
Make sure to work with the smallest possible logical set
Can additional WHERE predicates be added?
When dealing with multiple tables, know when each predicate is applied; you want the most selective predicate applied as early as possible (a quick selectivity check is sketched below)
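
A quick way to eyeball selectivity, using a hypothetical user_id column; values near 1 are highly selective, values near 0 are not:

    SELECT COUNT(DISTINCT user_id) / COUNT(*) AS selectivity
    FROM table_a;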

Step 4 - Gather Table Information

Review Table definitions
Have the stats been updated? Get current stats.
Are there triggers on the table? If so, how often do they fire? What are they doing?
What is the storage engine? Does it allow for row level locking vs. table level locking?
Make sure the data types match for the columns being JOINed between two tables, including the CHARSET and COLLATION
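
The commands I usually reach for here (table name is hypothetical):

    SHOW CREATE TABLE table_a\G         -- definition, storage engine, charset/collation
    SHOW TABLE STATUS LIKE 'table_a'\G  -- approximate row count and other stats
    SHOW TRIGGERS LIKE 'table_a';       -- any triggers firing on this table
    ANALYZE TABLE table_a;              -- refresh the table statistics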

Step 5 - Evaluate Existing Indexes

Look at Index Definitions
Are there redundant indexes?
Are there composite indexes?
Is there a primary key on the table?
Is the primary key an integer or a char?
Is the primary key unnecessarily large, as in the case of a large VARCHAR primary key?
Are there far too many secondary indexes?
Know the Selectivity of the indexes
Is the cardinality for the indexes high/low?
Could a fulltext index be useful?
Is each column that could be JOINED indexed?
Check to see if covering indexes can be created (avoid duplication and overlapping of indexes)
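
SHOW INDEX covers most of these questions at a glance, again using the hypothetical table_a:

    SHOW INDEX FROM table_a;  -- the Cardinality column is a rough guide to index selectivity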

Step 6 - Analyze Columns in the WHERE clause and SELECT

Look for SELECT * or scalar functions (the more data brought back, the less optimal certain functions may be)
Look for CASE, CAST, CONVERT
Are there sub queries? Can the sub query be converted to a JOIN? (see the sketch below)
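
A sketch of the sub query-to-JOIN rewrite, reusing the hypothetical latest-row-per-user pattern:

    -- correlated sub query: potentially re-executed for every row of table_a
    SELECT a.id
    FROM table_a a
    WHERE a.id = (SELECT MAX(b.id) FROM table_a b WHERE b.user_id = a.user_id);

    -- same result as a JOIN against a derived table
    SELECT a.id
    FROM table_a a
    JOIN (SELECT user_id, MAX(id) AS max_id
          FROM table_a
          GROUP BY user_id) m
      ON m.user_id = a.user_id AND m.max_id = a.id;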

Step 7 - Run the Query and record baseline metrics

Gather average execution times
If you are using Master-Master, Master-Slave, or some other type of cluster where each instance has the same data, test the query on each replica to see if the execution time and/or execution plan changes
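
One simple way to capture a baseline on MySQL 5.x; SQL_NO_CACHE keeps the query cache from skewing the numbers:

    SET profiling = 1;
    SELECT SQL_NO_CACHE id FROM table_a WHERE user_id = 42;
    SHOW PROFILES;  -- lists each statement above with its execution time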

Step 8 - Adjust/Tune the query

Focus on the most expensive operations first (full table scans, index scans, file sort, temporary table)
Look for high cost steps to eliminate
Consider a covering index, which includes every column that satisfies the query
In the case of ORDER BY and GROUP BY, check to see if you can eliminate "Using temporary;" or "Using filesort;" by properly indexing the table
Try to eliminate the frequent counting of rows by utilizing summary tables
Rewrite sub queries as a join
Seeks vs. scans: which is more expensive?

NOTE: Out-of-date statistics can impact performance. Issue "ANALYZE TABLE"
NOTE: Fragmented tables can impact performance. During a planned maintenance window, issue "OPTIMIZE TABLE" (both shown below)
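
A sketch of the covering index and both maintenance statements, with a hypothetical index on table_a:

    -- covering index: every column the query touches lives in the index,
    -- so EXPLAIN shows "Using index" and the table rows are never read
    ALTER TABLE table_a ADD INDEX idx_user_created_id (user_id, date_created, id);

    ANALYZE TABLE table_a;   -- refresh out-of-date statistics
    OPTIMIZE TABLE table_a;  -- defragment during a planned maintenance window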

Step 9 - Re-run the query

Record the results and compare
Only make one change at a time

Step 10 - Consider adjusting indexes

Test using index hints to see if the optimizer is choosing poorly (see below)
Look to reduce logical I/O
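
MySQL's hint syntax for that test, with a hypothetical index name:

    -- suggest an index to the optimizer
    SELECT id FROM table_a USE INDEX (idx_user_created_id) WHERE user_id = 42;

    -- or insist on it: a table scan becomes very expensive to the optimizer
    SELECT id FROM table_a FORCE INDEX (idx_user_created_id) WHERE user_id = 42;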

Step 11 - Incremental changes

Continue to re-run the query and record results after each incremental change
Have a peer review the query with you to see if you are missing something

Step 12 - Engineer out the Stupid

Abuse of wildcards?
Use of views?
Scalar Functions in the query?
Can the query be cached? Does it make sense to use the MySQL query cache or a dedicated caching layer? Consider turning off the MySQL query cache, as the effort to check the cache can severely slow down your queries.
Is the query using a cursor in a stored procedure?
Does the server have enough RAM to keep indexes in memory?
Are the data files stored on fast drives?
Are all the tables in your query using the same storage engine?
Do all your tables have Primary Keys, proper secondary indexes?
Is your table normalized/de-normalized as appropriate?
Is your MySQL server so heavily overloaded that queries are performing poorly?
Has the query been tested on large and small data sets?
Is the query being used on a sharded database where different databases have the same tables but possibly very different data sets?
Is the query doing silly things like overusing OR/UNION when an IN clause may be more appropriate?
Are you forcing a full table scan by putting a wildcard character (%) at the start of a LIKE pattern on an indexed column? (see the sketch after this list)
Is the table definition using the right data types? Abuse of CHAR(255), VARCHAR(255), BIGINT?
Can a complex query with dependent sub queries be split into multiple queries and/or use transactions?
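
Two of those in miniature, against the hypothetical table_a (the name column is assumed):

    -- leading wildcard: an index on name cannot be used, full scan
    SELECT id FROM table_a WHERE name LIKE '%smith';

    -- trailing wildcard: an index on name can satisfy this as a range
    SELECT id FROM table_a WHERE name LIKE 'smith%';

    -- IN is usually cleaner (and easier to optimize) than a chain of ORs
    SELECT id FROM table_a WHERE user_id IN (1, 2, 3);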

UPDATE April 2016:
SolarWinds finally came out with an infographic for MySQL:
http://www.solarwinds.com/assets/infographics/dpa-mysql-12-steps.aspx



Tuesday, July 14, 2015

time zone errors when replaying logs


I kept getting this annoying error on my test instances of MySQL when running pt-upgrade to replay logs I pulled from production:

    DBD::mysql::st execute failed: Unknown or incorrect time zone: 'America/Los_Angeles' [for Statement "SET time_zone = 'America/Los_Angeles'"]

    SET time_zone = 'America/Los_Angeles'


The solution was to install time zones onto the testing instance like this:

    mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql


Some of the time zones didn't install, but America/Los_Angeles did, which made my error go away.
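
A quick sanity check that the zone actually loaded; CONVERT_TZ returns NULL when the named zone is missing from the time zone tables:

    SELECT CONVERT_TZ(NOW(), 'UTC', 'America/Los_Angeles');
    -- a non-NULL result means the zone is installed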