Monday, December 4, 2017

Rewriting sub queries to use joins for better performance

For almost every client I have worked with, I have done some amount of profiling on their databases and recommend minor changes to improve performance. Almost every single time I see queries that use sub-selects when it is not necessary. When the data sets are small (in the thousands or less), sub selects performance is typically not an issue. However, when the number of rows in the tables get into the hundreds of thousands to millions, sub-select performance usually tanks. I've seen developers that treat MySQL tables like a queue (not a best practice) and use sub-selects. Performance will appear to be fine until the queue fills up and all of sudden the entire application is broken because they were not expecting 500,000 to a million rows in the table. Sometimes these poorly written sub-select queries will take 2~3 minutes and pile up on each other causing the MySQL server to be overwhelmed and grind to a halt.

A performance gain can almost always be gained by re-writing a query to use a join instead of a sub-select. Additionally, you need to make sure that the columns which are being used for the join have an index. If you read the "High Performance MySQL" book by the experts at Percona you will see this is one of their recommendations.

Here is a simple example I wrote to query a table that contains a list of database servers.

This query uses a sub-select. In the Explain plan's extra column, notice Using temporary; Using filesort and the the type column shows ALL meaning it is looking at all rows in the table:


This query uses a JOIN without the sub-select. If you know how to read MySQL explain plans, this one looks much better! The type columns changes to range meaning the query is no longer doing a full table scan and the filesort and temporary went away.


The data size on this table is pretty small (less than 2,000 rows). The query with a sub-select took about 65 ms while the query with the join took 50 ms.

Your mileage with re-writing queries to use a join will vary based on the data distribution, primary key, aggregation, indexes and such but the larger the table gets, a join will usually give you better performance than a sub-select. I've seen queries go from minutes to seconds or from 5 sec to 2 sec after re-writing them use a join. 


No comments:

Post a Comment