I thought I would take the new subquery optimizations in MySQL 5.6 for a spin today, using the world sample database provided by MySQL for certification and training.
Typical IN subquery
This is a very typical query developers run, which historically has performed very poorly on MySQL:
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: City
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 3984
Extra: Using where
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: Country
type: unique_subquery
possible_keys: PRIMARY
key: PRIMARY
key_len: 3
ref: func
rows: 1
Extra: Using where
2 rows in set (0.00 sec)
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Country
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 239
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: City
type: ref
possible_keys: CountryCode
key: CountryCode
key_len: 3
ref: world.Country.Code
rows: 1
Extra: NULL
2 rows in set (0.00 sec)
Notice that in MySQL 5.6 – the very first table accessed is Country instead of City. MySQL 5.5 was not able to recognize this as a constant, and instead executed this as a DEPENDENT SUBQUERY (aka Correlated subquery) for each row it found in the city table (an estimated 3984 rows)!
MySQL 5.6 still has a table scan on Country, but I can address that with an index on Country.name:
mysql5.5.31 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: City
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 3984
Extra: Using where
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: Country
type: unique_subquery
possible_keys: PRIMARY,Name
key: PRIMARY
key_len: 3
ref: func
rows: 1
Extra: Using where
2 rows in set (0.00 sec)
mysql5.6.11 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Country
type: ref
possible_keys: PRIMARY,Name
key: Name
key_len: 52
ref: const
rows: 1
Extra: Using where; Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: City
type: ref
possible_keys: CountryCode
key: CountryCode
key_len: 3
ref: world.Country.Code
rows: 1
Extra: NULL
2 rows in set (0.00 sec)
The index doesn’t affect MySQL 5.5 – which still executes as a DEPENDENT SUBQUERY, but take a look at MySQL 5.6 – 1 row from the Country table (from an index!) and then 1 row from the City table. This optimizes great!
More complex IN example
In this example I thought I would try to find all cities in the country with the largest population. My first attempt was to see if I could now use a LIMIT in a subquery. It looks like I’ll have to wait a bit longer:
mysql5.5.31 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
mysql5.6.11 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
So here is my second attempt:
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: City
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 3984
Extra: Using where
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: country
type: unique_subquery
possible_keys: PRIMARY
key: PRIMARY
key_len: 3
ref: func
rows: 1
Extra: Using where
*************************** 3. row ***************************
id: 3
select_type: SUBQUERY
table: Country
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 219
Extra:
3 rows in set (0.00 sec)
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: country
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 239
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
table: City
type: ref
possible_keys: CountryCode
key: CountryCode
key_len: 3
ref: world.country.Code
rows: 1
Extra: NULL
*************************** 3. row ***************************
id: 3
select_type: SUBQUERY
table: Country
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 239
Extra: NULL
3 rows in set (0.00 sec)
MySQL 5.5 could always optimize the population = scalar subquery, but not the IN subquery. Similar to the above example, I would expect the subqueries here should be unraveled as constants as well. If I add an index on population you can really see this happen:
mysql5.5.31 > ALTER TABLE Country ADD INDEX (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: City
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 3984
Extra: Using where
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: country
type: unique_subquery
possible_keys: PRIMARY,Population
key: PRIMARY
key_len: 3
ref: func
rows: 1
Extra: Using where
*************************** 3. row ***************************
id: 3
select_type: SUBQUERY
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Select tables optimized away
3 rows in set (0.00 sec)
mysql5.6.11 > ALTER TABLE country add index (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql5.6.11 > EXPLAIN select * from City WHERE CountryCode IN
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: country
type: ref
possible_keys: PRIMARY,Population
key: Population
key_len: 4
ref: const
rows: 1
Extra: Using where; Using index
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
table: City
type: ref
possible_keys: CountryCode
key: CountryCode
key_len: 3
ref: world.country.Code
rows: 1
Extra: NULL
*************************** 3. row ***************************
id: 3
select_type: SUBQUERY
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Select tables optimized away
3 rows in set (0.00 sec)
This is looking really good in MySQL 5.6. I had a bit of a huh? moment when trying to read what the Select tables optimized away
step #3 meant. This led me to try using EXPLAIN EXTENDED where I discovered a little gem:
mysql5.5.31 > SHOW WARNINGS\G
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`city` where
<in_optimizer>(`world`.`city`.`CountryCode`,
<exists>(<primary_index_lookup>(<cache>(`world`.`city`.`CountryCode`) in
country on PRIMARY where ((`world`.`country`.`Population` =
(select max(`world`.`country`.`Population`) from `world`.`country`)) and
(<cache>(`world`.`city`.`CountryCode`) = `world`.`country`.`Code`)))))
1 row in set (0.00 sec)
mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: /* select#1 */ select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`country` join `world`.`city`
where ((`world`.`city`.`CountryCode` = `world`.`country`.`Code`)
and (`world`.`country`.`Population` = (/* select#3 */
select max(`world`.`country`.`Population`) from `world`.`country`)))
1 row in set (0.00 sec)
EXPLAIN EXTENDED
writes the approximate query that MySQL is going to execute after the optimizer has applied any optimizations or transformations. This has been enhanced in MySQL 5.6 to add a comment for each step in the query execution (IDs match up to those in EXPLAIN). So if it was ever unclear, it is clearly this portion that is being optimized away:
mysql5.5.31 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Select tables optimized away
1 row in set (0.00 sec)
mysql5.6.11 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Select tables optimized away
1 row in set (0.00 sec)
I believe what’s happening here, is during optimization MySQL opens the index population and looks at the last value (very cheap on a B-tree). So it kind of cheats and does some of the work before it has to. I’ve seen it do this before, here is a more common example of this cheating happening:
mysql5.5.31 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: City
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
filtered: 100.00
Extra:
1 row in set, 1 warning (0.00 sec)
mysql5.5.31 > show warnings\G
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS `CountryCode`,
'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city` where 1
1 row in set (0.00 sec)
mysql5.6.11 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: City
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
filtered: 100.00
Extra: NULL
1 row in set, 1 warning (0.00 sec)
mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: /* select#1 */ select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS
`CountryCode`,'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city`
WHERE 1
1 row in set (0.00 sec)
Anyway, back to my original query. With the nesting of my IN queries I sometimes find it difficult to read the output of EXPLAIN and understand the order of execution. MySQL 5.6 also has FORMAT=JSON
, which looks much nicer to me and it includes more information:
mysql5.6.11 > EXPLAIN format=json select * from City WHERE CountryCode IN
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
EXPLAIN: {
"query_block": {
"select_id": 1,
"nested_loop": [
{
"table": {
"table_name": "country",
"access_type": "ref",
"possible_keys": [
"PRIMARY",
"Population"
],
"key": "Population",
"used_key_parts": [
"Population"
],
"key_length": "4",
"ref": [
"const"
],
"rows": 1,
"filtered": 100,
"using_index": true,
"attached_condition": "(`world`.`country`.`Population` = (/* select#3 */
select max(`world`.`country`.`Population`) from `world`.`country`))",
"attached_subqueries": [
{
"dependent": false,
"cacheable": true,
"query_block": {
"select_id": 3,
"table": {
"message": "Select tables optimized away"
}
}
}
]
}
},
{
"table": {
"table_name": "City",
"access_type": "ref",
"possible_keys": [
"CountryCode"
],
"key": "CountryCode",
"used_key_parts": [
"CountryCode"
],
"key_length": "3",
"ref": [
"world.country.Code"
],
"rows": 1,
"filtered": 100
}
}
]
}
}
1 row in set, 1 warning (0.00 sec)
While it’s possible that these queries could have been rewritten to be efficient joins, I really like seeing query optimizations being introduced to eliminate common paper cuts. Improving diagnostic features doesn’t hurt either 😉 I’m really looking forward to what tools can be built to take advantage of the JSON explain output.