I accidentally stumbled upon this Stack Overflow question this morning:
I am wondering if there is any difference in regards to performance between the following:
SELECT ... FROM ... WHERE someFIELD IN(1,2,3,4); SELECT ... FROM ... WHERE someFIELD between 0 AND 5; SELECT ... FROM ... WHERE someFIELD = 1 OR someFIELD = 2 OR someFIELD = 3 ...;
It is an interesting question because there was no good way to answer it when it was asked in 2009. All of the queries resolve to the same output in EXPLAIN. Here is an example using the sakila schema:
mysql> EXPLAIN SELECT * FROM film WHERE film_id BETWEEN 1 AND 5\G mysql> EXPLAIN SELECT * FROM film WHERE film_id IN (1,2,3,4,5)\G mysql> EXPLAIN SELECT * FROM film WHERE film_id =1 or film_id=2 or film_id=3 or film_id=4 or film_id=5\G ********* 1. row ********* id: 1 select_type: SIMPLE table: film partitions: NULL type: range possible_keys: PRIMARY key: PRIMARY key_len: 2 ref: NULL rows: 5 filtered: 100.00 Extra: Using where
Times have changed though. There are now a couple of useful features to show the difference 🙂
Optimizer Trace
Optimizer trace is a new diagnostic tool introduced in MySQL 5.6 to show how the optimizer is working internally. It is similar to EXPLAIN, with a few notable differences:
- It doesn’t just show the intended execution plan, it shows the alternative choices.
- You enable the optimizer trace, then you run the actual query.
- It is far more verbose in its output.
Here are the outputs for the three versions of the query:
- SELECT * FROM film WHERE film_id BETWEEN 1 AND 5
- SELECT * FROM film WHERE film_id IN (1,2,3,4,5)
- SELECT * FROM film WHERE film_id =1 or film_id=2 or film_id=3 or film_id=4 or film_id=5
What is the difference?
The optimizer trace output shows that the first query executes as one range, while the second and third execute as 5 separate single-value ranges:
"chosen_range_access_summary": { "range_access_plan": { "type": "range_scan", "index": "PRIMARY", "rows": 5, "ranges": [ "1 <= film_id <= 1", "2 <= film_id <= 2", "3 <= film_id <= 3", "4 <= film_id <= 4", "5 <= film_id <= 5" ] }, "rows_for_plan": 5, "cost_for_plan": 6.0168, "chosen": true }
This can also be confirmed with the handler counts from SHOW STATUS:
BETWEEN 1 AND 5: Handler_read_key: 1 Handler_read_next: 5 IN (1,2,3,4,5): Handler_read_key: 5 film_id =1 or film_id=2 or film_id=3 or film_id=4 or film_id=5: Handler_read_key: 5
So I would say that BETWEEN 1 AND 5 is the cheapest query, because it finds one key and then says next, next, next until finished. The optimizer seems to agree with me. A single range access plus next five times costs 2.0168 instead of 6.0168:
"chosen_range_access_summary": { "range_access_plan": { "type": "range_scan", "index": "PRIMARY", "rows": 5, "ranges": [ "1 <= film_id <= 5" ] }, "rows_for_plan": 5, "cost_for_plan": 2.0168, "chosen": true } }
For context, a cost unit is a logical representation of approximately one random IO. It is stable to compare costs between different execution plans.
Ranges are not all equal
Perhaps a better example to demonstrate this, is the difference between these two ranges:
- SELECT * FROM film WHERE film_id BETWEEN 1 and 20
- SELECT * FROM film WHERE (film_id BETWEEN 1 and 10) or (film_id BETWEEN 911 and 920)
It's pretty obvious that the second one needs to execute in two separate ranges. EXPLAIN will not show this difference, and both queries appear the same:
********* 1. row ********* id: 1 select_type: SIMPLE table: film partitions: NULL type: range possible_keys: PRIMARY key: PRIMARY key_len: 2 ref: NULL rows: 20 filtered: 100.00 Extra: Using where
Two distinct separate ranges may be two separate pages, and thus have different cache efficiency on the buffer pool. It should be possible to distinguish between the two.
EXPLAIN FORMAT=JSON
EXPLAIN FORMAT=JSON was introduced in MySQL 5.6 along with OPTIMIZER TRACE, but where it really becomes useful is MySQL 5.7. The JSON output will now include cost information (as well as showing separate ranges as attached_condition):
********* 1. row ********* EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "10.04" }, "table": { "table_name": "film", "access_type": "range", "possible_keys": [ "PRIMARY" ], "key": "PRIMARY", "used_key_parts": [ "film_id" ], "key_length": "2", "rows_examined_per_scan": 20, "rows_produced_per_join": 20, "filtered": "100.00", "cost_info": { "read_cost": "6.04", "eval_cost": "4.00", "prefix_cost": "10.04", "data_read_per_join": "15K" }, "used_columns": [ "film_id", "title", "description", "release_year", "language_id", "original_language_id", "rental_duration", "rental_rate", "length", "replacement_cost", "rating", "special_features", "last_update" ], "attached_condition": "((`film`.`film_id` between 1 and 10) or (`film`.`film_id` between 911 and 920))" } } }
With the FORMAT=JSON output also showing cost, we can see that two ranges costs 10.04, versus one big range costing 9.04 (not shown). These queries are not identical in cost even though they are in EXPLAIN output.
Conclusion
I have heard many users say "joins are slow", but a broad statement like this misses magnitude. By including the cost information in EXPLAIN we get all users to speak the same language. We can now say "this join is expensive", which is a much better distinction 🙂
It is time to start using OPTIMIZER TRACE, and particularly in 5.7 ditch EXPLAIN for EXPLAIN FORMAT=JSON.