Frequently in performance talks I hear speakers talk about the importance of using the correct datatypes for storing values in columns. i.e. representing a number with an INT
or BIGINT
, storing IP addresses as INT UNSIGNED
, and VARCHAR(60)
instead of VARCHAR(255)
.
This advice is correct, but today I thought I would try my best to go into a bit more detail đŸ™‚
The Reasons
I can think of three reasons why this optimization is true:
- Using numeric data types as strings incurs some added CPU overhead performing character-set and collation work. i.e. it’s not free to make
'Montréal' = 'Montreal' = 'MONTREAL'
, but MySQL behaves this way by default. -
Using correct data types will save space. By ‘space’ usually memory-fit is more important than disk fit, as it can improve cache efficiency. Disk fit can also be important with smaller SSDs.
-
Some wire protocol and client library buffers are not variable in length. This is a little out of scope of my knowledge area, but you should expect more memory consumption with larger
VARCHAR
values – so going to just a little bit of effort judging expected length can help.
Memory Fit
I would rate reason (2) above as the most likely reason for why this optimization is correct. In fact, I would say that a large majority of optimizations made by DBAs are to try and stretch out memory fit for as long as possible. i.e. indexes prevent table scans,
and allow you to focus on just the data you need.
Lets take a look at memory fit improvements by concentrating on data types:
- Short primary keys. There is a lot of space to be saved by keeping your
PRIMARY KEY
as short as possible. InnoDB uses a clustered index, with the value of thePRIMARY KEY
also included in each secondary index as an internal row pointer. -
Smaller values for indexed columns. Indexed values are essentially duplicated because they appear in both the index and the table row, so there is an amplification created by any inefficiency.
How much of a memory fit improvement you will gain depends on how the index is accessed. For example:
SELECT DISTINCT(indexed_col) FROM my_table
requires all of the index in memory to be efficient.- Ranged access such as
SELECT * FROM my_table WHERE indexed_col BETWEEN x AND y
may also require more memory fit if the ranges are wide. However, this can also be true for non ranged access (i.e.SELECT * FROM my_table WHERE indexed_col = x
) if indexed_col is not unique and matches a number of rows. - If the index has good cardinality (i.e. a high distribution of values), and the queries executed focus on a smaller percentage of values, then memory fit may be less important.
- Smaller values for un-indexed columns. It is also important to make sure that any un-indexed columns are small, but to a lesser degree. The case where I expect it to matter the most is if your workload includes table scans, or if there is the situation where ‘hot’ rows are highly scattered amongst a large number of pages.
Small side-note: MySQL 5.6 allows you to change the InnoDB page size, which may help improve memory fit in this case too.
That’s all I can think of. What am I missing?