Fastest way to estimate rows in a table

A friend wrote to me recently with a question. He was working on a method to ship application metrics to statsd on a 1-minute interval. He had three examples of how to estimate the number of rows in a table, and he wanted to know the difference between them.

Data length/average row length

{% raw %}
The example given:

mysql> select DATA_LENGTH/AVG_ROW_LENGTH from INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'line_items';
+----------------------------+
| DATA_LENGTH/AVG_ROW_LENGTH |
+----------------------------+
|              10497541.7528 |
+----------------------------+
1 row in set (0.03 sec)

I have actually never thought of using this method! I don't think it's accurate though, since DATA_LENGTH includes deleted space as well as preallocated and overhead space. For example, InnoDB intentionally fills pages only 15/16ths full, reserving the rest for future updates. So as you can see, the number it returns is just over 10% higher than the actual number of rows (9441296).
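
If you want to see how far off this estimate is on your own table without paying for a full count, you can pull both numbers in a single query. A quick sketch, reusing the line_items table from the examples above; you may also want to filter on TABLE_SCHEMA, since TABLE_NAME alone can match tables of the same name in other databases:

mysql> SELECT TABLE_ROWS, DATA_LENGTH/AVG_ROW_LENGTH AS estimated_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'line_items';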

Table rows from Information Schema

The example given:

mysql> SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'line_items';
+------------+
| TABLE_ROWS |
+------------+
|    9800078 |
+------------+
1 row in set (0.03 sec)

This method takes a number that InnoDB provides, which in this case is accurate to within 4% of the actual number of rows (estimating slightly high here). So the real question is how efficient this already-provided count is, and where it comes from.
InnoDB samples a number of random pages in the table, and then extrapolates an estimate of the total rows for the whole table. By default this sampling is just 8 pages in MySQL 5.5 (configurable via innodb_stats_sample_pages), but it is greatly improved in MySQL 5.6, which samples 20 pages by default. The option is now called innodb_stats_persistent_sample_pages, a reference to the new persistent statistics feature!
Because the number of pages examined is fixed, this method also scales reasonably as the table grows. Pro tip: it may appear much slower as soon as the table no longer fits in memory, since reading 8 random pages can mean 8 or more random IOs.
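
If you want to check how many pages your server samples, or force InnoDB to discard its current estimate and re-sample, both are one-liners. A sketch: on 5.5 the variable is innodb_stats_sample_pages and on 5.6 it is innodb_stats_persistent_sample_pages, so the LIKE pattern below matches either:

mysql> SHOW GLOBAL VARIABLES LIKE 'innodb_stats%sample_pages';
mysql> ANALYZE TABLE line_items;

ANALYZE TABLE is cheap on InnoDB for exactly the reason described above: it only re-samples those few pages rather than scanning the whole table.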

Select count(1)

The example given:

mysql> SELECT COUNT(1) FROM line_items;
+----------+
| COUNT(1) |
+----------+
|  9441296 |
+----------+
1 row in set (2.03 sec)

This requires an index scan of the primary key. It's important to explain why, since this behavior differs from MyISAM. InnoDB supports MVCC (multi-version concurrency control), an important feature that allows concurrent access to rows without readers needing to set locks that block other users from writing. In a practical sense, this feature means that at any one point in time there will be multiple versions of a row, so the result of COUNT(1) will depend on the time your transaction started and on its isolation level.
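To make the MVCC point concrete, here is a sketch of two sessions racing. REPEATABLE READ is InnoDB's default isolation level, and the second session's INSERT is elided since the exact columns don't matter:

-- session 1: open a consistent snapshot
mysql> START TRANSACTION WITH CONSISTENT SNAPSHOT;
mysql> SELECT COUNT(1) FROM line_items;  -- counts rows as of the snapshot

-- session 2 inserts a row into line_items and commits here

-- session 1 again: the snapshot still hides session 2's commit
mysql> SELECT COUNT(1) FROM line_items;  -- same number as before
mysql> COMMIT;
mysql> SELECT COUNT(1) FROM line_items;  -- now one higher

Both transactions are given answers that are correct for their own snapshot, which is why there is no single row count for the table as a whole.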
This solution will not scale well as the number of rows in the table grows. While storage-engine development is outside of my expertise, I suspect it is unlikely that this will be improved in any future MySQL version. My reasoning is that I cannot see an easy way of maintaining multiple different pre-computed counts (one per active snapshot, as MVCC requires) without introducing new global locking or overhead, which is a big no-no for being able to scale across multiple cores/CPUs.
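To illustrate the locking problem, consider the usual application-level workaround: a counter table maintained by triggers. This is a hypothetical sketch (the table and trigger names are mine), and it reintroduces exactly the serialization described above:

mysql> CREATE TABLE line_items_count (n BIGINT NOT NULL);
mysql> INSERT INTO line_items_count VALUES (0);
mysql> CREATE TRIGGER line_items_count_ins AFTER INSERT ON line_items
FOR EACH ROW UPDATE line_items_count SET n = n + 1;
mysql> CREATE TRIGGER line_items_count_del AFTER DELETE ON line_items
FOR EACH ROW UPDATE line_items_count SET n = n - 1;

Every insert or delete now holds a row lock on the single counter row until its transaction commits, so all writers end up queueing behind each other.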
Finally, of the three solutions, this is the only one that is 100% accurate: it tells you the exact number of rows in the table, as of your transaction's snapshot.
{% endraw %}