Back of the Envelope Calculations

One of the best DBA questions I was ever asked in an interview was to estimate the required disk space for an application using a back-of-the-envelope calculation.

The idea behind a question like this is that it is a thought exercise. It tests competency better than the simple questions that can be Googled for an answer (i.e. whether or not you know the answer up front is not important. You will in 2 minutes :)) Unfortunately I have also seen it negatively affect candidates who are not native English speakers, so be aware that your mileage may vary.

Today I wanted to try to recreate the back-of-the-envelope exercise that I was asked to complete. For simplicity the question is restricted to database size and does not include RAM or IOPS estimations.

Scope of Problem

“We are designing a new system to store application users for a very demanding system. There is only one table, but it is expected to grow to 20 million records within the next 3 years. We need your help with hardware provisioning. Here is the table schema:”

CREATE TABLE users (
 id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
 first_name VARCHAR(60) NOT NULL,
 last_name VARCHAR(60) NOT NULL,
 user_name VARCHAR(30) NOT NULL,
 password BINARY(20) NOT NULL,
 email_address VARCHAR(100) NOT NULL,
 last_login_date TIMESTAMP NOT NULL,
 UNIQUE INDEX (user_name),
 INDEX (last_login_date),
 INDEX (last_name, first_name),
 INDEX (email_address)
) ENGINE=INNODB;

Assumptions you are allowed to make:
* Database is UTF-8.
* Average first and last names are 10 characters / mostly single-byte characters.
* Usernames average 8 characters in length / all single byte characters.
* Assume password is a SHA1 hash.
* Average email address is 20 characters / all single byte characters.
* Users are almost never deleted, and rarely change first_name, last_name or user_name after being created.

My Answer

The first step is to work out the size of the data types in the table and in the indexes. For the VARCHAR columns I will use the average lengths provided plus 1 byte for the length prefix, and for the other columns I can use the MySQL manual. So:

4 + 10+1 + 10+1 + 8+1 + 20 + 20+1 + 4 = 80 bytes per row.
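For anyone who wants to check the arithmetic, the same estimate can be written as a throwaway query (the numbers below are the assumed averages from the question, nothing is read from the actual schema):

SELECT
    4          -- id (INT)
  + (10 + 1)   -- first_name: avg 10 single-byte chars + 1 length byte
  + (10 + 1)   -- last_name: same assumption
  + (8 + 1)    -- user_name: avg 8 chars + 1 length byte
  + 20         -- password: BINARY(20) holding a SHA1 hash
  + (20 + 1)   -- email_address: avg 20 chars + 1 length byte
  + 4          -- last_login_date (TIMESTAMP)
  AS data_bytes_per_row;   -- returns 80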

Each secondary index entry will need the combined length of its data types plus the primary key:

             username: 8+1 + 4 = 13
      last_login_date: 4 + 4 = 8
last_name, first_name: 10+1 + 10+1 + 4 = 26
        email_address: 20+1 + 4 = 25
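Expressed the same way (again, the lengths are assumptions, not anything derived from the schema):

-- Bytes per secondary index entry: key columns plus the 4-byte primary key.
SELECT 'user_name' AS idx,                (8 + 1) + 4              AS entry_bytes
UNION ALL SELECT 'last_login_date',        4 + 4
UNION ALL SELECT 'last_name, first_name',  (10 + 1) + (10 + 1) + 4
UNION ALL SELECT 'email_address',          (20 + 1) + 4;
-- returns 13, 8, 26 and 25 respectively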

InnoDB also attaches approximately 13 bytes of metadata to each row (a transaction/version ID and a rollback pointer, plus a deleted flag). Rows and index records then fit into pages, which have additional overhead themselves. Let's try to account for page overhead with a 7% space premium (based on a 15/16ths fill factor):

              PRIMARY: (80 + 13) * 1.07 = 99.51
             username: (13 + 13) * 1.07 = 27.82
      last_login_date: (8 + 13) * 1.07 = 22.47
last_name, first_name: (26 + 13) * 1.07 = 41.73
        email_address: (25 + 13) * 1.07 = 40.66

I would consider the 7% space premium optimistic, since that is only the fill rate of newly created pages. Over time pages will split as insertions arrive out of order and as records are deleted. Secondary indexes also keep old entries around for MVCC (modifications are delete-marked and purged later), so they can 'bloat' via gaps created after modifications. I'm not sure of a good way to account for this up front (though it can be measured later, as sketched below the numbers), but I'm going to allocate another 25% on the indexes that I expect this will happen to:

              PRIMARY: 99.51 * 1 ~= 100 # Not affected
             username: 27.82 * 1.25 ~= 35
      last_login_date: 22.47 * 1.25 ~= 28
last_name, first_name: 41.73 * 1.25 ~= 52
        email_address: 40.66 * 1.25 ~= 51
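In practice, the only reliable way I know to see how much this bloat really costs is to measure the live index sizes once data is flowing. A sketch using InnoDB's persistent statistics (assumes MySQL 5.6+ with the default settings, and a hypothetical schema name of appdb):

-- stat_value for stat_name = 'size' is the number of pages in each index.
SELECT index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024) AS approx_size_mb
  FROM mysql.innodb_index_stats
 WHERE database_name = 'appdb'   -- hypothetical schema name
   AND table_name    = 'users'
   AND stat_name     = 'size';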

So the final number I arrived at per row is:

100 + 35 + 28 + 52 + 51 = 266 bytes

Multiplied by 20 million rows, this equals 4.95GB ~= 5GB.
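Later on, this estimate can be compared against what the server itself reports. A rough check via information_schema (again using the hypothetical schema name appdb; note that the figures InnoDB reports here are themselves estimates):

SELECT ROUND(data_length  / 1024 / 1024 / 1024, 2) AS clustered_index_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 2) AS secondary_indexes_gb,
       ROUND(data_free    / 1024 / 1024 / 1024, 2) AS free_space_gb
  FROM information_schema.TABLES
 WHERE table_schema = 'appdb'
   AND table_name   = 'users';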

Now it’s time to factor in some global overhead that is somewhat fixed:

  • Let's assume innodb_log_file_size is 256MB * 2 = 512MB. In MySQL 5.6 you can now make this much larger on a bigger system.
  • Data dictionary / insert buffer / doublewrite buffer. These are all going to be very small, since I would expect very little need for the insert buffer on a 5GB table. Call it 12MB.
  • UNDO information is probably going to be small for this table, since I would typically expect transactions to be short and to modify 1 user at a time. I am not sure how to estimate this exactly, so I am going to budget 500MB.
  • I also have no way to estimate the binary logs, so I will budget 5GB for them (the default is 1GB per file before rotation, which should be plenty of space; my largest concern is how frequently the last_login_date column is updated and how much write traffic that generates).

Which equals 6GB of overhead.
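On a running server most of these 'fixed' costs do not need to be guessed at; they can be read back directly. A sketch for MySQL 5.6 (assuming the binary log is enabled; variable names can differ between versions):

-- Redo log footprint and binary log rotation size, as configured:
SELECT @@innodb_log_file_size * @@innodb_log_files_in_group AS redo_log_bytes,
       @@max_binlog_size                                    AS binlog_rotate_bytes;

-- The binary logs actually on disk right now:
SHOW BINARY LOGS;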

The last thing to plan for is free space, so we will be able to run an ALTER TABLE operation and not run out of disk space 😉

Typically I like to provision 3x the total database size, so I will always be able to stage a backup locally if I need to, still have some wiggle room, and not trigger Nagios free-space alarms. With a database the size of this system, the 3x rule should be practical. When 3x is not practical to provision for, I will normally settle for 2x globally and 3x of the largest table (in this particular case there is only one table in the whole system).

So with that math, my final recommendation would be:

5GB + 6GB = 11GB; 11GB * 3 = 33GB
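Or, written as a formula you can tweak (the 5 and 6 are my data and overhead estimates from above, and 3 is the safety factor):

SET @data_gb := 5, @overhead_gb := 6, @safety_factor := 3;
SELECT (@data_gb + @overhead_gb) * @safety_factor AS provision_gb;   -- 33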

Did I pass the test?
How would you answer the question?

If someone writes an answer that you like, please give them an upvote in the comments!


  • Gavin Towey

    Great answer, though I think the only other thing to add is the practical side of the hardware buying process.

    If the maximum size you’ll likely need *after* 3 years is 33G, then you could actually stand to start out with much *less* space available. Buying a new server after 3 years is not unreasonable.

    There’s also the issue that provisioning a server with several hundred GB of fast RAID 10 drives, or even fast SSDs, isn’t very expensive, so you’re likely going to buy something which has far more space available than that.

    Unless of course the QPS needed to support that dataset is especially low and you want to save money by not having a dedicated DB server for this.