MySQL is one of the most used databases in the current world. There has been a lot of research and thought gone into making the indexes in MySQL + InnoDB performant keeping the space complexity of these indexes in the same ballpark.
In the context of this blog post, we will be going through the different index mechanisms provided by MySQL with InnoDB as the storage engine. We will also be understanding the best practices associated with these indexes so that we are able to take advantage of these indexes.
Let’s start by creating a table for storing the name of the students.
- ID has been declared as PRIMARY KEY
- EMAIL has been declared as a UNIQUE KEY
mysql> CREATE TABLE students (id INT PRIMARY KEY, name VARCHAR(20), email VARCHAR(20) UNIQUE, age INTEGER);
MySQL maintains different indexes for the PRIMARY KEY id and UNIQUE KEY email. For every entry which we are going to insert into this table, MySQL server will add the entry for the record in both of these indexes. Further down in the blog, we will talk about the data structures which the InnoDB uses to implement these indexes.
Writes/Inserts requests issued by the client are added by the MySQL to the clustered index.
A Clustered index is essentially a B+Tree which contains the actual records of the table on the leaf nodes. Any record which is added to the table is actually stored in the leaf nodes of the clustered index.
From MySQL Official documentation
When you define a
PRIMARY KEYon your table,
InnoDBuses it as the clustered index. Define a primary key for each table that you create. If there is no logical unique and non-null column or set of columns, add a new auto-increment column, whose values are filled in automatically.
If you do not define a
PRIMARY KEYfor your table, MySQL locates the first
UNIQUEindex where all the key columns are
InnoDBuses it as the clustered index.
If the table has no
PRIMARY KEYor suitable
InnoDBinternally generates a hidden clustered index named
GEN_CLUST_INDEXon a synthetic column containing row ID values. The rows are ordered by the ID that
InnoDBassigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
Note: It is often advised that one should create clustered indexes on columns whose values are monotonically increasing with respect to time.
Why is that so ???
Because if we have a column whose values are not monotonically increasing with respect to time, then the data/entries for those respective values might be inserted randomly across this whole clustered index. This then leads to data fragmentation and hence increased disk usage along with bad query performance.
Suppose we have a clustered index on AGE, then the new entries which might get inserted in the leaf pages of the Btree will be distributed all across the index. Hence bad write performance and increased disk usage. As you can see in this below graph:
All other indexes apart from this clustered index in the table which
If we create an index on the column AGE, then MySQL creates a BTree out of the values of the AGE column.
- Index nodes of the B-tree contain the range of the AGE column value of the records underneath.
- Leaf nodes of the B-tree contain the actual AGE of each and every record as well as the ID of the record i.e. Primary key of the Record.
With this current indexes on ID and secondary index on the AGE column, lets understand how would the inserts and the query flow would look like.
- Records are always inserted in the clustered index. Firstly we figure out the leaf nodes in which the record should go into. After figuring out the page/leaf node, we insert the record in that page.
- There might be some spillover of the existing records of the leaf nodes to a new page/leaf nodes.
- Due to some random entries getting inserted in already filled leaf nodes, we might experience fragmentation of the leaf nodes and hence some increased disk usage.
- After record is inserted in the clustered index, its corresponding entry is made in the secondary index <ID = 100, Age = 28>. This is again done in a similar fashion. We firstly figure out the leaf node by traversing through the BTree looking where could be the entry for “Age = 28” would land and then insert this entry <ID = 100, Age = 28>.
- Note: ID is the primary key of the table on which clustered index is created. So along with the AGE, we store the ID of the record in the secondary index of the table so that for a record in the secondary index <AGE, ID>, we can figure out the full original record from the clustered index with the ID column.
Query Semantics are a bit non trivial because MySQL does a lot of optimizations and rewriting of the query to make sure we are able to execute the query in the most efficient way possible. But lets for the scope of this article take simple cases and understand how does the query behaviour works with MySQL with given index structure.
SELECT * from students where age > 20
In this query, MySQL understands that we do have a secondary index on the age column. Hence MySQL firstly traverses through this secondary index i.e. AGE column to figure out the leaf nodes which contain primary keys of the applicable records. After getting those values of the primary keys for those applicable records, it then issues another query on the clustered index i.e. Primary Key ID with those particular keys and retrieves those records from the clustered index.
Essentially it took us 2 BTree traversals
- One traversal of the secondary Index
- One traversal of the primary Index
to get us to the records which would be applicable for the query
Secondary Index is a powerful tool provided by MySQL to ensure better query performance over some of the columns at the expense of some performance hit while inserting those records.
However, we need to be really careful while creating those secondary indexes. In case of a large number of secondary indexes, they might end up increasing the insertion latencies by a huge margin. So until and unless one is really sure that he wants an enhanced performance on one of the columns, one shouldn’t be creating the secondary index.
Creating Single Secondary Index over Multiple Columns
MySQL provides creating secondary indexes over a set of columns. So instead of having 3 different indexes for 3 different columns, we can just have a single index for all of these columns saving us insertion latencies as well as disk + cache space.
Let’s take for example we want to have enhanced performance over these two columns
So we have two options for getting enhanced performance over these 2 columns.
- Creating separate secondary indexes for both of these columns
- Insertion Latencies would increase by a huge margin, as now we are inserting every record into 3 indexes, so hence 3 Btree traversals
- Increased Disk + OS Cache Usage
- We can query on either Age or Email or both and we will get enhanced performance.
- Creating single combined secondary index <Age, Email> for both of these columns.
- Insertions Latencies will increase but would be lower in comparison to the first approach.
- Lower Disk + OS Cache Usage when compared to the first approach
- However, we can query only on <Age, Email> or <Age> to ensure query hits the secondary index. In the case of only <Email>, the query does not hit the secondary index and hence we have to do a full table scan 🙁
So it is really advised that unless you really need to query on <Email> column separately i.e. without the <Age> column, we shouldn’t be creating a secondary index on <Email> column and instead only create one on <Age, Email>.
SELECT * FROM students WHERE EMAIL = "firstname.lastname@example.org"
However if there is a strong need to query on <Email> column, there is one more way in which we can make sure query hits the secondary index <Age, Email> instead of creating an entirely new secondary index on <Email>. For this, you just need to make sure to use the keyword “IN” in your MySQL query for <Age> Column. By specifying the explicit values for the <Age> column via IN keyword in where clause and along with the <Email> clause, we make sure that this query hits the secondary index.
SELECT * FROM students WHERE EMAIL = "email@example.com" AND AGE IN (19, 20, 21, 22, 23, 24, 25, 26)
But please don’t think that we can use this trick always. This is because in our case the cardinality of the <AGE> column is not that much. We already know that all the records in the table have <AGE> from 19 to 26 and hence this optimization would work. But in case of large cardinality columns, this optimization would not often result in better results and might even degrade the performance.
One more important thing to note is that we were able to solve all the use cases for <Age>, <Email>, <Age, Email> via a single Secondary index on <Age, Email>. Had it been <Email, Age>, we might not have been able to solve for all use cases i.e. <Age> use case. This is because of the really high cardinality of the email column. Hence ordering of the columns in the secondary index is of utmost importance.
Now lets try to generalise this ordering of columns for a secondary index
Given 3 Columns <C1, C2, C3>, we need to create the secondary index on all of these columns and all of the use cases are important. Assuming C1 is lower cardinal when compared to C2 and C2 is lower cardinal when compared to C3, I would suggest having the secondary index order as <C1, C2, C3>. In this indexing scheme, we can query on <C2> as well <C3> by explicitly specifying the values for <C1> or <C1, C2> respectively assuming the number of distinct values for <C1> or <C1, C2> are on the lower side.