How To Index For Joins With MySQL

From time to time I see people asking "What do I index to join these tables efficiently?" Often, someone just gives them an answer without really explaining the basic theory behind how to index for joins. The purpose of this document is to describe the basic theory behind indexing for joins with MySQL, starting first with a very simple example that demonstrates the basic principle of MySQL joins, and then applying the principle to a more complex 4 table query.

A Simple 3 Table Query

I want to make the test data as painlessly simple as possible so we can focus on the theory instead of trying to remember which table is which. Therefore, consider 3 tables: tblA, tblB, tblC. Each table has 3 columns: col1, col2, col3 (no, it's not normalized). It doesn't matter for now what the column types are, what the tables represent, or what kind of data this poor schema stores.

With no indexes, lets join all 3 tables as such:
   SELECT
      *
   FROM
      tblA,
      tblB,
      tblC
   WHERE
          tblA.col1 = tblB.col1
      AND tblA.col2 = tblC.col1;
   
And EXPLAIN for the query:
   +-------+------+---------------+------+---------+------+------+-------------+
   | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
   +-------+------+---------------+------+---------+------+------+-------------+
   | tblA  | ALL  | NULL          | NULL |    NULL | NULL | 1000 |             |
   | tblB  | ALL  | NULL          | NULL |    NULL | NULL | 1000 | Using where |
   | tblC  | ALL  | NULL          | NULL |    NULL | NULL | 1000 | Using where |
   +-------+------+---------------+------+---------+------+------+-------------+
   
And finally, from the MySQL manual (7.2.1):
The tables are listed in the output in the order that MySQL would read them while processing the query. MySQL resolves all joins using a single-sweep multi-join method. This means that MySQL reads a row from the first table, then finds a matching row in the second table, then in the third table, and so on. When all tables are processed, MySQL outputs the selected columns and backtracks through the table list until a table is found for which there are more matching rows. The next row is read from this table and the process continues with the next table.
As that section of the MySQL manual says, MySQL reads the first table (tblA), then the second (tblB), then the third (tblC), etc., as the tables are output in EXPLAIN. Values from the preceding tables are used to find rows in the current table. In our case, values from tblA are used to find rows in tblB, then values from tblB are used to find values in tblC. Once one full sweep is finished (found rows in tblA, tblB, tblC), MySQL does not go back to tblA, it goes back to tblB to see if there are any more rows that match the current value from tblA. If there are, it gets that row and then finds matching rows in tblC again. The important thing to remember here is the basic principle of MySQL joins:
Values from the preceding table (as listed in the output of EXPLAIN) are used to find rows in the current table.

From Principle To Index

Knowing MySQL wants to use values from tblA to find rows in tblB, how do we add an index to help MySQL? To help MySQL (or someone, or anything for that matter) we need to know what it needs. What MySQL needs in a join is a matter of how it is joined. That is, how you join tables is what MySQL needs (as terribly circular as that sounds). Consider the join of tblA and tblB: The two are joined by "tblA.col1 = tblB.col1" (theta-style join, though if we had done the query ANSI-style, the result would be the same). We already have the value for tblA.col1, so what MySQL needs is a value for tblB.col1 so it can finish the equation. Therefore, if MySQL needs tblB.col1, then we should index tblB.col1. Doing so, here is the new EXPLAIN for the same query:
   +-------+------+---------------+----------+---------+-----------+------+-------------+
   | table | type | possible_keys | key      | key_len | ref       | rows | Extra       |
   +-------+------+---------------+----------+---------+-----------+------+-------------+
   | tblA  | ALL  | NULL          | NULL     |    NULL | NULL      | 1000 |             |
   | tblB  | ref  | ndx_col1      | ndx_col1 |       5 | tblA.col1 |    1 | Using where |
   | tblC  | ALL  | NULL          | NULL     |    NULL | NULL      | 1000 | Using where |
   +-------+------+---------------+----------+---------+-----------+------+-------------+
   
As we see, MySQL now uses the ndx_col1 key to join tblB to tblA. That is, when MySQL goes looking for rows in tblB, instead of table scanning like it did before, it uses the value of tbbA.col1 with the ndx_col1 key to directly fetch matching rows. This is why the ref column for tblB says "tblA.col1". tblC is still table scanned, but that can be fixed the same way we fixed the join of tblA and tblB, by looking at what MySQL needs: From the part of the query that joins these tables, "tblA.col2 = tblC.col1," we see it needs tblC.col1 because we already have tblA.col2. Indexing this column, no surprise EXPLAIN now says:
   +-------+------+---------------+----------+---------+-----------+------+-------------+
   | table | type | possible_keys | key      | key_len | ref       | rows | Extra       |
   +-------+------+---------------+----------+---------+-----------+------+-------------+
   | tblA  | ALL  | NULL          | NULL     |    NULL | NULL      | 1000 |             |
   | tblB  | ref  | ndx_col1      | ndx_col1 |       5 | tblA.col1 |    1 | Using where |
   | tblC  | ref  | ndx_col1      | ndx_col1 |       5 | tblA.col2 |    1 | Using where |
   +-------+------+---------------+----------+---------+-----------+------+-------------+
   

The Difficult Part?

You'll probably never see a query like the one we've been using in the real world. You're more likely to see something like:
   SELECT
      COUNT(tblB.a_id) as correct,
      tblA.type,
      tblA.se_type
   FROM
      tblA,
      tblB,
      tblC,
      tblD
   WHERE
          tblA.ex_id = tblC.ex_id
      AND tblC.st_ex_id = tblB.st_ex_id
      AND tblB.q_num = tblA.q_num
      AND tblB.se_num = tblA.se_num
      AND tblD.ex_id = tblA.ex_id
      AND tblD.exp <> tblB.se_num
      AND tblB.ans = tblA.ans
      AND tblA.ex_id = 1001
      AND tblC.r_id = 542
   GROUP BY
      tblA.type,
      tblA.se_type;
   
At first that seems like a daunting query: 4 tables, an aggregate function, 9 WHERE conditions, and a GROUP BY. The great thing about EXPLAIN is that we can ignore all this for now, and simply approach it two tables at a time as we did before, determining at each step what MySQL needs. This is a real query actually, all tables and columns renamed to protect the identity of its origin. Before I started working it, EXPLAIN said:
   +-------+--------+---------------+---------+---------+---------------+-------+----------------------------------------------+
   | table | type   | possible_keys | key     | key_len | ref           | rows  | Extra                                        |
   +-------+--------+---------------+---------+---------+---------------+-------+----------------------------------------------+
   | tblA  | ALL    | NULL          | NULL    |    NULL | NULL          |  1080 | Using where; Using temporary; Using filesort |
   | tblB  | ALL    | NULL          | NULL    |    NULL | NULL          | 87189 | Using where                                  |
   | tblC  | eq_ref | PRIMARY       | PRIMARY |       4 | tblB.st_ex_id |     1 | Using where                                  |
   | tblD  | eq_ref | PRIMARY       | PRIMARY |       4 | tblA.ex_id    |     1 | Using where                                  |
   +-------+--------+---------------+---------+---------+---------------+-------+----------------------------------------------+
   
First a word on determining the impact of a join: Result sets. A result set is, obviously, the set of results from a query. For joins, an estimated way to figure this is to multiple the number of rows MySQL predicts it will need to read for each table. As an estimate, this is more towards the worst case end of the scale, since other WHERE conditions will almost always dramatically reduce the real number of rows the query produces. But for this query, the result set is 94 million rows. This is why indexless joins are so dangerous; a few thousand rows times a few thousand more and you're up in millions already.

Now what does this query need? Start with tblA and tblB. Find in the query where these tables are joined:
AND tblB.q_num = tblA.q_num
AND tblB.se_num = tblA.se_num
AND tblB.ans = tblA.ans
MySQL needs at least one of q_num, se_num, or ans. I chose to index se_num and q_num because among all the other queries I was looking at, these columns were most often needed. Compromise is part of optimization; most professionals don't have the time to find the absolute best indexes for every single query, instead you find the best for the most common usage patterns. As we'll see later, this still pays off tremendously as, in the example of this query, we turn its performance around as night and day. Indexing (se_num, q_num) on tblB, EXPLAIN now says:
   +-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
   | table | type   | possible_keys | key         | key_len | ref                    | rows | Extra                                        |
   +-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
   | tblA  | ALL    | NULL          | NULL        |    NULL | NULL                   | 1080 | Using where; Using temporary; Using filesort |
   | tblB  | ref    | ndx_secn_qn   | ndx_secn_qn |       2 | tblA.se_num,tblA.q_num |  641 | Using where                                  |
   | tblC  | eq_ref | PRIMARY       | PRIMARY     |       4 | tblB.st_ex_id          |    1 | Using where                                  |
   | tblD  | eq_ref | PRIMARY       | PRIMARY     |       4 | tblA.ex_id             |    1 | Using where                                  |
   +-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
   
The result set of the query has fallen 99.3% to 692,280 rows. But why stop there? We can easily get rid of tblA's table scan. Since it's the first table, we're not really indexing for a join because we just did that for the join of tblB to tblA. More often, indexing for the first table in a join is just like indexing for that table if it were all alone. In which case, you look at the query to see if the table is being restricted in any way. In this case we get lucky and tblA is: "AND tblA.ex_id = 1001". As we learned way back in Optimizing Queries Case 1: Basic Indexes, all we have to do is index column ex_id:
   +-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
   | table | type   | possible_keys | key         | key_len | ref                    | rows | Extra                                        |
   +-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
   | tblA  | ref    | ndx_ex_id     | ndx_ex_id   |       4 | const                  |    1 | Using where; Using temporary; Using filesort |
   | tblB  | ref    | ndx_secn_qn   | ndx_secn_qn |       2 | tblA.se_num,tblA.q_num |  641 | Using where                                  |
   | tblC  | eq_ref | PRIMARY       | PRIMARY     |       4 | tblB.st_ex_id          |    1 | Using where                                  |
   | tblD  | eq_ref | PRIMARY       | PRIMARY     |       4 | tblA.ex_id             |    1 | Using where                                  |
   +-------+--------+---------------+-------------+---------+------------------------+------+----------------------------------------------+
   
Now the query's result set is 641 rows! Down from 94 million, you can practically call that a 100% decrease. Studying the query further, we may be able to get rid of the temp table and filesort, but overall the query is profoundly faster, and has served the purpose of demonstrating how to index for joins well. Despite initially looking challenging, we saw that if you take it two tables at a time, and isolate and index what MySQL needs, indexing for joins isn't so difficult.

Conclusion

Making simple work of complex joins and knowing where to index for joins is a matter of realizing two things:
  1. No matter how complex the query, you only have to approach the join two tables at a time in the order the tables are listed by EXPLAIN.
  2. Values from the preceding table are already found; the work lies in helping MySQL use these values with indexes on the current table to find rows.