myesn

myEsn2E9

hi
github

MySQL Union Index

Introduction#

An index built on multiple columns is called a composite index.

For example, combining the product_no and name fields in the product table to create a composite index (product_no, name).

CREATE INDEX index_product_no_name ON product(product_no, name);

The following figure shows a B+Tree diagram of the composite index (product_no, name) (the leaf nodes are drawn as singly linked lists, but they are actually doubly linked, the image is incorrect):
image

The non-leaf nodes of the composite index use the values of two fields as the key values of the B+Tree.

When querying data using a composite index, it first compares based on product_no, and then compares based on the name column when product_no is the same. In other words, the B+Tree for composite index queries is sorted based on product_no first, and then sorted based on the name field when product_no is the same.

Therefore, when using a composite index, there is a leftmost matching rule: the index is matched in order of priority from the leftmost field, and if not followed, the composite index becomes ineffective.

For example, if a composite index (a, b, c) is created, the following query conditions can match the composite index:
- where a=1;
- where a=1 and b=2 and c=3;
- where b=2 and a=1;

It should be noted that because there is a query optimizer, the order of the a field in the where clause is not important.

However, the following conditions cannot match the composite index because they do not follow the [leftmost matching rule]:
- where b=2;
- where c=3;
- where b=2 and c=3;

The reason why these query conditions fail is because the (a, b, c) composite index is sorted by a first, then by b when a is the same, and then by c when b is the same.
Therefore, b and c are globally unordered and locally relatively ordered, so without following the leftmost matching rule, the index cannot be used.

Here is an example of a composite index (a, b) B+ Tree (the leaf nodes are drawn as singly linked lists, but they are actually doubly linked, the image is incorrect):
image

As can be seen, a is globally ordered (1, 2, 2, 3, 4, 5, 6, 7, 8), while b is globally unordered (12, 7, 8, 2, 3, 8, 10, 5, 2). Therefore, a query condition like where b = 2 cannot utilize the composite index because the key in the index needs to be ordered.

Only when a is the same, b is ordered. For example, when a is 2, the value of b is (7, 8), which is ordered. This ordered state is local. Therefore, when executing where a = 2 and b = 7, both the a and b fields can utilize the composite index, which means the composite index is effective.

Range Queries on Composite Indexes#

There are some special cases with composite indexes. Using a composite index for querying does not mean that all fields in the composite index are used for indexing. There may be cases where some fields use the composite index's B+Tree and some fields do not. This special case occurs in range queries.

The leftmost matching rule of composite indexes will continue to match to the right until encountering a "range query," at which point the matching will stop. This means that the fields in a range query can use the composite index, but the fields after the range query cannot use the composite index.

Causes of the Leftmost Matching Rule of Composite Indexes to Stop Matching in Range Queries#

Q1: In the query select * from t_table where a > 1 and b = 2, which field of the composite index (a, b) is used in the composite index's B+Tree?#

Only the a field is used in the composite index for indexing, while the b field is not used.

Within the range of secondary index records that meet the condition a > 1, the values of the b field are unordered. Therefore, it is not possible to further reduce the number of cascades that need to be scanned based on the condition b = 2.

By examining the key (index used) and key_len (number of fields used in the index) in the execution plan, you can determine the specific number of fields used by the optimizer to form the boundary conditions for the scan range.

    EXPLAIN select * from t_table where a > 1 and b = 2

Both a and b are int and not NULL, and int occupies 4 bytes (if the field allows NULL, add 1 to the number of bytes occupied by the field type, which is 5). In the figure below, key_len is 4, indicating that only the a field is used in the composite index:
image

Q2: In the query select * from t_table where a ≥ 1 and b = 2, which field of the composite index (a, b) is used in the composite index's B+Tree?#

Although the values of the b field are "unordered" within the range of secondary index records that meet the condition a ≥ 1, within the range of secondary index records that meet the condition a = 1, the values of the b field are "ordered" (because for a composite index, it is sorted based on the values of the a field first, and then sorted based on the values of the b field when the values of the a field are the same). In other words, scanning starts from the first record that meets the condition a = 1 and b = 2, without having to start scanning from the first record with a value of 1 for the a field.

Therefore, both the a and b fields are used in the composite index for indexing in the query Q2.

You can determine this by examining the key_len in the execution plan.

Q3: In the query select * from t_table where a between 2 and 8 and b = 2, which field of the composite index (a, b) is used in the composite index's B+Tree?#

In MySQL, BETWEEN includes the boundary values value1 and value2, similar to >= and <=. However, some databases do not include the boundary values value1 and value2 (similar to > and <).

Because MySQL's BETWEEN includes the boundary values value1 and value2, it is similar to the Q2 query, so both the a and b fields are used in the composite index for indexing in the Q3 query.

You can determine this by examining the key_len in the execution plan.

Q4: In the query select * from t_user where name like 'j%' and age = 22, which field of the composite index (name, age) is used in the composite index's B+Tree?#

The a field can be used for indexing in the composite index's B+Tree, and the scan range formed is ['j','k'). Note that j is a closed interval (≥j and <k). The following figure illustrates this:
image

Although the values of the b field are "unordered" within the range of secondary index records with a prefix of 'j', within the range of secondary index records that meet the condition name = j, the values of the age field are "ordered" (because for a composite index, it is sorted based on the values of the name field first, and then sorted based on the values of the age field when the values of the name field are the same).

This means that scanning starts from the first record that meets the condition name = 'j' and age = 22, without having to start scanning from the first record with name = 'j'. The right side of the following figure illustrates this:
image

Therefore, both the name and age fields are used in the composite index for indexing in the Q4 query.

You can determine this by examining the key_len in the execution plan:

  • The name field is of type varchar(30) and not NULL, and the database table uses the utf8mb4 character set, where each character occupies 4 bytes, so the name field can occupy up to 120 bytes (30*4). Since the name field is a variable-length field, an additional 2 bytes are needed to store the length value of the actual data for the field, so the key_len for name is 122.

  • The age field is of type int and not NULL, so the key_len is 4.

    TIP

    Regarding "an additional 2 bytes are needed to store the length value for variable-length fields." In a previous article (https://xiaolincoding.com/mysql/base/row_format.html), it was mentioned that "if the maximum number of bytes that a variable-length field can store is less than or equal to 255 bytes, 1 byte is used to represent the length of the variable-length field." So why is it 2 bytes here?

    The display of key_len is quite special. The row format is implemented by the InnoDB storage engine, while the execution plan is generated in the server layer, so it does not ask the InnoDB storage engine how many bytes the variable-length field occupies. Instead, it always uses 2 bytes to represent the length of the variable-length field, regardless of the actual number of bytes occupied by the field.

    After all, the purpose of key_len is only to tell you which index fields are used in the index query, not to accurately tell you how many bytes of space the field occupies.

The execution plan for the Q4 query is as follows, and you can see that the key_len is 126 bytes, with a key_len of 122 for name and a key_len of 4 for age, indicating that the optimizer used 2 fields as query conditions to form the boundary conditions for the scan range, which means both the name and age fields are used in the composite index for indexing.
image

In summary, the leftmost matching rule of composite indexes stops matching when encountering a range query. Note that the range query fields can use the composite index, but the fields after the range query cannot. Note that the ≥, ≤, between, and "like prefix match" range queries do not stop matching.

Index Pushdown#

For a composite index (a, b), when executing the select * from table where a > 1 and b = 2; statement, only the a field can be used for indexing. After finding the first primary key value (ID 2) that satisfies the condition in the composite index's B+Tree, it still needs to check if the other conditions are satisfied (i.e., b = 2). Is this check done in the composite index or does it go back to the primary key index?

  • Before MySQL 5.6, it was only possible to go back to the primary key index and find the data rows one by one, comparing the value of the b field.
  • MySQL 5.6 introduced the Index Condition Pushdown optimization, which allows filtering based on the fields contained in the composite index (columns that are part of the composite index) during the traversal of the composite index. It directly filters out records that do not satisfy the conditions (b = 2), and then performs a row-by-row lookup based on the filtered data, reducing the number of lookups.

If the execution plan of a query statement includes "Extra: Using index condition," it means that the index pushdown optimization is used.

Index Selectivity#

The order of fields when creating a composite index also has a significant impact on indexing efficiency. The fields that are closer to the front are more likely to be used for index filtering. In actual development work, when creating a composite index, it is important to place fields with high selectivity at the front so that they are more likely to be used by more SQL queries.

Selectivity is the number of different values for a column divided by the total number of rows in the table. The calculation formula is as follows:
image

For example, the selectivity of the gender field is low and it is not suitable for creating an index or placing it at the front of a composite index, while fields like UUID are more suitable for indexing or placing them at the front of a composite index.

If the selectivity of an index is very low, assuming that the values of the field are evenly distributed, then any search for a value may return half of the data. In these cases, it is better not to use an index because MySQL has a query optimizer that will ignore the index when it finds that a value appears in a high percentage (commonly around "30%") of the data rows in the table. In these cases, a full table scan is usually performed.

Sorting with Composite Indexes#

select * from order where status = 1 order by create_time asc

For the above SQL query, creating a composite index (status, create_time) can avoid MySQL performing a file sort.

When querying, if only the status index is used, but the statement also requires sorting by create_time, then a file sort (filesort) is required, which means that the execution plan of the SQL query will have "Extra: Using filesort".

Therefore, to utilize the ordered nature of the index, create a composite index on the status and create_time columns. This way, the data filtered based on the status will already be sorted by create_time, avoiding the need for a file sort and improving query efficiency.

In summary: avoid using filesort by using a covering index.

References#

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.