First, create a product table with the id as the primary key (if no primary key is specified, the first unique column without NULL values will be selected as the index key for the clustered index):
CREATE TABLE `product` (
`id` int NOT NULL,
`product_no` varchar(20) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`price` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC;
INSERT INTO product (id, product_no, name, price) VALUES
(1, 'P001', 'Product 1', 10.00),
(2, 'P002', 'Product 2', 20.00),
(3, 'P003', 'Product 3', 30.00),
(4, 'P004', 'Product 4', 40.00),
(5, 'P005', 'Product 5', 50.00),
(6, 'P006', 'Product 6', 60.00),
(7, 'P007', 'Product 7', 70.00),
(8, 'P008', 'Product 8', 80.00),
(9, 'P009', 'Product 9', 90.00),
(10, 'P010', 'Product 10', 100.00);
Row data:
The inserted row data is stored in the B+Tree index as follows:
B+Tree is a multi-way tree, where only leaf nodes store data and non-leaf nodes only store indexes, and the data in each node is stored in primary key order. The index values of each parent node will appear in the index values of the lower-level child nodes. Therefore, in the leaf nodes, all index value information is included, and each leaf node has two pointers, one pointing to the next leaf node and the other pointing to the previous leaf node, forming a doubly linked list (the leaf node in the image below is a singly linked list, but it is actually doubly linked, the image is incorrect).
Primary key index B+Tree
Process of Querying Product Data by Primary Key#
For example, the following query is executed:
select * from product where id= 5;
This statement uses the primary key index to query the product with id 5. The query process is as follows, the B+Tree will search from top to bottom:
- Compare 5 with the index data (1, 10, 20) of the root node. Since 5 is between 1 and 10, according to the search logic of B+Tree, the index data (1, 4, 7) of the second level is found.
- Search in the index data (1, 4, 7) of the second level. Since 5 is between 4 and 7, the index data (4, 5, 6) of the third level is found.
- Search in the index data (4, 5, 6) of the leaf node, and then we find the row data with the index value of 5.
Both the index and data of the database are stored on the hard disk, and we can consider reading a node as a disk I/O operation. Therefore, the entire query process above goes through 3 nodes in total, which means it performs 3 disk I/O operations.
B+Tree can store tens of millions of data with a height of only 3-4 levels, which means that querying target data from a table with tens of millions of data requires a maximum of 3-4 disk I/O operations, so compared to B-trees and binary trees, B+Tree has the advantage of high query efficiency, because even in the case of a large amount of data, the disk I/O for querying a data is still maintained at 3-4 times.
Process of Querying Product Data by Secondary Index#
The difference between the primary key index B+Tree and the secondary index B+Tree is as follows:
- The leaf nodes of the primary key index B+Tree store actual data, and all complete data records are stored in the leaf nodes of the primary key index B+Tree.
- The leaf nodes of the secondary index B+Tree store primary key values, not actual data.
Set the product_no (product code) field in the product table as a secondary index:
CREATE UNIQUE INDEX product_product_no_IDX USING BTREE ON test.product (product_no);
Its B+Tree is as follows (the leaf node in the image below is a singly linked list, but it is actually doubly linked, the image is incorrect):
The non-leaf key values in it are product_no (orange part in the image), and the leaf nodes store the primary key values (green part in the image).
Querying products by product_no secondary index, as follows:
select * from product where product_no = '0002';
It will first check the index values (product code, product_no) in the secondary index B+Tree, find the corresponding leaf node, then obtain the primary key value, and then query the corresponding leaf node in the primary key index B+Tree, and finally obtain the entire row of data. This process is called "lookup", which means that two B+Trees need to be searched to find the data. The image below (the leaf node in the image is a singly linked list, but it is actually doubly linked, the image is incorrect):
However, when the queried data (only querying the primary key) can be found in the leaf node of the secondary index B+Tree, there is no need to query the primary key index. For example, the following query statement:
select id from product where product_no = '0002';
This process, where the result can be found in the leaf node of the secondary index B+Tree, is called "covering index", which means that only one B+Tree needs to be queried to find the data.
Why does MySQL InnoDB choose B+Tree as the data structure for indexes?#
1. B+Tree vs B Tree#
B+Tree only stores data in leaf nodes, while B Tree stores data in non-leaf nodes as well, so the data volume of a single node in B+Tree is smaller, and it can query more nodes with the same number of disk I/O operations.
In addition, the leaf nodes of B+Tree are connected by a doubly linked list, which is suitable for the common range-based sequential search in MySQL, while B Tree cannot achieve this.
2. B+Tree vs Binary Tree#
For a B+Tree with N leaf nodes, its search complexity is O(logdN)
, where d represents the maximum number of child nodes allowed for a node.
In actual applications, the value of d is greater than 100, which ensures that even when the data reaches tens of millions, the height of the B+Tree is still maintained at around 3-4 levels, which means that querying data only requires 3-4 disk I/O operations.
On the other hand, a binary tree can only have 2 child nodes for each parent node, which means its search complexity is O(logN)
, which is already higher than that of B+Tree, so the number of disk I/O operations required for a binary tree to retrieve target data is higher.
3. B+Tree vs Hash#
Hash has fast efficiency for equality queries, with a search complexity of O(1).
However, Hash tables are not suitable for range queries, they are more suitable for equality queries, which is why B+Tree indexes have a wider range of applications compared to Hash table indexes.