Overview of the InnoDB Storage Engine

Reading notes on 《MySQL Technical Insider: InnoDB Storage Engine (Second Edition)》.

Buffer Pool#

For databases, the efficiency of querying and reading is crucial, and generally, databases store data on disk. However, the read and write speeds of disks are relatively slow and cannot meet the high-speed read and write requirements of databases. Therefore, storage engines typically use buffer pool technology to read a portion of the data from the disk into memory for caching, referred to as the buffer pool.

  • For read requests, the system first checks if the corresponding data is in the buffer pool. If it is, the data is read directly from memory and returned. If not, it is read from the disk, placed into memory, and then returned.
  • For write requests, the system first checks if the corresponding data is in the buffer pool. If it is, the memory data is modified and returned. If not, the corresponding data is read from the disk into memory, modified, and then returned. The storage engine will later flush the dirty data back to the disk at certain time points.

The most important parameter of the buffer pool is its size, which directly affects the performance of the database. Imagine if the buffer pool is not large enough, resulting in a low cache hit rate and frequent disk reads, which would significantly reduce efficiency. InnoDB also uses buffer pool technology, with its size configured in innodb_buffer_pool_size. The InnoDB buffer pool caches not only data but also indexes, undo buffers, change buffers, hash indexes, etc. These memory data are organized in memory in the form of pages, with the page size configured in innodb_page_size, which defaults to 16K.

InnoDB supports multiple buffer pool instances, with each page being allocated to different instances based on a hash. This is configured in innodb_buffer_pool_instances, which defaults to 1. Multiple buffer pools can reduce internal resource contention, such as lock contention, and improve concurrency performance.
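As a rough sketch (the 1 GiB value below is only an example, and exact behavior varies by MySQL version), these settings can be inspected from a client; of the three, only the buffer pool size can be changed online (MySQL 5.7 and later):

show variables like 'innodb_buffer_pool_size';
show variables like 'innodb_page_size';
show variables like 'innodb_buffer_pool_instances';

-- innodb_page_size and innodb_buffer_pool_instances are startup-only options;
-- the buffer pool itself can be resized online in MySQL 5.7+.
set global innodb_buffer_pool_size = 1073741824;  -- 1 GiB, an example value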

LRU List#

The task of the buffer pool is to read the required data pages from memory. If the data can be read from memory, it is considered a hit, and the response speed will be very fast. If it cannot be read, it is a miss, and the system needs to read from the disk, which is much slower. The most important parameter to observe the performance of a buffer pool is the cache hit rate.

To improve the cache hit rate, the LRU (Least Recently Used) algorithm is generally used. This algorithm maintains a buffer list, placing the most frequently accessed pages at the front of the list and the least used at the back. If a cache miss occurs and a new page is read from the disk, the last page in the list is discarded, and the newly read page is added to the list. The specific position of the new page in the list depends on the implementation.

InnoDB adopts this algorithm but with some optimizations. In InnoDB's LRU list, newly read pages are placed at the midpoint of the list. The midpoint is a percentage: the portion of the list after this point is called the old list, and the portion before it is called the new list. The InnoDB LRU is effectively composed of these two sublists, with the midpoint value configured in innodb_old_blocks_pct, which defaults to 37, meaning newly read pages are placed 37% from the end of the list.

Why not place them at the front? Imagine if they were placed at the front, treating this page as the most frequently used page. However, for certain SQL operations, such as index or data scans, a large amount of data may be accessed, but this data might only be used once during this scan. If such data is placed at the front every time, it could push out the truly hot data from the LRU list, causing a drop in the hit rate. However, newly read data pages could also be hot data; how can this be determined?

First, it is important to know that pages that hit in the new list will be promoted to the front of the new list. Pages that hit in the old list will also be promoted to the front of the old list.

Now, assume the newly read data page is hot data: it will be accessed frequently over a period of time, and each access promotes it toward the head of the old list, so it will not be flushed out. If instead it is temporary data, it will quickly be pushed down and likely flushed out of the old list. Thus, if a data page survives in the old list for long enough, it can be considered a hot page. InnoDB therefore adds another parameter, innodb_old_blocks_time, which indicates how long a newly read page must remain in the old list before it can be moved to the new list. Moving a page from old to new is called "page made young," while being flushed out of the old list without surviving long enough is called "page not made young."
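Both parameters are dynamic global variables, so a rough sketch of inspecting and tuning them (the values below are arbitrary examples) looks like this:

show variables like 'innodb_old_blocks%';
set global innodb_old_blocks_pct = 20;     -- shrink the old list to 20% of the LRU
set global innodb_old_blocks_time = 2000;  -- a page must survive 2 s in the old list before being made young

The "Pages made young" counters in show engine innodb status reflect how often these promotions actually happen.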

Unzip LRU#

In the buffer pool, some pages (compressed pages) are smaller than 16KB. Giving each of them a full 16KB frame in the LRU would be somewhat wasteful, so InnoDB adds an unzip_LRU list to manage these non-16KB compressed pages.

Now, suppose we need a 2KB page. The unzip_LRU first looks for a free 2KB page and uses it directly if one is found. Otherwise:

  • It looks for a 4KB page. If found, it is split into two 2KB pages to complete the allocation.
  • Failing that, it looks for an 8KB page. If found, it is split into one 4KB page and two 2KB pages to complete the allocation.
  • Failing that, it looks for a 16KB page and splits it into one 8KB page, one 4KB page, and two 2KB pages to complete the allocation.

Free List#

The previous discussion of the LRU list assumed that the LRU was full. However, in reality, when the database starts, the LRU is empty. At this time, all pages in the buffer pool are placed on the Free list. Each time a new page is read from the disk, the system first checks the Free list for any free pages. If there are free pages, one is taken out and placed into the LRU list. If not, the system will discard a page according to the situation discussed in the LRU.

Flush List#

The above two lists discussed the read situation, but normal database requests also involve many write operations. For write operations, the system first reads the data page from memory according to LRU rules, then modifies this data page. The modified data page is referred to as a dirty page, but dirty pages are not removed from the LRU; instead, they are also added to another Flush list. This way, the data modification is completed in memory, and subsequent reads of this dirty page are still handled by the LRU, ensuring that the latest data is read. Meanwhile, the storage engine will flush all data from the Flush list back to the disk through a certain mechanism. This flushing is not immediate; otherwise, a large number of disk writes would significantly affect the database's performance. Of course, if the database crashes, it may lead to dirty pages not being written back, in which case the durability of disk data is guaranteed by the implementation of transactions.

Checkpoint Mechanism#

Before discussing the Checkpoint technology, let's first address the issue of dirty pages in the Flush list that have not been flushed. In InnoDB's transaction implementation, before each transaction is committed, a redo log is written: it is first written to the in-memory redo log buffer and then flushed to disk at certain points, recording the complete set of modifications the transaction makes to the data. After that, the memory pages are modified, so when the database crashes and restarts, the data can be recovered through the redo log. Ideally, then, the database would not even need to flush dirty pages back to disk; it could simply replay the redo log on restart. However, this is only feasible if memory and disk were sufficiently large, which is unrealistic. Therefore, dirty pages still need to be flushed back to disk, and when the database crashes and restarts, it only needs to apply the redo log that has already been flushed to disk.

In addition to ordinary modifications, a dirty page that is evicted from the LRU old list must be flushed back to disk first; otherwise the next read of that page from disk would return stale data. The size of the in-memory redo log buffer is configured in innodb_log_buffer_size; if the available redo log space runs out, transaction guarantees can no longer be ensured, so dirty pages must also be flushed to disk to free up redo log space.

InnoDB uses the Checkpoint mechanism to flush dirty pages back to the disk. Checkpoints are internally divided into two types:

One is the Sharp Checkpoint, which flushes all dirty pages back to disk when the database is shut down normally. Here, normal shutdown refers to the configuration innodb_fast_shutdown being set to its default value of 1, which flushes the dirty pages in the buffer pool back to disk without further actions. If this value is 0, the database performs a full purge and change buffer merge before shutting down, referred to as a slow shutdown. If it is set to 2, only the log files are written to disk before stopping, and the consistency of table data must wait for recovery at the next startup. The exact behavior of this parameter can be found in the official documentation, and there is also a related configuration, innodb_force_recovery, that controls InnoDB recovery.

The other type is the continuously executed Fuzzy Checkpoint, which flushes a portion of dirty pages to the disk under certain conditions. These conditions include:

  • The main thread asynchronously flushes dirty pages from the buffer pool to the disk at regular intervals.
  • If a data page is evicted from the LRU list and it is a dirty page, it must be flushed back to the disk.
  • When redo log space runs low: if the used portion of the redo log files exceeds 75% of their capacity, dirty pages are flushed to disk asynchronously, because insufficient redo log space would block transaction commits; if it exceeds 90%, a synchronous flush is required.
  • If the proportion of dirty pages is too high, they must also be flushed back to disk (see the example after this list). The configuration innodb_max_dirty_pages_pct_lwm specifies a low-water-mark percentage of dirty pages relative to the buffer pool size; once this percentage is reached, pre-flushing begins (TODO: What is pre-flushing?), and when the percentage configured in innodb_max_dirty_pages_pct is reached, normal flushing occurs.
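Both dirty-page thresholds are dynamic global variables; a small sketch with example values:

show variables like 'innodb_max_dirty_pages_pct%';
set global innodb_max_dirty_pages_pct_lwm = 10;  -- start pre-flushing once 10% of the pool is dirty
set global innodb_max_dirty_pages_pct = 75;      -- flush more aggressively once 75% is reached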

Viewing Buffer Pool Status#

By checking the status of the InnoDB storage engine, you can view the running status of InnoDB. Here are several key parameters related to the buffer pool and memory, which can generally be inferred from their names. The sum of Free and LRU does not equal the total buffer pool size because pages in the buffer pool may also be allocated for other purposes, such as hash or change buffer, etc.

> show engine innodb status;

----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137363456  // Total memory allocated to InnoDB
Buffer pool size   8192                 // Number of pages in the buffer pool, size must be multiplied by page_size
Free buffers       7146                 // Number of pages in the Free list
Database pages     1042                 // Number of pages in the LRU List
Old database pages 402                  // Number of pages in the old list of the LRU
Buffer pool hit rate 1000 / 1000        // Cache hit rate

InnoDB Engine Features#

Introduction to MySQL and InnoDB#

MySQL is a commonly used relational database, and one of its distinctions from other databases is its plugin-based storage engine. According to the official statement, this plugin-based approach allows you to write your own storage engine and add it to a running MySQL instance without needing to recompile or restart.

InnoDB has been the default storage engine since MySQL 5.5.8. Its design goal is primarily OLTP, characterized by row-level locking, support for foreign keys, and read operations that by default do not take locks. Each table's data can be stored in a separate .ibd file. InnoDB achieves high concurrency through MVCC (Multiversion Concurrency Control). It implements four isolation levels, with the default being REPEATABLE READ.

InnoDB stores table data in the order of the primary key. If a primary key is not explicitly specified during definition, InnoDB automatically generates a six-byte ROWID as the primary key. If the data volume exceeds the range of this ROWID (281 trillion rows), the earliest data will be overwritten.

Change Buffer#

Typically, each table in a database has a clustered index (commonly referred to as the primary key) and may also have multiple auxiliary indexes. Suppose we need to insert a new record into the table; at this point, the index tree needs to be updated. For the case of having only one clustered index, this is straightforward because clustered indexes are generally sequentially increasing. When adding a new row, it simply needs to be written sequentially, adding the new index to the end of the index tree. However, if there are auxiliary indexes, the position for the new index must be found in the auxiliary index tree. If this index page is not in the buffer pool, random disk reads and writes will be required. For a large number of write requests, this repeated random disk I/O significantly impacts performance.

Therefore, InnoDB introduced a change buffer scheme. When inserting, updating, or deleting operations on auxiliary indexes occur, if the index page is in the buffer pool, it is modified directly in the buffer pool. If it is not in the buffer pool, it is first placed in the change buffer, waiting for other operations, such as a read request for the current index page, which will then require reading it from the disk into the buffer pool. The change buffer then merges its modifications, thus avoiding the random disk I/O required for each insert/update/delete.

Additionally, if an index page stays out of the buffer pool for a while but receives multiple modifications during that time, those operations accumulate in the change buffer for that index page and are merged in one go, requiring only a single later write back to disk.

However, the change buffer has certain limitations. The index must be an auxiliary index, and it must not be unique: if the index were unique, every insert/update would have to read the relevant index pages anyway to check for duplicates, which would make the change buffer pointless.

The change buffer is divided into three types:

  • Insert buffer indicates a buffer for an insert operation.
  • Delete buffer indicates marking a record for deletion.
  • Purge buffer indicates the actual deletion of a record.

In InnoDB, an update is accomplished through a combination of delete and purge operations. The configuration innodb_change_buffering can specify the types of change buffer buffering, with the default being all, meaning all types are buffered.

Internally, the structure of the change buffer in InnoDB is a B+ tree. This tree also has size limitations, configured in innodb_change_buffer_max_size, which indicates the maximum percentage of the change buffer relative to the total buffer pool size, with a default of 25. If the amount of change buffer reaches a certain size, it will also force the reading of index pages for merging to avoid insufficient change buffer space.
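A brief sketch of these two change buffer knobs (example values; both are dynamic global variables):

show variables like 'innodb_change_buffer%';
set global innodb_change_buffering = 'inserts';   -- buffer only insert operations
set global innodb_change_buffer_max_size = 20;    -- cap the change buffer at 20% of the buffer pool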

Double Write#

During the operation of the database, data in the buffer pool is continuously synchronized to the disk to ensure data persistence. During this process, issues may arise, such as when a 16KB data page is being written to the disk, and only 4KB has been written when the database crashes. At this point, even if there is a redo log, the page on the disk has already been partially written, and the physical log stored in the redo log cannot restore the complete page (for detailed reasons, see the following redo log section). Therefore, before using the redo log, this page needs to be backed up. If the write fails, the backup page can restore the disk page, and then the redo log can attempt recovery.

InnoDB uses doublewrite technology to solve this problem. There is a region in memory called the doublewrite buffer, which is 2MB in size, and there is also a 2MB space in the disk's shared tablespace. During the flushing of dirty pages from the buffer pool, the dirty page is first written to the disk's shared tablespace and executed with fsync, and then written to the various table data files. Thus, even if an error occurs while writing the data files, a copy of the dirty page can be retrieved from the shared tablespace. (TODO: Why is it divided into two writes of 1MB each when flushing to the shared tablespace?)

The doublewrite function can be disabled by configuring innodb_doublewrite. This feature is unnecessary on certain file systems that provide write failure protection.

Asynchronous I/O#

InnoDB uses asynchronous I/O to flush data to the disk, which prevents thread blocking, and AIO can also perform I/O merging to reduce the number of disk I/O operations. For information on AIO under Linux, see the man documentation.

You can enable or disable AIO by configuring innodb_use_native_aio.

Flushing Neighboring Pages#

When InnoDB flushes a dirty page back to the disk, it can check all pages in the area where this page is located. If there are other dirty pages, they will also be flushed to the disk, thus utilizing I/O merging to reduce the number of I/O operations. This can be toggled using the innodb_flush_neighbors configuration.
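As a quick sketch, the switches from these last three sections can be checked together; of the three, only innodb_flush_neighbors is dynamic (the other two are startup options), and disabling neighbor flushing is common on SSDs:

show variables where variable_name in
    ('innodb_doublewrite', 'innodb_use_native_aio', 'innodb_flush_neighbors');
set global innodb_flush_neighbors = 0;  -- do not flush neighboring pages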

InnoDB Worker Threads#

Having looked at some of InnoDB's runtime status, let's introduce how InnoDB operates.

Master Thread#

The master thread is the main thread of InnoDB's operation, responsible for most of the work. Its main tasks include:

  • Flushing dirty pages to the disk.
  • Flushing the redo log buffer to the log files on disk, even if the corresponding transactions have not yet been committed, to improve transaction commit speed.
  • Merging change buffers.
  • Recycling undo pages.

I/O Thread#

The I/O thread is primarily responsible for AIO request callbacks. The number of read and write threads can be changed through the configurations innodb_read_io_threads and innodb_write_io_threads.

Purge Thread#

During the execution of transactions, an undo log is recorded to roll back data in case of transaction failure. The purge thread's job is to recycle those undo pages that are no longer needed after the transaction execution. The number of this thread can be modified through the configuration innodb_purge_threads.
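For reference, these thread counts are startup options (set in the configuration file rather than with set global) and can be checked like this; the FILE I/O section of show engine innodb status also lists the state of each I/O thread:

show variables where variable_name in
    ('innodb_read_io_threads', 'innodb_write_io_threads', 'innodb_purge_threads');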

Log Files#

The logs discussed here are actually MySQL logs and are not related to InnoDB. They are included here because they are frequently needed for viewing and processing.

Error Log#

The error log records errors and warnings during the startup, operation, and shutdown processes of MySQL. It can be viewed through the log_error variable.

Slow Query Log#

The slow query log records SQL statements that exceed the configured long_query_time during execution, with a default of 10 seconds. It can be enabled through slow_query_log, which is off by default. The log's recording location is specified in the slow_query_log_file configuration.
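A minimal sketch of turning the slow query log on at runtime (the 1-second threshold is just an example; both variables are dynamic):

set global slow_query_log = ON;
set global long_query_time = 1;              -- log statements slower than 1 second
show variables like 'slow_query_log_file';   -- where the log is being written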

General Query Log#

The general query log records all database requests in the general_log_file. It can be toggled on or off through the general_log configuration. It is generally turned off, as keeping it on continuously can impact performance. It is usually only enabled during debugging.

Binary Log#

The binary log records all changes made to the database (excluding read-only statements such as select and show). It can be used for recovery and master-slave replication. This log is written in the datadir as the bin_log.xxxx files, and bin_log.index is the index file that records the names of the binary log files in use.

During the execution of an InnoDB transaction, binary log records are kept in a buffer and written to the on-disk bin log when the transaction commits. The sync_binlog configuration controls after how many binary log writes the log is synced to disk; the default of 1 means the log is synced after every write. If this parameter is set too high (or to 0), the bin log may not have been flushed to disk when a crash occurs, leading to data inconsistency between master and slave.
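A quick way to check whether binary logging is enabled, which log files currently exist, and the sync policy (show binary logs errors out if logging is disabled):

show variables where variable_name in ('log_bin', 'sync_binlog');
show binary logs;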

Storage Structure#

InnoDB stores all data in a space called the shared tablespace. The organizational units of the tablespace, from largest to smallest, are: segment, extent, and page, with each level composed of multiple subordinate units.

Table Structure File#

In MySQL, any storage engine will have a .frm text file that stores the structure of the table and the structure of user-created views.

Tablespace File#

By default, InnoDB keeps a shared tablespace for all tables. The configuration innodb_file_per_table places each table's data in its own tablespace. In that case, each per-table tablespace stores only that table's data, indexes, and change buffer bitmap pages, while the rest (such as undo logs and the change buffer itself) remains in the original shared tablespace. Per-table tablespace files use the .ibd suffix, and the default shared tablespace file is ibdata1.

Tablespace files are composed of various segments, such as data segments, index segments, and undo segments. Segments consist of multiple extents, which in turn consist of multiple pages, and the pages within an extent are contiguous. Pages are the smallest unit within InnoDB. There are many types of pages, such as data pages, undo pages, transaction data pages, etc.

Row Record Format#

InnoDB stores data in rows within data pages, which sit in the leaf nodes of a B+ tree. If a column has a variable-length type, such as text, its data may be stored in overflow pages, with the original row keeping a pointer to the overflow data. Even for fixed-length data, if a row is so large that only one row would fit in a page, the B+ tree structure can no longer be satisfied, and InnoDB will automatically move the row data into overflow pages.

Data Partitioning#

Data partitioning is a feature supported by MySQL that allows a table or index data to be physically divided into multiple parts. Currently, MySQL only supports horizontal partitioning, which distributes different rows of a table into different physical files. It also supports local partitioned indexes, meaning each physical file contains both data and indexes.

MySQL supports the following types of partitioning:

  • Range partitioning, which divides data into specified continuous row intervals, with each interval stored in a partition (1, 2, 3, 4) (5, 6, 7, 8).
  • List partitioning, similar to range, but the partitioned data can be discrete, such as (1, 3, 5, 7) (2, 4, 6, 8).
  • Hash partitioning, which distributes data into different partitions based on the return value of a user-defined expression, such as hash(id % 1000).
  • Key partitioning, which partitions data using a hash function provided by the database, requiring only the field to be hashed to be specified.

On the basis of partitioning, MySQL can further partition, referred to as sub-partitioning. MySQL allows hash or key partitioning on range and list partitions. For specific partitioning syntax, refer to the official documentation.
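As a hypothetical illustration of the syntax (the table and column names are invented): an orders table range-partitioned by year, with hash sub-partitions. Note that every unique key, including the primary key, must include the partitioning column.

create table orders (
    id         bigint not null,
    created_at date   not null,
    primary key (id, created_at)   -- must contain the partitioning column
)
partition by range (year(created_at))
subpartition by hash (to_days(created_at))
subpartitions 2 (
    partition p2022 values less than (2023),
    partition p2023 values less than (2024),
    partition pmax  values less than maxvalue
);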

Indexes#

Indexes are the key technology for improving database query performance. InnoDB supports various types of indexes: B+ tree indexes, full-text indexes, hash indexes, etc. Generally, developers use B+ tree indexes more frequently, so this discussion will focus on B+ tree indexes. B+ tree indexes can also be categorized into several types.

Clustered Index#

Each InnoDB data table has a primary key, and the clustered index is a B+ tree constructed according to the table's primary key. The leaf nodes of this tree are the data pages of the table, storing complete row data. Each data page is connected through a doubly linked list, maintaining the same order as the data stored in the table by primary key. All table data queries, except for cache hits and some special cases, will search for the corresponding data pages through the clustered index tree.

Clustered indexes also have physical data on disk, but they are not physically contiguous on disk; rather, they are logically contiguous. For example, the last item of one index node points to the offset of the next index node in the file. Otherwise, maintaining clustered indexes would be very costly.

Auxiliary Index#

Each auxiliary index is also a separate B+ tree, but the leaf nodes do not store complete row data; they only store the auxiliary index keys and the table's primary key. When querying a row through an auxiliary index, the primary key is first obtained, and then the complete row data is found through the clustered index tree. There is a special case where the queried data is within the auxiliary index keys; in this case, there is no need to look up the clustered index tree, which is also known as a "covering index."

Composite Index#

A composite index refers to indexing multiple columns on a table. Its usage is similar to that of a single-key auxiliary index. The arrangement order of the index is based on the order of the index declaration. For example, for a composite index (a, b), the data is arranged by first sorting a and then sorting b, such as (1, 1) (1, 2) (2, 1) (2, 2), and so on for more keys.

For queries with multiple key values, such as where a = xx and b = xx, a composite index can be used. Additionally, in a composite index, sorting is performed based on the first key, so for a query like where a = xx, this single key value query can also utilize the composite index.

Another usage is for queries like where a = x order by b limit n. With a single-column auxiliary index on a, the result still has to be sorted by b after the lookup. With the composite index, once the first field a is fixed to a constant, the rows are already ordered by the second field b, so the extra sort can be avoided.
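A hypothetical sketch (table, column, and index names are invented) of the three query shapes discussed above against a composite index on (a, b):

create table t (
    id bigint auto_increment primary key,
    a  int not null,
    b  int not null,
    key idx_a_b (a, b)
) engine = InnoDB;

explain select id from t where a = 1 and b = 2;            -- both columns: uses idx_a_b
explain select id from t where a = 1;                      -- the leading column alone can also use idx_a_b
explain select id from t where a = 1 order by b limit 10;  -- rows already ordered by b, no extra filesort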

Adaptive Hash Index#

InnoDB uses B+ trees for its indexes, and the number of lookups needed is related to the depth of the B+ tree. For example, in a tree of depth 3, an index lookup may require 3 page reads. If we access the same piece of data very frequently, going through 3 index lookups each time is somewhat wasteful. InnoDB therefore builds an internal adaptive hash index (AHI) to cache hot data and speed up these lookups.

For instance, if we continuously request select a from t where id = 1;, the AHI will create an index for the pattern where id = 1. When subsequent identical requests are made, there is no need to query the B+ tree index; the data page can be obtained directly from the adaptive hash index. However, hash indexes can only perform equality queries and cannot cache range queries.

There are certain conditions for establishing hash indexes, but since hash indexes are an internal optimization of InnoDB, they will not be discussed in detail here. Users can toggle the AHI on or off through the configuration innodb_adaptive_hash_index.
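The switch is a dynamic global variable, so the AHI can be inspected and turned off at runtime:

show variables like 'innodb_adaptive_hash_index';
set global innodb_adaptive_hash_index = OFF;

The INSERT BUFFER AND ADAPTIVE HASH INDEX section of show engine innodb status reports hash searches versus non-hash searches, which gives a rough idea of how much the AHI is helping.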

Locks and Transactions#

Database access is inherently concurrent, and to ensure data consistency, locking is necessary.

InnoDB has two types of locks:

  • Latch: A lightweight lock that is held for a very short time, generally used to protect thread data. Latches are further divided into two types: mutex and rwlock.
  • Lock: This type of lock applies to transactions, locking database objects such as tables and rows. The duration of these locks is longer, being held until the transaction commits or rolls back.

The focus here is primarily on transaction-related locks. In transactions, InnoDB applies row-level locks on table data.

Row Locks and Intention Locks#

InnoDB supports two types of row locks:

  • Shared Lock (S Lock): Allows a transaction to read a row of data. Shared locks are compatible with other shared locks.
  • Exclusive Lock (X Lock): Allows a transaction to delete or update a row of data. Exclusive locks are incompatible with any other locks.

Assuming there is a transaction T1 that wants to acquire data from a certain row, it needs to obtain an S lock on that row. If another transaction T2 also wants to access that row and requires an S lock, the two S locks are compatible, so there is no issue. However, if a transaction T3 wants to modify that row, it will need an X lock, which is incompatible with the S lock, causing T3 to block until T1 and T2 release their S locks.

In addition to row locks, InnoDB also provides an Intention Lock at the table level.

  • If modifications to certain rows are needed, an intention exclusive lock (IX) must be obtained on the table before acquiring the X lock on those rows, indicating that exclusive locks will be requested on certain rows.
  • If certain rows are to be read, an intention shared lock (IS) must be obtained on the table before acquiring the S lock on those rows, indicating that shared locks will be requested on certain rows.

Intention locks primarily express the type of locks that a transaction intends to acquire on certain rows in advance. They do not conflict with table-level locks and are mainly used for comparison and testing against other table locks. For example, if transaction T1 holds an X lock on a certain row of data, it must have also acquired the IX lock. If there is a transaction T2 that intends to update the entire table, it will need to acquire an X lock on the entire table. T2 must know whether any rows are currently held by an X lock. Scanning each row would be too slow, so it can query the intention locks on the table to see that there is an IX lock, indicating that some rows are held by an X lock, causing T2 to block until the IX lock is released.

The following table shows the compatibility matrix. All of the locks being compared here, including S and X, are table-level locks. For example, the incompatibility of IS and X means the table-level intention shared lock is incompatible with a table-level exclusive lock; it is not a comparison between intention locks and row-level locks.

      IS          IX          S           X
IS    Compatible  Compatible  Compatible  Conflicts
IX    Compatible  Compatible  Conflicts   Conflicts
S     Compatible  Conflicts   Compatible  Conflicts
X     Conflicts   Conflicts   Conflicts   Conflicts

Consistent Non-Locking Reads#

Consistent non-locking reads refer to InnoDB reading rows in the database through MVCC. If the current row is locked with an X lock, it can avoid blocking by reading a snapshot of the row. The read snapshot is implemented using the transaction's undo log. This method can greatly enhance the concurrency performance of the database.

In the RC and RR isolation levels, InnoDB uses consistent non-locking reads. However, in RC, the snapshot data read is the latest data at that moment, while in RR, the snapshot data is the data at the start of the transaction.

In certain cases, if users need to ensure data consistency, they can use locking methods for consistent locking reads. InnoDB supports two select locking modes:

  • select ... for update will apply an X lock on the row data.
  • select ... lock in share mode will apply an S lock on the data.

These two statements must be used within a transaction, and the locks will be automatically released when the transaction commits or rolls back.
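A two-session sketch of this behavior, using the hypothetical table t from the composite-index example above (lock in share mode is the traditional spelling; MySQL 8.0 also accepts for share):

-- session A: take an S lock on one row and keep the transaction open
begin;
select * from t where id = 1 lock in share mode;

-- session B: a second shared lock on the same row succeeds immediately,
-- but an exclusive lock blocks until session A commits or rolls back
select * from t where id = 1 lock in share mode;
select * from t where id = 1 for update;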

Auto-Increment Lock#

For auto-incrementing columns, a locking mechanism is needed to ensure data consistency under concurrent writes. Each auto-increment column has a counter in memory used to allocate new values for newly inserted rows.

InnoDB has a special auto-increment lock called AUTO-INC Locking to maintain this value. Each time a transaction adds a row, the auto-increment lock will execute a statement select max(auto_inc) from t for update to obtain the new auto-increment column value. After obtaining it, this auto-increment lock is released immediately, without waiting for the transaction to finish. However, this method still locks the table, which is inefficient under high concurrency, as transactions must wait for others to finish using the auto-increment lock. Additionally, if a transaction rolls back, the auto-increment value is discarded, leading to gaps in the auto-increment column.

InnoDB allows configuring the value of innodb_autoinc_lock_mode to control the locking strategy for auto-increment columns:

  • 0 uses the auto-increment lock, which is inefficient.
  • 1 is the default value, where for insert and replace statements, a mutex lock is applied to the counter in memory. There is no transaction lock, making it relatively faster. Other types of inserts still use the auto-increment lock.
  • 2 uses mutex locks for all inserts, which is highly efficient but may lead to non-continuous auto-increment values.

For more detailed information about auto-increment columns, refer to the official documentation.

Lock Algorithms#

There are three algorithms for row locks:

  • Record Lock: Locks a single row. For example, when we want to modify a certain row of data.
  • Gap Lock: Locks a range, excluding the record itself. For example, if the current table contains the values 1, 3, 5, and 7, and a transaction wants to add the value 4, it must lock the range (3, 5) to prevent other transactions from inserting data in between. If it also wants to insert the value 8, it needs to add a range lock on (7, +∞). The main purpose of Gap Lock is to prevent multiple transactions from inserting into the same range. Setting the isolation level to RC can disable Gap Lock.
  • Next-Key Lock: This is a combination of the previous two locks, locking both the record itself and a range. This lock is primarily used to solve the phantom read problem.

When the queried index has a unique attribute, InnoDB downgrades the Next-Key Lock to a Record Lock. For example, if a transaction locks the row where id = 7 and id is a unique index, it is sufficient to lock only that record, without locking a range.

Phantom reads occur when, within the same transaction, executing the same SQL statement twice may yield different results, with the second execution potentially returning rows that did not exist previously.

For instance, if transaction T1 first executes the statement select id from t where id > 10; and no data is returned, but simultaneously, another transaction T2 executes insert into t (id) values (11); and then T2 commits first. When T1 executes select id from t where id > 10; again, it suddenly finds data. This behavior violates transaction isolation, as one transaction can perceive the results of another.

By using Next-Key Lock, when we execute the statement select id from t where id > 10;, an X lock will be added on (10, +∞), preventing T2 from inserting values in that range. When T1 executes the same statement again, the result will remain unchanged because that range is already locked with an X lock, preventing any transaction from modifying the data within that range.

InnoDB's default transaction isolation level RR uses Next-Key Lock to protect transactions and avoid phantom reads.
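A two-session sketch of this protection under REPEATABLE READ, again using the hypothetical table t from earlier. Note that it is a locking read such as for update that actually takes the next-key lock; a plain select under RR reads from its MVCC snapshot instead:

-- session A
begin;
select id from t where id > 10 for update;  -- returns nothing, next-key locks (10, +∞)

-- session B: blocks until session A ends, so no phantom row can appear to A
insert into t (id, a, b) values (11, 0, 0);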

Using Locks to Solve Transaction Isolation Issues#

  • Dirty Read: This occurs when one transaction can read modifications made by another transaction that has not yet been committed. Dirty reads only happen under the RU isolation level. At least under the RC level, uncommitted data from transactions will not be visible to other transactions.
  • Non-Repeatable Reads/Phantom Reads: The solution has already been discussed above and will not be repeated.
  • Lost Updates: This refers to two transactions updating the same row, where the result of the first committed transaction is overwritten by the second. This is not inherently a database issue but rather a potential outcome of concurrent transactions. One solution is to use the consistent locking reads mentioned above to serialize transactions, which means using the SERIALIZABLE isolation level.

Transaction ACID and Isolation Levels#

With strong lock protection, InnoDB implements transactions that fully comply with ACID:

  • Atomicity: A transaction is an indivisible unit; all operations within a transaction must succeed for the transaction to be considered successful. If any operation fails, all previously executed operations must be rolled back.
  • Consistency: The database's constraints must not be violated before and after the transaction begins and ends. For example, unique constraints.
  • Isolation: The read and write operations of each transaction are invisible to other transactions until they are committed.
  • Durability: Once a transaction is committed, the result is permanent, even in the event of a crash.

The SQL standard defines four isolation levels for transactions:

  • READ UNCOMMITTED: Allows a transaction to read uncommitted modifications made by other transactions, leading to dirty reads.
  • READ COMMITTED: Allows a transaction to read modifications made by other transactions that have been committed, but it may encounter phantom reads.
  • REPEATABLE READ: A transaction will not see other transactions' committed modifications to rows it has already read, although under the SQL standard it may still see newly inserted committed records (phantoms). This is the default isolation level used by InnoDB, which additionally solves phantom reads through Next-Key Locks.
  • SERIALIZABLE: Every plain read in a transaction is implicitly converted into a locking read that acquires shared locks, so conflicting transactions are effectively executed serially.

Generally, the lower the isolation level, the less lock protection there is, and the shorter the duration of locks.
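Checking and changing the isolation level from a client looks like this (the variable is named transaction_isolation in recent versions and tx_isolation in older ones):

select @@transaction_isolation;
set session transaction isolation level read committed;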

InnoDB Transaction's Redo Log and Undo Log#

Before a transaction is committed, all logs must be written to the redo log file. This ensures that even in the event of a crash, recovery can be performed from the redo log, guaranteeing data durability. To ensure that the redo log can be written to the file, every write to the redo log involves an fsync. The innodb_flush_log_at_trx_commit configuration can adjust the strategy for flushing the redo log to disk. The default is 1, meaning fsync is executed every time to ensure writing. When set to 0, the redo log is not written to the file upon transaction commit. When set to 2, the redo log is written to the file system but fsync is not executed, which may lead to data loss if the log is not flushed to disk before a crash. The strategy for flushing the redo log significantly affects the speed of transaction commits.
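A small sketch of inspecting and setting this flush policy (the variable is dynamic; 1 is the safe default):

show variables like 'innodb_flush_log_at_trx_commit';
set global innodb_flush_log_at_trx_commit = 1;  -- fsync the redo log at every commit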

During the execution of a transaction, in addition to the redo log, undo logs are also generated. When a transaction rolls back, these undo logs are applied to revert the data. Undo logs are logical logs: rolling back applies the inverse logical operations rather than physically restoring pages to their state at the beginning of the transaction, because other concurrent transactions may have modified the same data in the meantime.

Transaction Control Statements#

In the default settings of the MySQL command line, transactions are automatically committed, meaning each SQL statement executed will immediately perform a COMMIT. You can disable auto-commit by using SET AUTOCOMMIT=0. You can also explicitly start a transaction using the BEGIN command. Here is a summary of transaction control statements:

  • BEGIN: Starts a transaction.
  • COMMIT: Commits the transaction.
  • ROLLBACK: Rolls back the transaction.
  • SAVEPOINT id: Creates a savepoint with the specified id.
  • RELEASE SAVEPOINT id: Deletes a savepoint.
  • ROLLBACK TO id: Rolls back the work performed after the specified savepoint; the savepoint itself is kept and the transaction remains open.
  • SET TRANSACTION: Sets the isolation level for the transaction.
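A short end-to-end sketch of these statements, using the hypothetical table t from earlier:

set autocommit = 0;
begin;
insert into t (a, b) values (1, 1);
savepoint sp1;
insert into t (a, b) values (2, 2);
rollback to sp1;   -- undoes only the second insert; the transaction stays open
commit;            -- the first insert becomes durable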

References and Citations
MySQL Technical Insider: InnoDB Storage Engine (Second Edition)
MySQL InnoDB Official Documentation
