

MDEV-12698 innodb.innodb_stats_del_mark test failure

In my merge of the MySQL fix for Oracle Bug#23333990 / WL#9513 I overlooked some subsequent revisions to the test, and I also failed to notice that the test was in fact always failing.

Oracle introduced the parameter innodb_stats_include_delete_marked but failed to consistently take it into account in FOREIGN KEY constraints that involve CASCADE or SET NULL. When innodb_stats_include_delete_marked=ON, the purge of delete-marked records should obviously update the statistics as well. One more omission was that statistics were never updated on ROLLBACK. We are fixing that as well, properly taking the parameter innodb_stats_include_delete_marked into account.

dict_stats_analyze_index_level(): Simplify an expression. (Using the ternary operator with a constant operand is unnecessary obfuscation.)

page_scan_method_t: Revert the change done by Oracle. Instead, examine srv_stats_include_delete_marked directly where it is needed.

dict_stats_update_if_needed(): Renamed from row_update_statistics_if_needed().

row_update_for_mysql_using_upd_graph(): Assert that the table statistics are initialized, as guaranteed by ha_innobase::open(). Update the statistics in a consistent way, both for FOREIGN KEY triggers and for the main table. If FOREIGN KEY constraints exist, do not dereference a freed pointer, but cache the proper value of node->is_delete so that it matches prebuilt->table.

row_purge_record_func(): Update statistics if innodb_stats_include_delete_marked=ON.

row_undo_ins(): Update statistics on ROLLBACK of a fresh INSERT. This is independent of the parameter; the record is not delete-marked.

row_undo_mod(): Update statistics on the ROLLBACK of updating key columns, or (if innodb_stats_include_delete_marked=OFF) updating delete-marks.

innodb.innodb_stats_persistent: Renamed and extended from innodb.innodb_stats_del_mark. Reduced the unnecessarily large dataset from 262,144 to 32 rows. Test both values of the configuration parameter innodb_stats_include_delete_marked. Test that purge is updating the statistics.

innodb_fts.innodb_fts_multiple_index: Adjust the result. The test performs a ROLLBACK of an INSERT, which now affects the statistics.

include/wait_all_purged.inc: Moved from innodb.innodb_truncate_debug to its own file.
9 years ago
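The statistics-update rules that this commit makes consistent can be summarised as a small decision table. The sketch below is an illustration of those rules only, not InnoDB source; the function name and string constants are hypothetical.

```python
def stats_update_needed(op: str, include_delete_marked: bool) -> bool:
    """Should this purge/rollback operation update persistent statistics?

    op is one of 'purge_delmark', 'rollback_insert', 'rollback_delmark'.
    include_delete_marked models innodb_stats_include_delete_marked.
    """
    if op == "purge_delmark":
        # Purging a delete-marked record changes the counted rows only
        # when delete-marked records are included in the statistics.
        return include_delete_marked
    if op == "rollback_insert":
        # The rolled-back record is not delete-marked, so this is
        # independent of the parameter (row_undo_ins).
        return True
    if op == "rollback_delmark":
        # Rolling back a delete-mark change matters only when the
        # statistics exclude delete-marked records (row_undo_mod).
        return not include_delete_marked
    raise ValueError(op)
```

With the parameter ON, purge updates the statistics and delete-mark rollbacks do not; with it OFF, the opposite holds. Rollback of a fresh INSERT updates them either way.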
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC

Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed.

[TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT eventually causes row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.]

The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present in the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0.

The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge.

The InnoDB redo log format is changed in two ways: we remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table.

This also changes the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0.

When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo logs may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK.

Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only.

trx_undo_ptr_t::undo: Renamed from update_undo.

trx_undo_ptr_t::old_insert: Renamed from insert_undo.

trx_rseg_t::undo_list: Renamed from update_undo_list.

trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached.

trx_rseg_t::old_insert_list: Renamed from insert_undo_list.

row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record.

trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log.

ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.)

row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0.

Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type.

MLOG_UNDO_HDR_REUSE: Remove. This changes the redo log format!

innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions.

The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following:

./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680
grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
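The MVCC speedup described above comes from short-circuiting the read-view check when DB_TRX_ID is 0. The following is a hedged Python sketch of that fast path, modelled on the usual three-boundary read-view visibility test; the class is an illustration, not the actual ReadView code.

```python
class ReadView:
    """Simplified model of an MVCC read view (illustrative only)."""

    def __init__(self, low_limit_id, up_limit_id, ids):
        self.low_limit_id = low_limit_id  # no trx with id >= this is visible
        self.up_limit_id = up_limit_id    # every trx with id < this is visible
        self.ids = set(ids)               # transactions active at view creation

    def changes_visible(self, trx_id):
        if trx_id == 0:
            # Purge reset the record's DB_TRX_ID: the history is gone, so
            # the record is visible to every read view. No transaction
            # system mutex, no lookup -- this is what speeds up the MVCC.
            return True
        if trx_id < self.up_limit_id:
            return True          # committed before the view was created
        if trx_id >= self.low_limit_id:
            return False         # started after the view was created
        return trx_id not in self.ids  # visible unless still active
```

A writing transaction can never have ID 0, so returning true for 0 is always safe once purge has removed the history.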
MDEV-14407 Assertion failure during rollback

Rollback attempted to dereference DB_ROLL_PTR=0, which cannot possibly be a valid undo log pointer. A safer canonical value would be roll_ptr_t(1) << ROLL_PTR_INSERT_FLAG_POS, which is what was chosen in MDEV-12288, corresponding to reset_trx_id.

No deterministic test case for the bug was found. The simplest test cases may be related to MDEV-11415, which suppresses undo logging for ALGORITHM=COPY operations. In those operations, in the spirit of MDEV-12288, we should actually have written reset_trx_id instead of using the transaction identifier of the current transaction (and a bogus value of DB_ROLL_PTR=0). However, thanks to MySQL Bug#28432, which I had fixed in MySQL 5.6.8 as part of WL#6255, access to the rebuilt table by earlier-started transactions should actually have been refused with ER_TABLE_DEF_CHANGED.

reset_trx_id: Move the definition to data0type.cc and the declaration to data0type.h.

btr_cur_ins_lock_and_undo(): When undo logging is disabled, use the safe value that corresponds to reset_trx_id.

btr_cur_optimistic_insert(): Validate the DB_TRX_ID,DB_ROLL_PTR before inserting into a clustered index leaf page.

ins_node_t::sys_buf[]: Replaces row_id_buf and trx_id_buf and some heap usage.

row_ins_alloc_sys_fields(): Init ins_node_t::sys_buf[] to reset_trx_id.

row_ins_buf(): Only if undo logging is enabled, copy trx->id to node->sys_buf. Otherwise, rely on the initialization in row_ins_alloc_sys_fields().

row_purge_reset_trx_id(): Invoke mlog_write_string() with reset_trx_id directly. (No functional change.)

trx_undo_page_report_modify(): Assert that the DB_ROLL_PTR is not 0.

trx_undo_get_undo_rec_low(): Assert that the roll_ptr is valid before trying to dereference it.

dict_index_t::is_primary(): Check if the index is the primary key.

PageConverter::adjust_cluster_record(): Fix MDEV-15249 (Crash in MVCC read after IMPORT TABLESPACE) by resetting the system fields to reset_trx_id instead of writing the current transaction ID (which will be committed at the end of the IMPORT TABLESPACE) and DB_ROLL_PTR=0. This can partially be viewed as a follow-up fix of MDEV-12288, because IMPORT should already then have written DB_TRX_ID=0 and DB_ROLL_PTR=1<<55 to prevent unnecessary DB_TRX_ID lookups in subsequent accesses to the table.
8 years ago
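The canonical "reset" values keep coming up in this commit and in MDEV-12288: DB_TRX_ID=0 together with DB_ROLL_PTR set to the insert flag bit, never DB_ROLL_PTR=0. A minimal sketch of those constants, assuming the bit position 55 implied by the 1<<55 value quoted in the commit messages (the helper name is hypothetical):

```python
# Position of the "fresh insert" flag, the top bit of the 7-byte
# DB_ROLL_PTR, matching the 1<<55 value in the commit messages above.
ROLL_PTR_INSERT_FLAG_POS = 55

RESET_TRX_ID = 0                                 # no transaction to look up
RESET_ROLL_PTR = 1 << ROLL_PTR_INSERT_FLAG_POS   # fresh insert, no undo log


def is_reset(trx_id: int, roll_ptr: int) -> bool:
    """True when a record carries the canonical reset system columns.

    Illustrative helper: a record whose history was purged (or that was
    imported) carries exactly these values. DB_ROLL_PTR=0, the value that
    triggered this assertion failure, is never a valid undo log pointer.
    """
    return trx_id == RESET_TRX_ID and roll_ptr == RESET_ROLL_PTR
```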
MDEV-11415 Remove excessive undo logging during ALTER TABLE…ALGORITHM=COPY

If a crash occurs during ALTER TABLE…ALGORITHM=COPY, InnoDB would spend a lot of time rolling back writes to the intermediate copy of the table. To reduce the amount of busy work done, a work-around was introduced in commit fd069e2bb36a3c1c1f26d65dd298b07e6d83ac8b in MySQL 4.1.8 and 5.0.2, to commit the transaction after every 10,000 inserted rows.

A proper fix would have been to disable the undo logging altogether and to simply drop the intermediate copy of the table on subsequent server startup. This is what happens in MariaDB 10.3 with MDEV-14717,MDEV-14585. In MariaDB 10.2, the intermediate copy of the table would be left behind with a name starting with the string #sql.

This is a backport of a bug fix from MySQL 8.0.0 to MariaDB, contributed by jixianliang <271365745@qq.com>.

Unlike recent MySQL, MariaDB supports ALTER IGNORE. For that operation InnoDB must for now keep the undo logging enabled, so that the latest row can be rolled back in case of an error.

In Galera cluster, the LOAD DATA statement will retain the existing behaviour and commit the transaction after every 10,000 rows if the parameter wsrep_load_data_splitting=ON is set. The logic to do so (the wsrep_load_data_split() function and the call handler::extra(HA_EXTRA_FAKE_START_STMT)) is joint work by Ji Xianliang and Marko Mäkelä.

The original fix:

Author: Thirunarayanan Balathandayuthapani <thirunarayanan.balathandayuth@oracle.com>
Date: Wed Dec 2 16:09:15 2015 +0530

Bug#17479594 AVOID INTERMEDIATE COMMIT WHILE DOING ALTER TABLE ALGORITHM=COPY

Problem: During ALTER TABLE, we commit and restart the transaction for every 10,000 rows, so that the rollback after recovery would not take so long.

Fix: Suppress the undo logging during the copy alter operation. If an FTS index is present, then insert directly into the FTS auxiliary table rather than doing it at commit time.

ha_innobase::num_write_row: Remove the variable.

ha_innobase::write_row(): Remove the hack for committing every 10,000 rows.

row_lock_table_for_mysql(): Remove the two extra parameters.

lock_get_src_table(), lock_is_table_exclusive(): Remove.

Reviewed-by: Marko Mäkelä <marko.makela@oracle.com>
Reviewed-by: Shaohua Wang <shaohua.wang@oracle.com>
Reviewed-by: Jon Olav Hauglid <jon.hauglid@oracle.com>
8 years ago
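The difference between the old work-around and the behaviour after this fix can be sketched abstractly: the old code traded atomicity for shorter crash rollback by committing every 10,000 rows, while with undo logging suppressed a single commit covers the whole copy. This is an illustration only; the function names and callback shape are hypothetical, not InnoDB's API.

```python
BATCH = 10_000  # the hard-coded row count from MySQL 4.1.8 / 5.0.2


def copy_rows_old(rows, write_row, commit):
    """Old work-around: intermediate commit every 10,000 inserted rows."""
    for i, row in enumerate(rows, 1):
        write_row(row)
        if i % BATCH == 0:
            commit()  # shortens rollback after a crash, but the copy
                      # is no longer a single atomic transaction
    commit()


def copy_rows_new(rows, write_row, commit):
    """With undo logging suppressed, one commit covers the whole copy;
    after a crash, the leftover intermediate table is simply dropped."""
    for row in rows:
        write_row(row)
    commit()
```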
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. 
When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
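The visibility fast path described above ("ReadView::changes_visible(): Allow id==0. Return true for it. This is what speeds up the MVCC.") can be illustrated with a minimal sketch. This is not the actual InnoDB ReadView class; the struct and its member names (low_limit, up_limit, active) are hypothetical simplifications of the usual read-view bookkeeping, shown only to make the id==0 shortcut concrete.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

typedef uint64_t trx_id_t;

// Hypothetical, simplified model of a read view; not the real ReadView.
struct read_view_sketch {
  trx_id_t low_limit;         // ids >= this started after the view: invisible
  trx_id_t up_limit;          // ids < this committed before the view: visible
  std::set<trx_id_t> active;  // ids that were active when the view was created

  bool changes_visible(trx_id_t id) const {
    // MDEV-12288 fast path: once purge has reset DB_TRX_ID to 0, the
    // record is visible in every view, with no further checks and no
    // need to consult the transaction system.
    if (id == 0) return true;
    if (id < up_limit) return true;
    if (id >= low_limit) return false;
    return active.count(id) == 0;
  }
};
```

The point of the shortcut is that a reader encountering DB_TRX_ID=0 never has to look up transaction state at all, which is what makes resetting the column during purge worthwhile.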
MDEV-12289 Keep 128 persistent rollback segments for compatibility and performance InnoDB divides the allocation of undo logs into rollback segments. The DB_ROLL_PTR system column of clustered indexes can address up to 128 rollback segments (TRX_SYS_N_RSEGS). Originally, InnoDB only created one rollback segment. In MySQL 5.5 or in the InnoDB Plugin for MySQL 5.1, all 128 rollback segments were created. MySQL 5.7 hard-codes the rollback segment IDs 1..32 for temporary undo logs. On upgrade, unless a slow shutdown (innodb_fast_shutdown=0) was performed on the old server instance, these rollback segments could be in use by transactions that are in XA PREPARE state or transactions that were left behind by a server kill followed by a normal shutdown immediately after restart. Persistent tables cannot refer to temporary undo logs or vice versa. Therefore, we should keep two distinct sets of rollback segments: one for persistent tables and another for temporary tables. In this way, all 128 rollback segments will be available for both types of tables, which could improve performance. Also, MariaDB 10.2 will remain more compatible than MySQL 5.7 with data files from earlier versions of MySQL or MariaDB. trx_sys_t::temp_rsegs[TRX_SYS_N_RSEGS]: A new array of temporary rollback segments. The trx_sys_t::rseg_array[TRX_SYS_N_RSEGS] will be solely for persistent undo logs. srv_tmp_undo_logs. Remove. Use the constant TRX_SYS_N_RSEGS. srv_available_undo_logs: Change the type to ulong. trx_rseg_get_on_id(): Remove. Instead, let the callers refer to trx_sys directly. trx_rseg_create(), trx_sysf_rseg_find_free(): Remove unneeded parameters. These functions only deal with persistent undo logs. trx_temp_rseg_create(): New function, to create all temporary rollback segments at server startup. trx_rseg_t::is_persistent(): Determine if the rollback segment is for persistent tables. trx_sys_is_noredo_rseg_slot(): Remove. 
The callers must know based on context (such as table handle) whether the DB_ROLL_PTR is referring to a persistent undo log. trx_sys_create_rsegs(): Remove all parameters, which were always passed as global variables. Instead, modify the global variables directly. enum trx_rseg_type_t: Remove. trx_t::get_temp_rseg(): A method to ensure that a temporary rollback segment has been assigned for the transaction. trx_t::assign_temp_rseg(): Replaces trx_assign_rseg(). trx_purge_free_segment(), trx_purge_truncate_rseg_history(): Remove the redundant variable noredo=false. Temporary undo logs are discarded immediately at transaction commit or rollback, not lazily by purge. trx_purge_mark_undo_for_truncate(): Remove references to the temporary rollback segments. trx_purge_mark_undo_for_truncate(): Remove a check for temporary rollback segments. Only the dedicated persistent undo log tablespaces can be truncated. trx_undo_get_undo_rec_low(), trx_undo_get_undo_rec(): Add the parameter is_temp. trx_rseg_mem_restore(): Split from trx_rseg_mem_create(). Initialize the undo log and the rollback segment from the file data structures. trx_sysf_get_n_rseg_slots(): Renamed from trx_sysf_used_slots_for_redo_rseg(). Count the persistent rollback segment headers that have been initialized. trx_sys_close(): Also free trx_sys->temp_rsegs[]. get_next_redo_rseg(): Merged to trx_assign_rseg_low(). trx_assign_rseg_low(): Remove the parameters and access the global variables directly. Revert to simple round-robin, now that the whole trx_sys->rseg_array[] is for persistent undo log again. get_next_noredo_rseg(): Moved to trx_t::assign_temp_rseg(). srv_undo_tablespaces_init(): Remove some parameters and use the global variables directly. Clarify some error messages. Adjust the test innodb.log_file. Apparently, before these changes, InnoDB somehow ignored missing dedicated undo tablespace files that are pointed by the TRX_SYS header page, possibly losing part of essential transaction system state.
9 years ago
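The commit above reverts trx_assign_rseg_low() "to simple round-robin, now that the whole trx_sys->rseg_array[] is for persistent undo log again." A minimal sketch of such a scheme, under the assumption of an atomic counter taken modulo TRX_SYS_N_RSEGS (the counter name and function are illustrative, not the actual InnoDB code):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// TRX_SYS_N_RSEGS matches the constant named in the commit message:
// DB_ROLL_PTR can address up to 128 rollback segments.
static const size_t TRX_SYS_N_RSEGS = 128;

// Hypothetical shared counter; in reality the selection would live
// inside the transaction system, guarded appropriately.
static std::atomic<size_t> rseg_counter(0);

// Each new transaction takes the next slot, wrapping at 128.
size_t assign_rseg_round_robin() {
  return rseg_counter.fetch_add(1) % TRX_SYS_N_RSEGS;
}
```

Round-robin spreads concurrent transactions across all 128 persistent rollback segments, which reduces contention on any single segment's mutex.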
MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. 
(If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. 
The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). 
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). 
innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. 
Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written. btr_index_rec_validate(): Account for instant ADD COLUMN row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). 
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them.

rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields.

btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'.

row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t.

rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'.

REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record.

rec_comp_status_t: An enum of the status bit values.

rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
8 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC

Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed.

[TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.]

The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present in the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0.

The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge.

The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table.

This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0.
When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, a separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK.

Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only.

trx_undo_ptr_t::undo: Renamed from update_undo.

trx_undo_ptr_t::old_insert: Renamed from insert_undo.

trx_rseg_t::undo_list: Renamed from update_undo_list.

trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached.

trx_rseg_t::old_insert_list: Renamed from insert_undo_list.

row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record.

trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log.

ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.)

row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0.

Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type.

MLOG_UNDO_HDR_REUSE: Remove. This changes the redo log format!

innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions.

The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following:

./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680
grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
MDEV-11415 Remove excessive undo logging during ALTER TABLE…ALGORITHM=COPY

If a crash occurs during ALTER TABLE…ALGORITHM=COPY, InnoDB would spend a lot of time rolling back writes to the intermediate copy of the table. To reduce the amount of busy work done, a work-around was introduced in commit fd069e2bb36a3c1c1f26d65dd298b07e6d83ac8b in MySQL 4.1.8 and 5.0.2, to commit the transaction after every 10,000 inserted rows.

A proper fix would have been to disable the undo logging altogether and to simply drop the intermediate copy of the table on subsequent server startup. This is what happens in MariaDB 10.3 with MDEV-14717,MDEV-14585. In MariaDB 10.2, the intermediate copy of the table would be left behind with a name starting with the string #sql.

This is a backport of a bug fix from MySQL 8.0.0 to MariaDB, contributed by jixianliang <271365745@qq.com>.

Unlike recent MySQL, MariaDB supports ALTER IGNORE. For that operation InnoDB must for now keep the undo logging enabled, so that the latest row can be rolled back in case of an error. In Galera cluster, the LOAD DATA statement will retain the existing behaviour and commit the transaction after every 10,000 rows if the parameter wsrep_load_data_splitting=ON is set. The logic to do so (the wsrep_load_data_split() function and the call handler::extra(HA_EXTRA_FAKE_START_STMT)) is joint work by Ji Xianliang and Marko Mäkelä.

The original fix:

Author: Thirunarayanan Balathandayuthapani <thirunarayanan.balathandayuth@oracle.com>
Date: Wed Dec 2 16:09:15 2015 +0530

Bug#17479594 AVOID INTERMEDIATE COMMIT WHILE DOING ALTER TABLE ALGORITHM=COPY

Problem: During ALTER TABLE, we commit and restart the transaction for every 10,000 rows, so that the rollback after recovery would not take so long.

Fix: Suppress the undo logging during copy alter operation. If fts_index is present then insert directly into fts auxiliary table rather than doing at commit time.

ha_innobase::num_write_row: Remove the variable.

ha_innobase::write_row(): Remove the hack for committing every 10000 rows.

row_lock_table_for_mysql(): Remove the extra 2 parameters.

lock_get_src_table(), lock_is_table_exclusive(): Remove.

Reviewed-by: Marko Mäkelä <marko.makela@oracle.com>
Reviewed-by: Shaohua Wang <shaohua.wang@oracle.com>
Reviewed-by: Jon Olav Hauglid <jon.hauglid@oracle.com>
MDEV-12698 innodb.innodb_stats_del_mark test failure

In my merge of the MySQL fix for Oracle Bug#23333990 / WL#9513 I overlooked some subsequent revisions to the test, and I also failed to notice that the test is actually always failing.

Oracle introduced the parameter innodb_stats_include_delete_marked but failed to consistently take it into account in FOREIGN KEY constraints that involve CASCADE or SET NULL. When innodb_stats_include_delete_marked=ON, obviously the purge of delete-marked records should update the statistics as well.

One more omission was that statistics were never updated on ROLLBACK. We are fixing that as well, properly taking into account the parameter innodb_stats_include_delete_marked.

dict_stats_analyze_index_level(): Simplify an expression. (Using the ternary operator with a constant operand is unnecessary obfuscation.)

page_scan_method_t: Revert the change done by Oracle. Instead, examine srv_stats_include_delete_marked directly where it is needed.

dict_stats_update_if_needed(): Renamed from row_update_statistics_if_needed().

row_update_for_mysql_using_upd_graph(): Assert that the table statistics are initialized, as guaranteed by ha_innobase::open(). Update the statistics in a consistent way, both for FOREIGN KEY triggers and for the main table. If FOREIGN KEY constraints exist, do not dereference a freed pointer, but cache the proper value of node->is_delete so that it matches prebuilt->table.

row_purge_record_func(): Update statistics if innodb_stats_include_delete_marked=ON.

row_undo_ins(): Update statistics (on ROLLBACK of a fresh INSERT). This is independent of the parameter; the record is not delete-marked.

row_undo_mod(): Update statistics on the ROLLBACK of updating key columns, or (if innodb_stats_include_delete_marked=OFF) updating delete-marks.

innodb.innodb_stats_persistent: Renamed and extended from innodb.innodb_stats_del_mark. Reduced the unnecessarily large dataset from 262,144 to 32 rows.

Test both values of the configuration parameter innodb_stats_include_delete_marked. Test that purge is updating the statistics.

innodb_fts.innodb_fts_multiple_index: Adjust the result. The test is performing a ROLLBACK of an INSERT, which now affects the statistics.

include/wait_all_purged.inc: Moved from innodb.innodb_truncate_debug to its own file.
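The renamed dict_stats_update_if_needed() follows the usual persistent-statistics heuristic: recalculate once a sizeable fraction of the table has been modified since the last recalculation. A rough self-contained sketch follows; the structure, the in-place recalculation, and the 10% threshold are illustrative assumptions, not the exact InnoDB implementation (which schedules the recalculation in a background thread):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for a table's persistent-statistics state.
struct table_stats {
        uint64_t n_rows;                 // estimated row count
        uint64_t stat_modified_counter;  // rows changed since last recalc

        // Recalculate statistics once roughly 10% of the table has been
        // modified (assumed threshold). Purge, rollback, and ordinary
        // DML would all bump the counter and then call this.
        bool update_if_needed() {
                if (stat_modified_counter > n_rows / 10) {
                        stat_modified_counter = 0;  // pretend we recalculated
                        return true;
                }
                return false;
        }
};
```

The bug described above amounts to some code paths (FOREIGN KEY cascades, purge of delete-marked records, ROLLBACK) never reaching the counter-and-check step at all, so statistics silently went stale.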
/*****************************************************************************

Copyright (c) 1997, 2017, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2017, 2019, MariaDB Corporation.

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Suite 500, Boston, MA 02110-1335 USA

*****************************************************************************/
/**************************************************//**
@file row/row0purge.cc
Purge obsolete records

Created 3/14/1997 Heikki Tuuri
*******************************************************/

#include "row0purge.h"
#include "fsp0fsp.h"
#include "mach0data.h"
#include "dict0stats.h"
#include "trx0rseg.h"
#include "trx0trx.h"
#include "trx0roll.h"
#include "trx0undo.h"
#include "trx0purge.h"
#include "trx0rec.h"
#include "que0que.h"
#include "row0row.h"
#include "row0upd.h"
#include "row0vers.h"
#include "row0mysql.h"
#include "row0log.h"
#include "log0log.h"
#include "srv0mon.h"
#include "srv0start.h"
#include "handler.h"
#include "ha_innodb.h"
#include "fil0fil.h"
/*************************************************************************
IMPORTANT NOTE: Any operation that generates redo MUST check that there
is enough space in the redo log before starting that operation. This is
done by calling log_free_check(). The reason for checking the
availability of the redo log space before the start of the operation is
that we MUST not hold any synchronization objects when performing the
check.
If you make a change in this module make sure that no codepath is
introduced where a call to log_free_check() is bypassed. */
/***********************************************************//**
Repositions the pcur in the purge node on the clustered index record,
if found. If the record is not found, close pcur.
@return TRUE if the record was found */
static
ibool
row_purge_reposition_pcur(
/*======================*/
        ulint           mode,   /*!< in: latching mode */
        purge_node_t*   node,   /*!< in: row purge node */
        mtr_t*          mtr)    /*!< in: mtr */
{
        if (node->found_clust) {
                ut_ad(node->validate_pcur());

                node->found_clust = btr_pcur_restore_position(
                        mode, &node->pcur, mtr);
        } else {
                node->found_clust = row_search_on_row_ref(
                        &node->pcur, mode, node->table, node->ref, mtr);

                if (node->found_clust) {
                        btr_pcur_store_position(&node->pcur, mtr);
                }
        }

        /* Close the current cursor if we fail to position it correctly. */
        if (!node->found_clust) {
                btr_pcur_close(&node->pcur);
        }

        return(node->found_clust);
}
/***********************************************************//**
Removes a delete marked clustered index record if possible.
@retval true if the row was not found, or it was successfully removed
@retval false if the row was modified after the delete marking */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
bool
row_purge_remove_clust_if_poss_low(
/*===============================*/
        purge_node_t*   node,   /*!< in/out: row purge node */
        ulint           mode)   /*!< in: BTR_MODIFY_LEAF or BTR_MODIFY_TREE */
{
        dict_index_t*   index;
        bool            success = true;
        mtr_t           mtr;
        rec_t*          rec;
        mem_heap_t*     heap    = NULL;
        ulint*          offsets;
        ulint           offsets_[REC_OFFS_NORMAL_SIZE];
        rec_offs_init(offsets_);

        ut_ad(rw_lock_own(dict_operation_lock, RW_LOCK_S)
              || node->vcol_info.is_used());

        index = dict_table_get_first_index(node->table);

        log_free_check();
        mtr_start(&mtr);
        index->set_modified(mtr);

        if (!row_purge_reposition_pcur(mode, node, &mtr)) {
                /* The record was already removed. */
                goto func_exit;
        }

        rec = btr_pcur_get_rec(&node->pcur);

        offsets = rec_get_offsets(
                rec, index, offsets_, true, ULINT_UNDEFINED, &heap);

        if (node->roll_ptr != row_get_rec_roll_ptr(rec, index, offsets)) {
                /* Someone else has modified the record later: do not remove */
                goto func_exit;
        }

        ut_ad(rec_get_deleted_flag(rec, rec_offs_comp(offsets)));
        /* In delete-marked records, DB_TRX_ID must
        always refer to an existing undo log record. */
        ut_ad(row_get_rec_trx_id(rec, index, offsets));

        if (mode == BTR_MODIFY_LEAF) {
                success = btr_cur_optimistic_delete(
                        btr_pcur_get_btr_cur(&node->pcur), 0, &mtr);
        } else {
                dberr_t err;
                ut_ad(mode == (BTR_MODIFY_TREE | BTR_LATCH_FOR_DELETE));
                btr_cur_pessimistic_delete(
                        &err, FALSE, btr_pcur_get_btr_cur(&node->pcur), 0,
                        false, &mtr);

                switch (err) {
                case DB_SUCCESS:
                        break;
                case DB_OUT_OF_FILE_SPACE:
                        success = false;
                        break;
                default:
                        ut_error;
                }
        }

func_exit:
        if (heap) {
                mem_heap_free(heap);
        }

        /* Persistent cursor is closed if reposition fails. */
        if (node->found_clust) {
                btr_pcur_commit_specify_mtr(&node->pcur, &mtr);
        } else {
                mtr_commit(&mtr);
        }

        return(success);
}
/***********************************************************//**
Removes a clustered index record if it has not been modified after the delete
marking.
@retval true if the row was not found, or it was successfully removed
@retval false the purge needs to be suspended because of running out
of file space. */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
bool
row_purge_remove_clust_if_poss(
/*===========================*/
        purge_node_t*   node)   /*!< in/out: row purge node */
{
        if (row_purge_remove_clust_if_poss_low(node, BTR_MODIFY_LEAF)) {
                return(true);
        }

        for (ulint n_tries = 0;
             n_tries < BTR_CUR_RETRY_DELETE_N_TIMES;
             n_tries++) {
                if (row_purge_remove_clust_if_poss_low(
                            node, BTR_MODIFY_TREE | BTR_LATCH_FOR_DELETE)) {
                        return(true);
                }

                os_thread_sleep(BTR_CUR_RETRY_SLEEP_TIME);
        }

        return(false);
}
/** Tries to store secondary index cursor before opening the MySQL table for
virtual index condition computation.
@param[in,out]  node            row purge node
@param[in]      index           secondary index
@param[in,out]  sec_pcur        secondary index cursor
@param[in,out]  sec_mtr         mini-transaction which holds
                                secondary index entry */
static void row_purge_store_vsec_cur(
        purge_node_t*   node,
        dict_index_t*   index,
        btr_pcur_t*     sec_pcur,
        mtr_t*          sec_mtr)
{
        row_purge_reposition_pcur(BTR_SEARCH_LEAF, node, sec_mtr);

        if (!node->found_clust) {
                return;
        }

        node->vcol_info.set_requested();

        btr_pcur_store_position(sec_pcur, sec_mtr);

        btr_pcurs_commit_specify_mtr(&node->pcur, sec_pcur, sec_mtr);
}
/** Tries to restore secondary index cursor after opening the MySQL table
@param[in,out]  node    row purge node
@param[in]      index   secondary index
@param[in,out]  sec_mtr mini-transaction which holds secondary index entry
@param[in]      is_tree true=pessimistic purge,
                        false=optimistic (leaf-page only)
@return false in case of restore failure. */
static bool row_purge_restore_vsec_cur(
        purge_node_t*   node,
        dict_index_t*   index,
        btr_pcur_t*     sec_pcur,
        mtr_t*          sec_mtr,
        bool            is_tree)
{
        sec_mtr->start();
        index->set_modified(*sec_mtr);

        return btr_pcur_restore_position(
                is_tree ? BTR_PURGE_TREE : BTR_PURGE_LEAF,
                sec_pcur, sec_mtr);
}
/** Determines if it is possible to remove a secondary index entry.
Removal is possible if the secondary index entry does not refer to any
not delete marked version of a clustered index record where DB_TRX_ID
is newer than the purge view.

NOTE: This function should only be called by the purge thread, only
while holding a latch on the leaf page of the secondary index entry
(or keeping the buffer pool watch on the page). It is possible that
this function first returns true and then false, if a user transaction
inserts a record that the secondary index entry would refer to.
However, in that case, the user transaction would also re-insert the
secondary index entry after purge has removed it and released the leaf
page latch.
@param[in,out]  node            row purge node
@param[in]      index           secondary index
@param[in]      entry           secondary index entry
@param[in,out]  sec_pcur        secondary index cursor or NULL
                                if it is called for purge buffering
                                operation.
@param[in,out]  sec_mtr         mini-transaction which holds
                                secondary index entry or NULL if it is
                                called for purge buffering operation.
@param[in]      is_tree         true=pessimistic purge,
                                false=optimistic (leaf-page only)
@return true if the secondary index record can be purged */
bool
row_purge_poss_sec(
        purge_node_t*   node,
        dict_index_t*   index,
        const dtuple_t* entry,
        btr_pcur_t*     sec_pcur,
        mtr_t*          sec_mtr,
        bool            is_tree)
{
        bool    can_delete;
        mtr_t   mtr;

        ut_ad(!dict_index_is_clust(index));

        const bool      store_cur = sec_mtr && !node->vcol_info.is_used()
                && dict_index_has_virtual(index);

        if (store_cur) {
                row_purge_store_vsec_cur(node, index, sec_pcur, sec_mtr);
                ut_ad(sec_mtr->has_committed()
                      == node->vcol_info.is_requested());

                /* The PRIMARY KEY value was not found in the clustered
                index, while the secondary index record was found. We can
                purge the secondary index record. */
                if (!node->vcol_info.is_requested()) {
                        ut_ad(!node->found_clust);
                        return true;
                }
        }

retry_purge_sec:
        mtr_start(&mtr);

        can_delete = !row_purge_reposition_pcur(BTR_SEARCH_LEAF, node, &mtr)
                || !row_vers_old_has_index_entry(true,
                                                 btr_pcur_get_rec(&node->pcur),
                                                 &mtr, index, entry,
                                                 node->roll_ptr, node->trx_id,
                                                 &node->vcol_info);

        if (node->vcol_info.is_first_fetch()) {
                ut_ad(store_cur);

                const TABLE* t= node->vcol_info.table();
                DBUG_LOG("purge", "retry " << t
                         << (is_tree ? " tree" : " leaf")
                         << index->name << "," << index->table->name
                         << ": " << rec_printer(entry).str());
                ut_ad(mtr.has_committed());

                if (t) {
                        node->vcol_info.set_used();
                        goto retry_purge_sec;
                }

                node->table = NULL;
                sec_pcur = NULL;
                return false;
        }

        /* Persistent cursor is closed if reposition fails. */
        if (node->found_clust) {
                btr_pcur_commit_specify_mtr(&node->pcur, &mtr);
        } else {
                mtr.commit();
        }

        ut_ad(mtr.has_committed());

        /* If the virtual column info is not used then reset the virtual
        column info. */
        if (node->vcol_info.is_requested()
            && !node->vcol_info.is_used()) {
                node->vcol_info.reset();
        }

        if (store_cur && !row_purge_restore_vsec_cur(
                    node, index, sec_pcur, sec_mtr, is_tree)) {
                return false;
        }

        return can_delete;
}
/***************************************************************
Removes a secondary index entry if possible, by modifying the
index tree. Does not try to buffer the delete.
@return TRUE if success or if not found */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
ibool
row_purge_remove_sec_if_poss_tree(
/*==============================*/
        purge_node_t*   node,   /*!< in: row purge node */
        dict_index_t*   index,  /*!< in: index */
        const dtuple_t* entry)  /*!< in: index entry */
{
        btr_pcur_t              pcur;
        ibool                   success = TRUE;
        dberr_t                 err;
        mtr_t                   mtr;
        enum row_search_result  search_result;

        log_free_check();
        mtr_start(&mtr);
        index->set_modified(mtr);

        if (!index->is_committed()) {
                /* The index->online_status may change if the index is
                or was being created online, but not committed yet. It
                is protected by index->lock. */
                mtr_sx_lock(dict_index_get_lock(index), &mtr);

                if (dict_index_is_online_ddl(index)) {
                        /* Online secondary index creation will not
                        copy any delete-marked records. Therefore
                        there is nothing to be purged. We must also
                        skip the purge when a completed index is
                        dropped by rollback_inplace_alter_table(). */
                        goto func_exit_no_pcur;
                }
        } else {
                /* For secondary indexes,
                index->online_status==ONLINE_INDEX_COMPLETE if
                index->is_committed(). */
                ut_ad(!dict_index_is_online_ddl(index));
        }

        search_result = row_search_index_entry(
                index, entry,
                BTR_MODIFY_TREE | BTR_LATCH_FOR_DELETE,
                &pcur, &mtr);

        switch (search_result) {
        case ROW_NOT_FOUND:
                /* Not found. This is a legitimate condition. In a
                rollback, InnoDB will remove secondary recs that would
                be purged anyway. Then the actual purge will not find
                the secondary index record. Also, the purge itself is
                eager: if it comes to consider a secondary index
                record, and notices it does not need to exist in the
                index, it will remove it. Then if/when the purge
                comes to consider the secondary index record a second
                time, it will not exist any more in the index. */

                /* fputs("PURGE:........sec entry not found\n", stderr); */
                /* dtuple_print(stderr, entry); */
                goto func_exit;
        case ROW_FOUND:
                break;
        case ROW_BUFFERED:
        case ROW_NOT_DELETED_REF:
                /* These are invalid outcomes, because the mode passed
                to row_search_index_entry() did not include any of the
                flags BTR_INSERT, BTR_DELETE, or BTR_DELETE_MARK. */
                ut_error;
        }

        /* We should remove the index record if no later version of the row,
        which cannot be purged yet, requires its existence. If some requires,
        we should do nothing. */

        if (row_purge_poss_sec(node, index, entry, &pcur, &mtr, true)) {

                /* Remove the index record, which should have been
                marked for deletion. */
                if (!rec_get_deleted_flag(btr_cur_get_rec(
                                                  btr_pcur_get_btr_cur(&pcur)),
                                          dict_table_is_comp(index->table))) {
                        ib::error()
                                << "tried to purge non-delete-marked record"
                                " in index " << index->name
                                << " of table " << index->table->name
                                << ": tuple: " << *entry
                                << ", record: " << rec_index_print(
                                        btr_cur_get_rec(
                                                btr_pcur_get_btr_cur(&pcur)),
                                        index);

                        ut_ad(0);

                        goto func_exit;
                }

                btr_cur_pessimistic_delete(&err, FALSE,
                                           btr_pcur_get_btr_cur(&pcur),
                                           0, false, &mtr);
                switch (UNIV_EXPECT(err, DB_SUCCESS)) {
                case DB_SUCCESS:
                        break;
                case DB_OUT_OF_FILE_SPACE:
                        success = FALSE;
                        break;
                default:
                        ut_error;
                }
        }

        if (node->vcol_op_failed()) {
                ut_ad(mtr.has_committed());
                ut_ad(!pcur.old_rec_buf);
                ut_ad(pcur.pos_state == BTR_PCUR_NOT_POSITIONED);
                return false;
        }

func_exit:
        btr_pcur_close(&pcur);
func_exit_no_pcur:
        mtr_commit(&mtr);

        return(success);
}
/***************************************************************
Removes a secondary index entry without modifying the index tree,
if possible.
@retval true if success or if not found
@retval false if row_purge_remove_sec_if_poss_tree() should be invoked */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
bool
row_purge_remove_sec_if_poss_leaf(
/*==============================*/
	purge_node_t*	node,	/*!< in: row purge node */
	dict_index_t*	index,	/*!< in: index */
	const dtuple_t*	entry)	/*!< in: index entry */
{
	mtr_t			mtr;
	btr_pcur_t		pcur;
	enum btr_latch_mode	mode;
	enum row_search_result	search_result;
	bool			success	= true;

	log_free_check();
	ut_ad(index->table == node->table);
	ut_ad(!index->table->is_temporary());
	mtr_start(&mtr);
	index->set_modified(mtr);

	if (!index->is_committed()) {
		/* For an uncommitted spatial index, we also skip the purge. */
		if (dict_index_is_spatial(index)) {
			goto func_exit_no_pcur;
		}

		/* The index->online_status may change if the
		index is or was being created online, but not
		committed yet. It is protected by index->lock. */
		mtr_s_lock(dict_index_get_lock(index), &mtr);

		if (dict_index_is_online_ddl(index)) {
			/* Online secondary index creation will not
			copy any delete-marked records. Therefore
			there is nothing to be purged. We must also
			skip the purge when a completed index is
			dropped by rollback_inplace_alter_table(). */
			goto func_exit_no_pcur;
		}

		mode = BTR_PURGE_LEAF_ALREADY_S_LATCHED;
	} else {
		/* For secondary indexes,
		index->online_status==ONLINE_INDEX_COMPLETE if
		index->is_committed(). */
		ut_ad(!dict_index_is_online_ddl(index));

		/* Change buffering is disabled for spatial index and
		virtual index. */
		mode = (dict_index_is_spatial(index)
			|| dict_index_has_virtual(index))
			? BTR_MODIFY_LEAF
			: BTR_PURGE_LEAF;
	}

	/* Set the purge node for the call to row_purge_poss_sec(). */
	pcur.btr_cur.purge_node = node;
	if (dict_index_is_spatial(index)) {
		rw_lock_sx_lock(dict_index_get_lock(index));
		pcur.btr_cur.thr = NULL;
	} else {
		/* Set the query thread, so that ibuf_insert_low() will be
		able to invoke thd_get_trx(). */
		pcur.btr_cur.thr = static_cast<que_thr_t*>(
			que_node_get_parent(node));
	}

	search_result = row_search_index_entry(
		index, entry, mode, &pcur, &mtr);

	if (dict_index_is_spatial(index)) {
		rw_lock_sx_unlock(dict_index_get_lock(index));
	}

	switch (search_result) {
	case ROW_FOUND:
		/* Before attempting to purge a record, check
		if it is safe to do so. */
		if (row_purge_poss_sec(node, index, entry, &pcur, &mtr, false)) {
			btr_cur_t* btr_cur = btr_pcur_get_btr_cur(&pcur);

			/* Only delete-marked records should be purged. */
			if (!rec_get_deleted_flag(
				    btr_cur_get_rec(btr_cur),
				    dict_table_is_comp(index->table))) {

				ib::error()
					<< "tried to purge non-delete-marked"
					" record in index " << index->name
					<< " of table " << index->table->name
					<< ": tuple: " << *entry
					<< ", record: "
					<< rec_index_print(
						btr_cur_get_rec(btr_cur),
						index);
				ut_ad(0);

				btr_pcur_close(&pcur);

				goto func_exit_no_pcur;
			}

			if (dict_index_is_spatial(index)) {
				const page_t*	page;
				const trx_t*	trx = NULL;

				if (btr_cur->rtr_info != NULL
				    && btr_cur->rtr_info->thr != NULL) {
					trx = thr_get_trx(
						btr_cur->rtr_info->thr);
				}

				page = btr_cur_get_page(btr_cur);

				if (!lock_test_prdt_page_lock(
					    trx,
					    page_get_space_id(page),
					    page_get_page_no(page))
				    && page_get_n_recs(page) < 2
				    && btr_cur_get_block(btr_cur)
				    ->page.id.page_no() !=
				    dict_index_get_page(index)) {
					/* this is the last record on page,
					and it has a "page" lock on it,
					which means a search is still
					depending on it, so do not delete */
					DBUG_LOG("purge",
						 "skip purging last"
						 " record on page "
						 << btr_cur_get_block(btr_cur)
						 ->page.id);

					btr_pcur_close(&pcur);
					mtr_commit(&mtr);
					return(success);
				}
			}

			if (!btr_cur_optimistic_delete(btr_cur, 0, &mtr)) {

				/* The index entry could not be deleted. */
				success = false;
			}
		}

		if (node->vcol_op_failed()) {
			btr_pcur_close(&pcur);
			return false;
		}

		/* (The index entry is still needed,
		or the deletion succeeded) */
		/* fall through */
	case ROW_NOT_DELETED_REF:
		/* The index entry is still needed. */
	case ROW_BUFFERED:
		/* The deletion was buffered. */
	case ROW_NOT_FOUND:
		/* The index entry does not exist, nothing to do. */
		btr_pcur_close(&pcur);
func_exit_no_pcur:
		mtr_commit(&mtr);
		return(success);
	}

	ut_error;
	return(false);
}
/***********************************************************//**
Removes a secondary index entry if possible. */
UNIV_INLINE MY_ATTRIBUTE((nonnull(1,2)))
void
row_purge_remove_sec_if_poss(
/*=========================*/
	purge_node_t*	node,	/*!< in: row purge node */
	dict_index_t*	index,	/*!< in: index */
	const dtuple_t*	entry)	/*!< in: index entry */
{
	ibool	success;
	ulint	n_tries	= 0;

	/*	fputs("Purge: Removing secondary record\n", stderr); */

	if (!entry) {
		/* The node->row must have lacked some fields of this
		index. This is possible when the undo log record was
		written before this index was created. */
		return;
	}

	if (row_purge_remove_sec_if_poss_leaf(node, index, entry)) {

		return;
	}
retry:
	if (node->vcol_op_failed()) {
		return;
	}

	success = row_purge_remove_sec_if_poss_tree(node, index, entry);
	/* The delete operation may fail if we have little
	file space left: TODO: easiest to crash the database
	and restart with more file space */

	if (!success && n_tries < BTR_CUR_RETRY_DELETE_N_TIMES) {

		n_tries++;

		os_thread_sleep(BTR_CUR_RETRY_SLEEP_TIME);

		goto retry;
	}

	ut_a(success);
}
/** Skip uncommitted virtual indexes on newly added virtual column.
@param[in,out]	index	dict index object */
static
inline
void
row_purge_skip_uncommitted_virtual_index(
	dict_index_t*&	index)
{
	/* We need to skip virtual indexes which are not
	committed yet. This is safe because these indexes are
	newly created by ALTER TABLE, and because we do
	not support LOCK=NONE when adding an index on a newly
	added virtual column. */
	while (index != NULL && dict_index_has_virtual(index)
	       && !index->is_committed() && index->has_new_v_col) {
		index = dict_table_get_next_index(index);
	}
}
/***********************************************************//**
Purges a delete marking of a record.
@retval true if the row was not found, or it was successfully removed
@retval false the purge needs to be suspended because of
running out of file space */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
bool
row_purge_del_mark(
/*===============*/
	purge_node_t*	node)	/*!< in/out: row purge node */
{
	mem_heap_t*	heap;

	heap = mem_heap_create(1024);

	while (node->index != NULL) {
		/* skip corrupted secondary index */
		dict_table_skip_corrupt_index(node->index);

		row_purge_skip_uncommitted_virtual_index(node->index);

		if (!node->index) {
			break;
		}

		if (node->index->type != DICT_FTS) {
			dtuple_t*	entry = row_build_index_entry_low(
				node->row, NULL, node->index,
				heap, ROW_BUILD_FOR_PURGE);
			row_purge_remove_sec_if_poss(node, node->index, entry);

			if (node->vcol_op_failed()) {
				mem_heap_free(heap);
				return false;
			}

			mem_heap_empty(heap);
		}

		node->index = dict_table_get_next_index(node->index);
	}

	mem_heap_free(heap);

	return(row_purge_remove_clust_if_poss(node));
}
/** Reset DB_TRX_ID, DB_ROLL_PTR of a clustered index record
whose old history can no longer be observed.
@param[in,out]	node	purge node
@param[in,out]	mtr	mini-transaction (will be started and committed) */
static void row_purge_reset_trx_id(purge_node_t* node, mtr_t* mtr)
{
	ut_ad(rw_lock_own(dict_operation_lock, RW_LOCK_S)
	      || node->vcol_info.is_used());
	/* Reset DB_TRX_ID, DB_ROLL_PTR for old records. */
	mtr->start();

	if (row_purge_reposition_pcur(BTR_MODIFY_LEAF, node, mtr)) {
		dict_index_t*	index = dict_table_get_first_index(
			node->table);
		ulint	trx_id_pos = index->n_uniq ? index->n_uniq : 1;
		rec_t*	rec = btr_pcur_get_rec(&node->pcur);
		mem_heap_t*	heap = NULL;
		/* Reserve enough offsets for the PRIMARY KEY and 2 columns
		so that we can access DB_TRX_ID, DB_ROLL_PTR. */
		ulint	offsets_[REC_OFFS_HEADER_SIZE + MAX_REF_PARTS + 2];
		rec_offs_init(offsets_);
		ulint*	offsets = rec_get_offsets(
			rec, index, offsets_, true, trx_id_pos + 2, &heap);
		ut_ad(heap == NULL);

		ut_ad(dict_index_get_nth_field(index, trx_id_pos)
		      ->col->mtype == DATA_SYS);
		ut_ad(dict_index_get_nth_field(index, trx_id_pos)
		      ->col->prtype == (DATA_TRX_ID | DATA_NOT_NULL));
		ut_ad(dict_index_get_nth_field(index, trx_id_pos + 1)
		      ->col->mtype == DATA_SYS);
		ut_ad(dict_index_get_nth_field(index, trx_id_pos + 1)
		      ->col->prtype == (DATA_ROLL_PTR | DATA_NOT_NULL));

		/* Only update the record if DB_ROLL_PTR matches (the
		record has not been modified after this transaction
		became purgeable) */
		if (node->roll_ptr
		    == row_get_rec_roll_ptr(rec, index, offsets)) {
			ut_ad(!rec_get_deleted_flag(rec,
						    rec_offs_comp(offsets)));
			DBUG_LOG("purge", "reset DB_TRX_ID="
				 << ib::hex(row_get_rec_trx_id(
						    rec, index, offsets)));

			index->set_modified(*mtr);
			if (page_zip_des_t* page_zip
			    = buf_block_get_page_zip(
				    btr_pcur_get_block(&node->pcur))) {
				page_zip_write_trx_id_and_roll_ptr(
					page_zip, rec, offsets, trx_id_pos,
					0, 1ULL << ROLL_PTR_INSERT_FLAG_POS,
					mtr);
			} else {
				ulint	len;
				byte*	ptr = rec_get_nth_field(
					rec, offsets, trx_id_pos, &len);
				ut_ad(len == DATA_TRX_ID_LEN);
				mlog_write_string(ptr, reset_trx_id,
						  sizeof reset_trx_id, mtr);
			}
		}
	}

	mtr->commit();
}
/***********************************************************//**
Purges an update of an existing record. Also purges an update of a delete
marked record if that record contained an externally stored field. */
static
void
row_purge_upd_exist_or_extern_func(
/*===============================*/
#ifdef UNIV_DEBUG
	const que_thr_t*thr,		/*!< in: query thread */
#endif /* UNIV_DEBUG */
	purge_node_t*	node,		/*!< in: row purge node */
	trx_undo_rec_t*	undo_rec)	/*!< in: record to purge */
{
	mem_heap_t*	heap;

	ut_ad(rw_lock_own(dict_operation_lock, RW_LOCK_S)
	      || node->vcol_info.is_used());
	ut_ad(!node->table->skip_alter_undo);

	if (node->rec_type == TRX_UNDO_UPD_DEL_REC
	    || (node->cmpl_info & UPD_NODE_NO_ORD_CHANGE)) {

		goto skip_secondaries;
	}

	heap = mem_heap_create(1024);

	while (node->index != NULL) {
		dict_table_skip_corrupt_index(node->index);

		row_purge_skip_uncommitted_virtual_index(node->index);

		if (!node->index) {
			break;
		}

		if (row_upd_changes_ord_field_binary(node->index, node->update,
						     thr, NULL, NULL)) {
			/* Build the older version of the index entry */
			dtuple_t*	entry = row_build_index_entry_low(
				node->row, NULL, node->index,
				heap, ROW_BUILD_FOR_PURGE);
			row_purge_remove_sec_if_poss(node, node->index, entry);

			mem_heap_empty(heap);
		}

		node->index = dict_table_get_next_index(node->index);
	}

	mem_heap_free(heap);

skip_secondaries:
	mtr_t		mtr;
	dict_index_t*	index = dict_table_get_first_index(node->table);
	/* Free possible externally stored fields */
	for (ulint i = 0; i < upd_get_n_fields(node->update); i++) {

		const upd_field_t*	ufield
			= upd_get_nth_field(node->update, i);

		if (dfield_is_ext(&ufield->new_val)) {
			trx_rseg_t*	rseg;
			buf_block_t*	block;
			ulint		internal_offset;
			byte*		data_field;
			ibool		is_insert;
			ulint		rseg_id;
			ulint		page_no;
			ulint		offset;

			/* We use the fact that new_val points to
			undo_rec to get the offset of the
			dfield data inside the undo record. Then we
			can calculate from node->roll_ptr the file
			address of the new_val data */

			internal_offset = ulint(
				static_cast<const byte*>
				(dfield_get_data(&ufield->new_val))
				- undo_rec);

			ut_a(internal_offset < srv_page_size);

			trx_undo_decode_roll_ptr(node->roll_ptr,
						 &is_insert, &rseg_id,
						 &page_no, &offset);

			rseg = trx_sys.rseg_array[rseg_id];

			ut_a(rseg != NULL);
			ut_ad(rseg->id == rseg_id);
			ut_ad(rseg->is_persistent());

			mtr_start(&mtr);

			/* We have to acquire an SX-latch to the clustered
			index tree (exclude other tree changes) */

			mtr_sx_lock(dict_index_get_lock(index), &mtr);

			index->set_modified(mtr);

			/* NOTE: we must also acquire an X-latch to the
			root page of the tree. We will need it when we
			free pages from the tree. If the tree is of height 1,
			the tree X-latch does NOT protect the root page,
			because it is also a leaf page. Since we will have a
			latch on an undo log page, we would break the
			latching order if we would only later latch the
			root page of such a tree! */

			btr_root_get(index, &mtr);

			block = buf_page_get(
				page_id_t(rseg->space->id, page_no),
				univ_page_size, RW_X_LATCH, &mtr);

			buf_block_dbg_add_level(block, SYNC_TRX_UNDO_PAGE);

			data_field = buf_block_get_frame(block)
				+ offset + internal_offset;

			ut_a(dfield_get_len(&ufield->new_val)
			     >= BTR_EXTERN_FIELD_REF_SIZE);
			btr_free_externally_stored_field(
				index,
				data_field + dfield_get_len(&ufield->new_val)
				- BTR_EXTERN_FIELD_REF_SIZE,
				NULL, NULL, NULL, 0, false, &mtr);
			mtr_commit(&mtr);
		}
	}

	row_purge_reset_trx_id(node, &mtr);
}

#ifdef UNIV_DEBUG
# define row_purge_upd_exist_or_extern(thr,node,undo_rec)	\
	row_purge_upd_exist_or_extern_func(thr,node,undo_rec)
#else /* UNIV_DEBUG */
# define row_purge_upd_exist_or_extern(thr,node,undo_rec)	\
	row_purge_upd_exist_or_extern_func(node,undo_rec)
#endif /* UNIV_DEBUG */
/***********************************************************//**
Parses the row reference and other info in a modify undo log record.
@return true if purge operation required */
static
bool
row_purge_parse_undo_rec(
/*=====================*/
	purge_node_t*		node,		/*!< in: row undo node */
	trx_undo_rec_t*		undo_rec,	/*!< in: record to purge */
	bool*			updated_extern, /*!< out: true if an externally
						stored field was updated */
	que_thr_t*		thr)		/*!< in: query thread */
{
	dict_index_t*	clust_index;
	byte*		ptr;
	undo_no_t	undo_no;
	table_id_t	table_id;
	roll_ptr_t	roll_ptr;
	ulint		info_bits;
	ulint		type;

	ut_ad(node != NULL);
	ut_ad(thr != NULL);

	ptr = trx_undo_rec_get_pars(
		undo_rec, &type, &node->cmpl_info,
		updated_extern, &undo_no, &table_id);

	node->rec_type = type;

	switch (type) {
	case TRX_UNDO_RENAME_TABLE:
		return false;
	case TRX_UNDO_INSERT_METADATA:
	case TRX_UNDO_INSERT_REC:
		break;
	default:
#ifdef UNIV_DEBUG
		ut_ad(!"unknown undo log record type");
		return false;
	case TRX_UNDO_UPD_DEL_REC:
	case TRX_UNDO_UPD_EXIST_REC:
	case TRX_UNDO_DEL_MARK_REC:
#endif /* UNIV_DEBUG */
		ptr = trx_undo_update_rec_get_sys_cols(ptr, &node->trx_id,
						       &roll_ptr, &info_bits);
		break;
	}

	if (node->is_skipped(table_id)) {
		return false;
	}

	/* Prevent DROP TABLE etc. from running when we are doing the purge
	for this row */

try_again:
	rw_lock_s_lock_inline(dict_operation_lock, 0, __FILE__, __LINE__);

	node->table = dict_table_open_on_id(
		table_id, FALSE, DICT_TABLE_OP_NORMAL);

	trx_id_t trx_id;

	if (node->table == NULL) {
		/* The table has been dropped: no need to do purge */
		trx_id = TRX_ID_MAX;
		goto err_exit;
	}

	ut_ad(!node->table->is_temporary());

	if (!fil_table_accessible(node->table)) {
		goto inaccessible;
	}

	switch (type) {
	case TRX_UNDO_INSERT_METADATA:
	case TRX_UNDO_INSERT_REC:
		break;
	default:
		if (!node->table->n_v_cols || node->table->vc_templ
		    || !dict_table_has_indexed_v_cols(node->table)) {
			break;
		}
		/* The server must be fully up for virtual column
		computation. */
		if (!mysqld_server_started) {

			dict_table_close(node->table, FALSE, FALSE);
			rw_lock_s_unlock(dict_operation_lock);
			if (srv_shutdown_state != SRV_SHUTDOWN_NONE) {
				return(false);
			}
			os_thread_sleep(1000000);
			goto try_again;
		}

		node->vcol_info.set_requested();
		node->vcol_info.set_used();
		node->vcol_info.set_table(innobase_init_vc_templ(node->table));
		node->vcol_info.set_used();
	}

	clust_index = dict_table_get_first_index(node->table);

	if (!clust_index || clust_index->is_corrupted()) {
		/* The table was corrupt in the data dictionary.
		dict_set_corrupted() works on an index, and
		we do not have an index to call it with. */
inaccessible:
		DBUG_ASSERT(table_id == node->table->id);
		trx_id = node->table->def_trx_id;
		if (!trx_id) {
			trx_id = TRX_ID_MAX;
		}

		dict_table_close(node->table, FALSE, FALSE);
		node->table = NULL;
err_exit:
		rw_lock_s_unlock(dict_operation_lock);
		if (table_id) {
			node->skip(table_id, trx_id);
		}
		return(false);
	}

	if (type == TRX_UNDO_INSERT_METADATA) {
		node->ref = &trx_undo_metadata;
		return(true);
	}

	ptr = trx_undo_rec_get_row_ref(ptr, clust_index, &(node->ref),
				       node->heap);

	if (type == TRX_UNDO_INSERT_REC) {
		return(true);
	}

	ptr = trx_undo_update_rec_get_update(ptr, clust_index, type,
					     node->trx_id,
					     roll_ptr, info_bits,
					     node->heap, &(node->update));

	/* Read to the partial row the fields that occur in indexes */

	if (!(node->cmpl_info & UPD_NODE_NO_ORD_CHANGE)) {
		ut_ad(!(node->update->info_bits & REC_INFO_MIN_REC_FLAG));
		ptr = trx_undo_rec_get_partial_row(
			ptr, clust_index, node->update, &node->row,
			type == TRX_UNDO_UPD_DEL_REC,
			node->heap);
	} else if (node->update->info_bits & REC_INFO_MIN_REC_FLAG) {
		node->ref = &trx_undo_metadata;
	}

	return(true);
}
/***********************************************************//**
Purges the parsed record.
@return true if purged, false if skipped */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
bool
row_purge_record_func(
/*==================*/
	purge_node_t*	node,		/*!< in: row purge node */
	trx_undo_rec_t*	undo_rec,	/*!< in: record to purge */
#if defined UNIV_DEBUG || defined WITH_WSREP
	const que_thr_t*thr,		/*!< in: query thread */
#endif /* UNIV_DEBUG || WITH_WSREP */
	bool		updated_extern)	/*!< in: whether external columns
					were updated */
{
	dict_index_t*	clust_index;
	bool		purged		= true;

	ut_ad(!node->found_clust);
	ut_ad(!node->table->skip_alter_undo);

	clust_index = dict_table_get_first_index(node->table);

	node->index = dict_table_get_next_index(clust_index);
	ut_ad(!trx_undo_roll_ptr_is_insert(node->roll_ptr));

	switch (node->rec_type) {
	case TRX_UNDO_DEL_MARK_REC:
		purged = row_purge_del_mark(node);
		if (purged) {
			if (node->table->stat_initialized
			    && srv_stats_include_delete_marked) {
				dict_stats_update_if_needed(
					node->table,
					thr->graph->trx->mysql_thd);
			}
			MONITOR_INC(MONITOR_N_DEL_ROW_PURGE);
		}
		break;
	case TRX_UNDO_INSERT_METADATA:
	case TRX_UNDO_INSERT_REC:
		node->roll_ptr |= 1ULL << ROLL_PTR_INSERT_FLAG_POS;
		/* fall through */
	default:
		if (!updated_extern) {
			mtr_t		mtr;
			row_purge_reset_trx_id(node, &mtr);
			break;
		}
		/* fall through */
	case TRX_UNDO_UPD_EXIST_REC:
		row_purge_upd_exist_or_extern(thr, node, undo_rec);
		MONITOR_INC(MONITOR_N_UPD_EXIST_EXTERN);
		break;
	}

	if (node->found_clust) {
		btr_pcur_close(&node->pcur);
		node->found_clust = FALSE;
	}

	if (node->table != NULL) {
		dict_table_close(node->table, FALSE, FALSE);
		node->table = NULL;
	}

	return(purged);
}

#if defined UNIV_DEBUG || defined WITH_WSREP
# define row_purge_record(node,undo_rec,thr,updated_extern)	\
	row_purge_record_func(node,undo_rec,thr,updated_extern)
#else /* UNIV_DEBUG || WITH_WSREP */
# define row_purge_record(node,undo_rec,thr,updated_extern)	\
	row_purge_record_func(node,undo_rec,updated_extern)
#endif /* UNIV_DEBUG || WITH_WSREP */
/***********************************************************//**
Fetches an undo log record and does the purge for the recorded operation.
If none left, or the current purge completed, returns the control to the
parent node, which is always a query thread node. */
static MY_ATTRIBUTE((nonnull))
void
row_purge(
/*======*/
	purge_node_t*	node,		/*!< in: row purge node */
	trx_undo_rec_t*	undo_rec,	/*!< in: record to purge */
	que_thr_t*	thr)		/*!< in: query thread */
{
	if (undo_rec != &trx_purge_dummy_rec) {
		bool	updated_extern;

		while (row_purge_parse_undo_rec(
			       node, undo_rec, &updated_extern, thr)) {

			bool purged = row_purge_record(
				node, undo_rec, thr, updated_extern);

			if (!node->vcol_info.is_used()) {
				rw_lock_s_unlock(dict_operation_lock);
			}

			ut_ad(!rw_lock_own(dict_operation_lock, RW_LOCK_S));

			if (purged
			    || srv_shutdown_state != SRV_SHUTDOWN_NONE
			    || node->vcol_op_failed()) {
				return;
			}

			/* Retry the purge in a second. */
			os_thread_sleep(1000000);
		}
	}
}

/***********************************************************//**
Reset the purge query thread. */
UNIV_INLINE
void
row_purge_end(
/*==========*/
	que_thr_t*	thr)	/*!< in: query thread */
{
	ut_ad(thr);

	thr->run_node = static_cast<purge_node_t*>(thr->run_node)->end();

	ut_a(thr->run_node != NULL);
}

/***********************************************************//**
Does the purge operation for a single undo log record. This is a high-level
function used in an SQL execution graph.
@return query thread to run next or NULL */
que_thr_t*
row_purge_step(
/*===========*/
	que_thr_t*	thr)	/*!< in: query thread */
{
	purge_node_t*	node;

	ut_ad(thr);

	node = static_cast<purge_node_t*>(thr->run_node);

	node->start();

	if (!(node->undo_recs == NULL || ib_vector_is_empty(node->undo_recs))) {
		trx_purge_rec_t*purge_rec;

		purge_rec = static_cast<trx_purge_rec_t*>(
			ib_vector_pop(node->undo_recs));

		node->roll_ptr = purge_rec->roll_ptr;

		row_purge(node, purge_rec->undo_rec, thr);

		if (ib_vector_is_empty(node->undo_recs)) {
			row_purge_end(thr);
		} else {
			thr->run_node = node;
			node->vcol_info.reset();
		}
	} else {
		row_purge_end(thr);
	}

	innobase_reset_background_thd(thr_get_trx(thr)->mysql_thd);
	return(thr);
}
#ifdef UNIV_DEBUG
/***********************************************************//**
Validate the persistent cursor. The purge node has two references
to the clustered index record - one via the ref member, and the
other via the persistent cursor. These two references must match
each other if the found_clust flag is set.
@return true if the stored copy of the persistent cursor is consistent
with the ref member. */
bool
purge_node_t::validate_pcur()
{
	if (!found_clust) {
		return(true);
	}

	if (index == NULL) {
		return(true);
	}

	if (index->type == DICT_FTS) {
		return(true);
	}

	if (!pcur.old_stored) {
		return(true);
	}

	dict_index_t*	clust_index = pcur.btr_cur.index;

	ulint*	offsets = rec_get_offsets(
		pcur.old_rec, clust_index, NULL, true,
		pcur.old_n_fields, &heap);

	/* Here we are comparing the purge ref record and the stored initial
	part in the persistent cursor. In both cases we store the n_uniq
	fields of the clustered index, so it is fine to do the comparison.
	We note this dependency here as pcur and ref belong to different
	modules. */
	int st = cmp_dtuple_rec(ref, pcur.old_rec, offsets);

	if (st != 0) {
		ib::error() << "Purge node pcur validation failed";
		ib::error() << rec_printer(ref).str();
		ib::error() << rec_printer(pcur.old_rec, offsets).str();
		return(false);
	}

	return(true);
}
#endif /* UNIV_DEBUG */