Bug#24346574 PAGE CLEANER THREAD, ASSERT BLOCK->N_POINTERS == 0

btr_search_drop_page_hash_index(): Do not return before ensuring that
block->index=NULL, even if !btr_search_enabled. We would typically still
skip acquiring the AHI latch when the AHI is disabled, because
block->index would already be NULL. Only if the AHI is in the process of
being disabled would we wait for the AHI latch and then notice that
block->index=NULL and return.

The above bug was a regression caused in MySQL 5.7.9 by the fix of
Bug#21407023: DISABLING AHI SHOULD AVOID TAKING AHI LATCH

The rest of this patch improves diagnostics by adding assertions.

assert_block_ahi_valid(): A debug predicate for checking that
block->n_pointers!=0 implies block->index!=NULL.

assert_block_ahi_empty(): A debug predicate for checking that
block->n_pointers==0.

buf_block_init(): Instead of assigning block->n_pointers=0, invoke
assert_block_ahi_empty(block).

buf_pool_clear_hash_index(): Clarify comments, and assign
block->n_pointers=0 before assigning block->index=NULL. The wrong
ordering could make block->n_pointers appear incorrect in debug
assertions. This bug was introduced in MySQL 5.1.52 by
Bug#13006367 62487: INNODB TAKES 3 MINUTES TO CLEAN UP THE ADAPTIVE
HASH INDEX AT SHUTDOWN

i_s_innodb_buffer_page_get_info(): Add a comment that the IS_HASHED
column in the INFORMATION_SCHEMA views INNODB_BUFFER_POOL_PAGE and
INNODB_BUFFER_PAGE_LRU may show false positives (there may be no
pointers after all).

ha_insert_for_fold_func(), ha_delete_hash_node(),
ha_search_and_update_if_found_func(): Use atomics for updating
buf_block_t::n_pointers. While buf_block_t::index is always protected
by btr_search_x_lock(index), in ha_insert_for_fold_func() the
n_pointers-- may belong to another dict_index_t whose
btr_search_latches[] we are not holding.

RB: 13879
Reviewed-by: Jimmy Yang <jimmy.yang@oracle.com>
9 years ago
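To make the invariants above concrete, here is a minimal self-contained
C++ sketch of the two debug predicates and the atomic n_pointers
updates. It models the commit's idea with std::atomic and assert()
rather than InnoDB's ut_a() and my_atomic wrappers, and the struct
layouts are simplified assumptions, not the real buf_block_t:

    #include <atomic>
    #include <cassert>

    struct dict_index_t;                    // opaque stand-in

    struct buf_block_t {                    // simplified stand-in
      const dict_index_t* index{nullptr};   // AHI: index that hashes this page
      std::atomic<int>    n_pointers{0};    // AHI: hash entries pointing here
    };

    // assert_block_ahi_valid(): n_pointers != 0 must imply index != NULL.
    inline void assert_block_ahi_valid(const buf_block_t* block) {
      assert(block->index != nullptr ||
             block->n_pointers.load(std::memory_order_relaxed) == 0);
    }

    // assert_block_ahi_empty(): the block must carry no AHI pointers at all.
    inline void assert_block_ahi_empty(const buf_block_t* block) {
      assert(block->n_pointers.load(std::memory_order_relaxed) == 0);
    }

    // Analogue of ha_insert_for_fold_func()/ha_delete_hash_node(): the
    // counter being adjusted may belong to another dict_index_t whose
    // AHI latch we are not holding, so a plain ++/-- would be a data race.
    inline void ahi_pointer_added(buf_block_t* block) {
      block->n_pointers.fetch_add(1, std::memory_order_relaxed);
    }
    inline void ahi_pointer_removed(buf_block_t* block) {
      block->n_pointers.fetch_sub(1, std::memory_order_relaxed);
    }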
MDEV-16283 ALTER TABLE...DISCARD TABLESPACE still takes long on a large buffer pool

Also fixes MDEV-14727, MDEV-14491
InnoDB: Error: Waited for 5 secs for hash index ref_count (1) to drop to 0
by replacing the flawed wait logic in dict_index_remove_from_cache_low().

On DISCARD TABLESPACE, there is no need to drop the adaptive hash index.
We must drop it on IMPORT TABLESPACE, and eventually on DROP TABLE or
DROP INDEX. As long as the dict_index_t object remains in the cache and
the table remains inaccessible, the adaptive hash index entries pointing
to orphaned pages would not do any harm. They would be dropped when
buffer pool pages are reused for something else.

btr_search_drop_page_hash_when_freed(), buf_LRU_drop_page_hash_batch():
Remove the parameter zip_size, and pass 0 to buf_page_get_gen().

buf_page_get_gen(): Ignore zip_size if mode==BUF_PEEK_IF_IN_POOL.

buf_LRU_drop_page_hash_for_tablespace(): New global function, to drop
the adaptive hash index, even if the tablespace is inaccessible.

buf_LRU_flush_or_remove_pages(), fil_delete_tablespace(): Remove the
parameter drop_ahi.

dict_index_remove_from_cache_low(): Actively drop the adaptive hash
index if entries exist. This should prevent InnoDB hangs on DROP TABLE
or DROP INDEX.

row_import_for_mysql(): Drop any adaptive hash index entries for the
table.

row_drop_table_for_mysql(): Drop any adaptive hash index for the table,
except if the table resides in the system tablespace. (DISCARD
TABLESPACE does not apply to the system tablespace, and we do not want
to drop the adaptive hash index for other tables than the one that is
being dropped.)

row_truncate_table_for_mysql(): Drop any adaptive hash index entries
for the table, except if the table resides in the system tablespace.
8 years ago
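As a rough illustration of the dict_index_remove_from_cache_low()
change, the following self-contained C++ sketch contrasts the removed
wait loop with active dropping. The ref-count field and the
page-dropping helper are simplified assumptions, not the actual InnoDB
code:

    #include <atomic>

    struct dict_index_t {
      // Number of buffer pool pages currently hashed for this index
      // (modelled loosely after btr_search_t::ref_count).
      std::atomic<unsigned> ahi_ref_count{0};
    };

    // Stand-in for dropping the AHI entries of one page hashed for this
    // index (btr_search_drop_page_hash_index() in InnoDB).
    static void drop_one_hashed_page(dict_index_t& index) {
      index.ahi_ref_count.fetch_sub(1, std::memory_order_release);
    }

    // Old behaviour: sleep and re-check, eventually logging
    // "Waited for 5 secs for hash index ref_count (1) to drop to 0".
    // New behaviour: actively drop the remaining entries, so DROP TABLE
    // and DROP INDEX cannot hang behind a non-draining counter.
    void dict_index_remove_from_cache_low(dict_index_t& index) {
      while (index.ahi_ref_count.load(std::memory_order_acquire) != 0) {
        drop_one_hashed_page(index);
      }
      // ... now safe to free the dict_index_t object ...
    }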
MDEV-13564 Mariabackup does not work with TRUNCATE

Implement undo tablespace truncation via normal redo logging.

Implement TRUNCATE TABLE as a combination of RENAME to a #sql-ib name,
CREATE, and DROP.

Note: An orphan #sql-ib*.ibd file may be left behind if MariaDB Server
10.2 is killed before the DROP operation is committed. If MariaDB
Server 10.2 is killed during TRUNCATE, it is also possible that the old
table was renamed to #sql-ib*.ibd but the data dictionary will refer to
the table using the original name.

In MariaDB Server 10.3, RENAME inside InnoDB is transactional, and
#sql-* tables will be dropped on startup. So, this new TRUNCATE will be
fully crash-safe in 10.3.

ha_mroonga::wrapper_truncate(): Pass table options to the underlying
storage engine, now that ha_innobase::truncate() will need them.

rpl_slave_state::truncate_state_table(): Before truncating
mysql.gtid_slave_pos, evict any cached table handles from the table
definition cache, so that there will be no stale references to the old
table after truncating.

== TRUNCATE TABLE ==

WL#6501 in MySQL 5.7 introduced separate log files for implementing
atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB undo
and redo log. Some convoluted logic was added to the InnoDB crash
recovery, and some extra synchronization (including a redo log
checkpoint) was introduced to make this work. This synchronization has
caused performance problems and race conditions, and the extra log
files cannot be copied or applied by external backup programs.

In order to support crash-upgrade from MariaDB 10.2, we will keep the
logic for parsing and applying the extra log files, but we will no
longer generate those files in TRUNCATE TABLE.

A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE
(with full redo and undo logging and proper rollback). This will be
implemented in MDEV-14717.

ha_innobase::truncate(): Invoke RENAME, create(), delete_table().
Because RENAME cannot be fully rolled back before MariaDB 10.3 due to
missing undo logging, add some explicit rename-back in case the
operation fails.

ha_innobase::delete_table(): Introduce a variant that takes sqlcom as a
parameter. In TRUNCATE TABLE, we do not want to touch any FOREIGN KEY
constraints.

ha_innobase::create(): Add the parameters file_per_table, trx. In
TRUNCATE, the new table must be created in the same transaction that
renames the old table.

create_table_info_t::create_table_info_t(): Add the parameters
file_per_table, trx.

row_drop_table_for_mysql(): Replace a bool parameter with sqlcom.

row_drop_table_after_create_fail(): New function, wrapping
row_drop_table_for_mysql().

dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(),
fil_prepare_for_truncate(), fil_reinit_space_header_for_table(),
row_truncate_table_for_mysql(), TruncateLogger,
row_truncate_prepare(), row_truncate_rollback(),
row_truncate_complete(), row_truncate_fts(),
row_truncate_update_system_tables(),
row_truncate_foreign_key_checks(), row_truncate_sanity_checks():
Remove.

row_upd_check_references_constraints(): Remove a check for TRUNCATE,
now that the table is no longer truncated in place.

The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some
race-condition-like scenarios. The test innodb.truncate does not use
any synchronization.

We add a redo log subformat to indicate the backup-friendly format.
MariaDB 10.4 will remove support for the old TRUNCATE logging, so
crash-upgrade from old 10.2 or 10.3 to 10.4 will involve limitations.

== Undo tablespace truncation ==

MySQL 5.7 implements undo tablespace truncation. It is only possible
when innodb_undo_tablespaces is set to at least 2. The logging is
implemented similarly to the WL#6501 TRUNCATE, that is, using separate
log files and a redo log checkpoint.

We can simply implement undo tablespace truncation within a single
mini-transaction that reinitializes the undo log tablespace file.
Unfortunately, due to the redo log format of some operations, the total
redo log currently written by undo tablespace truncation will be more
than the combined size of the truncated undo tablespace. It should be
acceptable to have a little more than 1 megabyte of log in a single
mini-transaction. This will be fixed in MDEV-17138 in MariaDB Server
10.4.

recv_sys_t: Add truncated_undo_spaces[] to remember for which undo
tablespaces a MLOG_FILE_CREATE2 record was seen.

namespace undo: Remove some unnecessary declarations.

fil_space_t::is_being_truncated: Document that this flag now only
applies to undo tablespaces. Remove some references.

fil_space_t::is_stopping(): Do not refer to is_being_truncated. This
check is for tablespaces of tables. Potentially used tablespaces are
never truncated any more.

buf_dblwr_process(): Suppress the out-of-bounds warning for undo
tablespaces.

fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero page
number (the new size of the tablespace in pages) to inform crash
recovery that the undo tablespace size has been reduced.

fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2 can be
written for undo tablespaces (without the .ibd file suffix) with a
nonzero page number.

os_file_truncate(): Add the parameter allow_shrink=false so that undo
tablespaces can actually be shrunk using this function.

fil_name_parse(): For undo tablespace truncation, buffer
MLOG_FILE_CREATE2 in truncated_undo_spaces[].

recv_read_in_area(): Avoid reading pages for which no redo log records
remain buffered, after recv_addr_trim() removed them.

trx_rseg_header_create(): Add a FIXME comment that we could write much
less redo log.

trx_undo_truncate_tablespace(): Reinitialize the undo tablespace in a
single mini-transaction, which will be flushed to the redo log before
the file size is trimmed.

recv_addr_trim(): Discard any redo logs for pages that were logged
after the new end of the file, before the truncation LSN. If the
rec_list becomes empty, reduce n_addrs. After removing any affected
records, actually truncate the file.

recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before
applying any log records. The undo tablespace files must be open at
this point.

buf_flush_or_remove_pages(), buf_flush_dirty_pages(),
buf_LRU_flush_or_remove_pages(): Add a parameter for specifying the
number of the first page to flush or remove (default 0).

trx_purge_initiate_truncate(): Remove the log checkpoints, the extra
logging, and some unnecessary crash points. Merge the code from
trx_undo_truncate_tablespace(). First, flush all to-be-discarded pages
(beyond the new end of the file), then trim the space->size to make
the page allocation deterministic. At the only remaining crash
injection point, flush the redo log, so that the recovery can be
tested.
7 years ago
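The RENAME/CREATE/DROP composition of the new ha_innobase::truncate()
can be sketched as below. This is a hedged outline under assumed helper
names (rename_table, create_table, drop_table are illustrative stubs,
not the InnoDB API), showing only the control flow and the explicit
rename-back on failure described above:

    #include <string>

    enum class db_err { SUCCESS, ERROR };

    // Illustrative stubs; in InnoDB these map to dictionary operations
    // running inside one trx_t.
    static db_err rename_table(const std::string&, const std::string&)
        { return db_err::SUCCESS; }
    static db_err create_table(const std::string&)  // same trx as the rename
        { return db_err::SUCCESS; }
    static db_err drop_table(const std::string&)    // skips FOREIGN KEY checks
        { return db_err::SUCCESS; }

    // TRUNCATE t ~ RENAME t TO #sql-ib<id>; CREATE t; DROP #sql-ib<id>.
    db_err truncate(const std::string& name) {
      const std::string tmp = "#sql-ib-tmp";  // placeholder temporary name
      if (rename_table(name, tmp) != db_err::SUCCESS)
        return db_err::ERROR;
      if (create_table(name) != db_err::SUCCESS) {
        // RENAME is not yet transactional in 10.2, so roll it back by hand.
        rename_table(tmp, name);
        return db_err::ERROR;
      }
      return drop_table(tmp);
    }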
MDEV-13564 Mariabackup does not work with TRUNCATE Implement undo tablespace truncation via normal redo logging. Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name, CREATE, and DROP. Note: Orphan #sql-ib*.ibd may be left behind if MariaDB Server 10.2 is killed before the DROP operation is committed. If MariaDB Server 10.2 is killed during TRUNCATE, it is also possible that the old table was renamed to #sql-ib*.ibd but the data dictionary will refer to the table using the original name. In MariaDB Server 10.3, RENAME inside InnoDB is transactional, and #sql-* tables will be dropped on startup. So, this new TRUNCATE will be fully crash-safe in 10.3. ha_mroonga::wrapper_truncate(): Pass table options to the underlying storage engine, now that ha_innobase::truncate() will need them. rpl_slave_state::truncate_state_table(): Before truncating mysql.gtid_slave_pos, evict any cached table handles from the table definition cache, so that there will be no stale references to the old table after truncating. == TRUNCATE TABLE == WL#6501 in MySQL 5.7 introduced separate log files for implementing atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB undo and redo log. Some convoluted logic was added to the InnoDB crash recovery, and some extra synchronization (including a redo log checkpoint) was introduced to make this work. This synchronization has caused performance problems and race conditions, and the extra log files cannot be copied or applied by external backup programs. In order to support crash-upgrade from MariaDB 10.2, we will keep the logic for parsing and applying the extra log files, but we will no longer generate those files in TRUNCATE TABLE. A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE (with full redo and undo logging and proper rollback). This will be implemented in MDEV-14717. ha_innobase::truncate(): Invoke RENAME, create(), delete_table(). Because RENAME cannot be fully rolled back before MariaDB 10.3 due to missing undo logging, add some explicit rename-back in case the operation fails. ha_innobase::delete(): Introduce a variant that takes sqlcom as a parameter. In TRUNCATE TABLE, we do not want to touch any FOREIGN KEY constraints. ha_innobase::create(): Add the parameters file_per_table, trx. In TRUNCATE, the new table must be created in the same transaction that renames the old table. create_table_info_t::create_table_info_t(): Add the parameters file_per_table, trx. row_drop_table_for_mysql(): Replace a bool parameter with sqlcom. row_drop_table_after_create_fail(): New function, wrapping row_drop_table_for_mysql(). dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(), fil_prepare_for_truncate(), fil_reinit_space_header_for_table(), row_truncate_table_for_mysql(), TruncateLogger, row_truncate_prepare(), row_truncate_rollback(), row_truncate_complete(), row_truncate_fts(), row_truncate_update_system_tables(), row_truncate_foreign_key_checks(), row_truncate_sanity_checks(): Remove. row_upd_check_references_constraints(): Remove a check for TRUNCATE, now that the table is no longer truncated in place. The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some race-condition like scenarios. The test innodb-innodb.truncate does not use any synchronization. We add a redo log subformat to indicate backup-friendly format. MariaDB 10.4 will remove support for the old TRUNCATE logging, so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve limitations. 
== Undo tablespace truncation == MySQL 5.7 implements undo tablespace truncation. It is only possible when innodb_undo_tablespaces is set to at least 2. The logging is implemented similar to the WL#6501 TRUNCATE, that is, using separate log files and a redo log checkpoint. We can simply implement undo tablespace truncation within a single mini-transaction that reinitializes the undo log tablespace file. Unfortunately, due to the redo log format of some operations, currently, the total redo log written by undo tablespace truncation will be more than the combined size of the truncated undo tablespace. It should be acceptable to have a little more than 1 megabyte of log in a single mini-transaction. This will be fixed in MDEV-17138 in MariaDB Server 10.4. recv_sys_t: Add truncated_undo_spaces[] to remember for which undo tablespaces a MLOG_FILE_CREATE2 record was seen. namespace undo: Remove some unnecessary declarations. fil_space_t::is_being_truncated: Document that this flag now only applies to undo tablespaces. Remove some references. fil_space_t::is_stopping(): Do not refer to is_being_truncated. This check is for tablespaces of tables. Potentially used tablespaces are never truncated any more. buf_dblwr_process(): Suppress the out-of-bounds warning for undo tablespaces. fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero page number (new size of the tablespace in pages) to inform crash recovery that the undo tablespace size has been reduced. fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2 can be written for undo tablespaces (without .ibd file suffix) for a nonzero page number. os_file_truncate(): Add the parameter allow_shrink=false so that undo tablespaces can actually be shrunk using this function. fil_name_parse(): For undo tablespace truncation, buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[]. recv_read_in_area(): Avoid reading pages for which no redo log records remain buffered, after recv_addr_trim() removed them. trx_rseg_header_create(): Add a FIXME comment that we could write much less redo log. trx_undo_truncate_tablespace(): Reinitialize the undo tablespace in a single mini-transaction, which will be flushed to the redo log before the file size is trimmed. recv_addr_trim(): Discard any redo logs for pages that were logged after the new end of a file, before the truncation LSN. If the rec_list becomes empty, reduce n_addrs. After removing any affected records, actually truncate the file. recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before applying any log records. The undo tablespace files must be open at this point. buf_flush_or_remove_pages(), buf_flush_dirty_pages(), buf_LRU_flush_or_remove_pages(): Add a parameter for specifying the number of the first page to flush or remove (default 0). trx_purge_initiate_truncate(): Remove the log checkpoints, the extra logging, and some unnecessary crash points. Merge the code from trx_undo_truncate_tablespace(). First, flush all to-be-discarded pages (beyond the new end of the file), then trim the space->size to make the page allocation deterministic. At the only remaining crash injection point, flush the redo log, so that the recovery can be tested.
7 years ago
MDEV-13564 Mariabackup does not work with TRUNCATE Implement undo tablespace truncation via normal redo logging. Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name, CREATE, and DROP. Note: Orphan #sql-ib*.ibd may be left behind if MariaDB Server 10.2 is killed before the DROP operation is committed. If MariaDB Server 10.2 is killed during TRUNCATE, it is also possible that the old table was renamed to #sql-ib*.ibd but the data dictionary will refer to the table using the original name. In MariaDB Server 10.3, RENAME inside InnoDB is transactional, and #sql-* tables will be dropped on startup. So, this new TRUNCATE will be fully crash-safe in 10.3. ha_mroonga::wrapper_truncate(): Pass table options to the underlying storage engine, now that ha_innobase::truncate() will need them. rpl_slave_state::truncate_state_table(): Before truncating mysql.gtid_slave_pos, evict any cached table handles from the table definition cache, so that there will be no stale references to the old table after truncating. == TRUNCATE TABLE == WL#6501 in MySQL 5.7 introduced separate log files for implementing atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB undo and redo log. Some convoluted logic was added to the InnoDB crash recovery, and some extra synchronization (including a redo log checkpoint) was introduced to make this work. This synchronization has caused performance problems and race conditions, and the extra log files cannot be copied or applied by external backup programs. In order to support crash-upgrade from MariaDB 10.2, we will keep the logic for parsing and applying the extra log files, but we will no longer generate those files in TRUNCATE TABLE. A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE (with full redo and undo logging and proper rollback). This will be implemented in MDEV-14717. ha_innobase::truncate(): Invoke RENAME, create(), delete_table(). Because RENAME cannot be fully rolled back before MariaDB 10.3 due to missing undo logging, add some explicit rename-back in case the operation fails. ha_innobase::delete(): Introduce a variant that takes sqlcom as a parameter. In TRUNCATE TABLE, we do not want to touch any FOREIGN KEY constraints. ha_innobase::create(): Add the parameters file_per_table, trx. In TRUNCATE, the new table must be created in the same transaction that renames the old table. create_table_info_t::create_table_info_t(): Add the parameters file_per_table, trx. row_drop_table_for_mysql(): Replace a bool parameter with sqlcom. row_drop_table_after_create_fail(): New function, wrapping row_drop_table_for_mysql(). dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(), fil_prepare_for_truncate(), fil_reinit_space_header_for_table(), row_truncate_table_for_mysql(), TruncateLogger, row_truncate_prepare(), row_truncate_rollback(), row_truncate_complete(), row_truncate_fts(), row_truncate_update_system_tables(), row_truncate_foreign_key_checks(), row_truncate_sanity_checks(): Remove. row_upd_check_references_constraints(): Remove a check for TRUNCATE, now that the table is no longer truncated in place. The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some race-condition like scenarios. The test innodb-innodb.truncate does not use any synchronization. We add a redo log subformat to indicate backup-friendly format. MariaDB 10.4 will remove support for the old TRUNCATE logging, so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve limitations. 
== Undo tablespace truncation == MySQL 5.7 implements undo tablespace truncation. It is only possible when innodb_undo_tablespaces is set to at least 2. The logging is implemented similar to the WL#6501 TRUNCATE, that is, using separate log files and a redo log checkpoint. We can simply implement undo tablespace truncation within a single mini-transaction that reinitializes the undo log tablespace file. Unfortunately, due to the redo log format of some operations, currently, the total redo log written by undo tablespace truncation will be more than the combined size of the truncated undo tablespace. It should be acceptable to have a little more than 1 megabyte of log in a single mini-transaction. This will be fixed in MDEV-17138 in MariaDB Server 10.4. recv_sys_t: Add truncated_undo_spaces[] to remember for which undo tablespaces a MLOG_FILE_CREATE2 record was seen. namespace undo: Remove some unnecessary declarations. fil_space_t::is_being_truncated: Document that this flag now only applies to undo tablespaces. Remove some references. fil_space_t::is_stopping(): Do not refer to is_being_truncated. This check is for tablespaces of tables. Potentially used tablespaces are never truncated any more. buf_dblwr_process(): Suppress the out-of-bounds warning for undo tablespaces. fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero page number (new size of the tablespace in pages) to inform crash recovery that the undo tablespace size has been reduced. fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2 can be written for undo tablespaces (without .ibd file suffix) for a nonzero page number. os_file_truncate(): Add the parameter allow_shrink=false so that undo tablespaces can actually be shrunk using this function. fil_name_parse(): For undo tablespace truncation, buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[]. recv_read_in_area(): Avoid reading pages for which no redo log records remain buffered, after recv_addr_trim() removed them. trx_rseg_header_create(): Add a FIXME comment that we could write much less redo log. trx_undo_truncate_tablespace(): Reinitialize the undo tablespace in a single mini-transaction, which will be flushed to the redo log before the file size is trimmed. recv_addr_trim(): Discard any redo logs for pages that were logged after the new end of a file, before the truncation LSN. If the rec_list becomes empty, reduce n_addrs. After removing any affected records, actually truncate the file. recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before applying any log records. The undo tablespace files must be open at this point. buf_flush_or_remove_pages(), buf_flush_dirty_pages(), buf_LRU_flush_or_remove_pages(): Add a parameter for specifying the number of the first page to flush or remove (default 0). trx_purge_initiate_truncate(): Remove the log checkpoints, the extra logging, and some unnecessary crash points. Merge the code from trx_undo_truncate_tablespace(). First, flush all to-be-discarded pages (beyond the new end of the file), then trim the space->size to make the page allocation deterministic. At the only remaining crash injection point, flush the redo log, so that the recovery can be tested.
7 years ago
MDEV-13564 Mariabackup does not work with TRUNCATE Implement undo tablespace truncation via normal redo logging. Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name, CREATE, and DROP. Note: Orphan #sql-ib*.ibd may be left behind if MariaDB Server 10.2 is killed before the DROP operation is committed. If MariaDB Server 10.2 is killed during TRUNCATE, it is also possible that the old table was renamed to #sql-ib*.ibd but the data dictionary will refer to the table using the original name. In MariaDB Server 10.3, RENAME inside InnoDB is transactional, and #sql-* tables will be dropped on startup. So, this new TRUNCATE will be fully crash-safe in 10.3. ha_mroonga::wrapper_truncate(): Pass table options to the underlying storage engine, now that ha_innobase::truncate() will need them. rpl_slave_state::truncate_state_table(): Before truncating mysql.gtid_slave_pos, evict any cached table handles from the table definition cache, so that there will be no stale references to the old table after truncating. == TRUNCATE TABLE == WL#6501 in MySQL 5.7 introduced separate log files for implementing atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB undo and redo log. Some convoluted logic was added to the InnoDB crash recovery, and some extra synchronization (including a redo log checkpoint) was introduced to make this work. This synchronization has caused performance problems and race conditions, and the extra log files cannot be copied or applied by external backup programs. In order to support crash-upgrade from MariaDB 10.2, we will keep the logic for parsing and applying the extra log files, but we will no longer generate those files in TRUNCATE TABLE. A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE (with full redo and undo logging and proper rollback). This will be implemented in MDEV-14717. ha_innobase::truncate(): Invoke RENAME, create(), delete_table(). Because RENAME cannot be fully rolled back before MariaDB 10.3 due to missing undo logging, add some explicit rename-back in case the operation fails. ha_innobase::delete(): Introduce a variant that takes sqlcom as a parameter. In TRUNCATE TABLE, we do not want to touch any FOREIGN KEY constraints. ha_innobase::create(): Add the parameters file_per_table, trx. In TRUNCATE, the new table must be created in the same transaction that renames the old table. create_table_info_t::create_table_info_t(): Add the parameters file_per_table, trx. row_drop_table_for_mysql(): Replace a bool parameter with sqlcom. row_drop_table_after_create_fail(): New function, wrapping row_drop_table_for_mysql(). dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(), fil_prepare_for_truncate(), fil_reinit_space_header_for_table(), row_truncate_table_for_mysql(), TruncateLogger, row_truncate_prepare(), row_truncate_rollback(), row_truncate_complete(), row_truncate_fts(), row_truncate_update_system_tables(), row_truncate_foreign_key_checks(), row_truncate_sanity_checks(): Remove. row_upd_check_references_constraints(): Remove a check for TRUNCATE, now that the table is no longer truncated in place. The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some race-condition like scenarios. The test innodb-innodb.truncate does not use any synchronization. We add a redo log subformat to indicate backup-friendly format. MariaDB 10.4 will remove support for the old TRUNCATE logging, so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve limitations. 
== Undo tablespace truncation == MySQL 5.7 implements undo tablespace truncation. It is only possible when innodb_undo_tablespaces is set to at least 2. The logging is implemented similar to the WL#6501 TRUNCATE, that is, using separate log files and a redo log checkpoint. We can simply implement undo tablespace truncation within a single mini-transaction that reinitializes the undo log tablespace file. Unfortunately, due to the redo log format of some operations, currently, the total redo log written by undo tablespace truncation will be more than the combined size of the truncated undo tablespace. It should be acceptable to have a little more than 1 megabyte of log in a single mini-transaction. This will be fixed in MDEV-17138 in MariaDB Server 10.4. recv_sys_t: Add truncated_undo_spaces[] to remember for which undo tablespaces a MLOG_FILE_CREATE2 record was seen. namespace undo: Remove some unnecessary declarations. fil_space_t::is_being_truncated: Document that this flag now only applies to undo tablespaces. Remove some references. fil_space_t::is_stopping(): Do not refer to is_being_truncated. This check is for tablespaces of tables. Potentially used tablespaces are never truncated any more. buf_dblwr_process(): Suppress the out-of-bounds warning for undo tablespaces. fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero page number (new size of the tablespace in pages) to inform crash recovery that the undo tablespace size has been reduced. fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2 can be written for undo tablespaces (without .ibd file suffix) for a nonzero page number. os_file_truncate(): Add the parameter allow_shrink=false so that undo tablespaces can actually be shrunk using this function. fil_name_parse(): For undo tablespace truncation, buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[]. recv_read_in_area(): Avoid reading pages for which no redo log records remain buffered, after recv_addr_trim() removed them. trx_rseg_header_create(): Add a FIXME comment that we could write much less redo log. trx_undo_truncate_tablespace(): Reinitialize the undo tablespace in a single mini-transaction, which will be flushed to the redo log before the file size is trimmed. recv_addr_trim(): Discard any redo logs for pages that were logged after the new end of a file, before the truncation LSN. If the rec_list becomes empty, reduce n_addrs. After removing any affected records, actually truncate the file. recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before applying any log records. The undo tablespace files must be open at this point. buf_flush_or_remove_pages(), buf_flush_dirty_pages(), buf_LRU_flush_or_remove_pages(): Add a parameter for specifying the number of the first page to flush or remove (default 0). trx_purge_initiate_truncate(): Remove the log checkpoints, the extra logging, and some unnecessary crash points. Merge the code from trx_undo_truncate_tablespace(). First, flush all to-be-discarded pages (beyond the new end of the file), then trim the space->size to make the page allocation deterministic. At the only remaining crash injection point, flush the redo log, so that the recovery can be tested.
7 years ago
MDEV-13564 Mariabackup does not work with TRUNCATE Implement undo tablespace truncation via normal redo logging. Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name, CREATE, and DROP. Note: Orphan #sql-ib*.ibd may be left behind if MariaDB Server 10.2 is killed before the DROP operation is committed. If MariaDB Server 10.2 is killed during TRUNCATE, it is also possible that the old table was renamed to #sql-ib*.ibd but the data dictionary will refer to the table using the original name. In MariaDB Server 10.3, RENAME inside InnoDB is transactional, and #sql-* tables will be dropped on startup. So, this new TRUNCATE will be fully crash-safe in 10.3. ha_mroonga::wrapper_truncate(): Pass table options to the underlying storage engine, now that ha_innobase::truncate() will need them. rpl_slave_state::truncate_state_table(): Before truncating mysql.gtid_slave_pos, evict any cached table handles from the table definition cache, so that there will be no stale references to the old table after truncating. == TRUNCATE TABLE == WL#6501 in MySQL 5.7 introduced separate log files for implementing atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB undo and redo log. Some convoluted logic was added to the InnoDB crash recovery, and some extra synchronization (including a redo log checkpoint) was introduced to make this work. This synchronization has caused performance problems and race conditions, and the extra log files cannot be copied or applied by external backup programs. In order to support crash-upgrade from MariaDB 10.2, we will keep the logic for parsing and applying the extra log files, but we will no longer generate those files in TRUNCATE TABLE. A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE (with full redo and undo logging and proper rollback). This will be implemented in MDEV-14717. ha_innobase::truncate(): Invoke RENAME, create(), delete_table(). Because RENAME cannot be fully rolled back before MariaDB 10.3 due to missing undo logging, add some explicit rename-back in case the operation fails. ha_innobase::delete(): Introduce a variant that takes sqlcom as a parameter. In TRUNCATE TABLE, we do not want to touch any FOREIGN KEY constraints. ha_innobase::create(): Add the parameters file_per_table, trx. In TRUNCATE, the new table must be created in the same transaction that renames the old table. create_table_info_t::create_table_info_t(): Add the parameters file_per_table, trx. row_drop_table_for_mysql(): Replace a bool parameter with sqlcom. row_drop_table_after_create_fail(): New function, wrapping row_drop_table_for_mysql(). dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(), fil_prepare_for_truncate(), fil_reinit_space_header_for_table(), row_truncate_table_for_mysql(), TruncateLogger, row_truncate_prepare(), row_truncate_rollback(), row_truncate_complete(), row_truncate_fts(), row_truncate_update_system_tables(), row_truncate_foreign_key_checks(), row_truncate_sanity_checks(): Remove. row_upd_check_references_constraints(): Remove a check for TRUNCATE, now that the table is no longer truncated in place. The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some race-condition like scenarios. The test innodb-innodb.truncate does not use any synchronization. We add a redo log subformat to indicate backup-friendly format. MariaDB 10.4 will remove support for the old TRUNCATE logging, so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve limitations. 
== Undo tablespace truncation == MySQL 5.7 implements undo tablespace truncation. It is only possible when innodb_undo_tablespaces is set to at least 2. The logging is implemented similar to the WL#6501 TRUNCATE, that is, using separate log files and a redo log checkpoint. We can simply implement undo tablespace truncation within a single mini-transaction that reinitializes the undo log tablespace file. Unfortunately, due to the redo log format of some operations, currently, the total redo log written by undo tablespace truncation will be more than the combined size of the truncated undo tablespace. It should be acceptable to have a little more than 1 megabyte of log in a single mini-transaction. This will be fixed in MDEV-17138 in MariaDB Server 10.4. recv_sys_t: Add truncated_undo_spaces[] to remember for which undo tablespaces a MLOG_FILE_CREATE2 record was seen. namespace undo: Remove some unnecessary declarations. fil_space_t::is_being_truncated: Document that this flag now only applies to undo tablespaces. Remove some references. fil_space_t::is_stopping(): Do not refer to is_being_truncated. This check is for tablespaces of tables. Potentially used tablespaces are never truncated any more. buf_dblwr_process(): Suppress the out-of-bounds warning for undo tablespaces. fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero page number (new size of the tablespace in pages) to inform crash recovery that the undo tablespace size has been reduced. fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2 can be written for undo tablespaces (without .ibd file suffix) for a nonzero page number. os_file_truncate(): Add the parameter allow_shrink=false so that undo tablespaces can actually be shrunk using this function. fil_name_parse(): For undo tablespace truncation, buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[]. recv_read_in_area(): Avoid reading pages for which no redo log records remain buffered, after recv_addr_trim() removed them. trx_rseg_header_create(): Add a FIXME comment that we could write much less redo log. trx_undo_truncate_tablespace(): Reinitialize the undo tablespace in a single mini-transaction, which will be flushed to the redo log before the file size is trimmed. recv_addr_trim(): Discard any redo logs for pages that were logged after the new end of a file, before the truncation LSN. If the rec_list becomes empty, reduce n_addrs. After removing any affected records, actually truncate the file. recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before applying any log records. The undo tablespace files must be open at this point. buf_flush_or_remove_pages(), buf_flush_dirty_pages(), buf_LRU_flush_or_remove_pages(): Add a parameter for specifying the number of the first page to flush or remove (default 0). trx_purge_initiate_truncate(): Remove the log checkpoints, the extra logging, and some unnecessary crash points. Merge the code from trx_undo_truncate_tablespace(). First, flush all to-be-discarded pages (beyond the new end of the file), then trim the space->size to make the page allocation deterministic. At the only remaining crash injection point, flush the redo log, so that the recovery can be tested.
7 years ago
MDEV-13564 Mariabackup does not work with TRUNCATE Implement undo tablespace truncation via normal redo logging. Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name, CREATE, and DROP. Note: Orphan #sql-ib*.ibd may be left behind if MariaDB Server 10.2 is killed before the DROP operation is committed. If MariaDB Server 10.2 is killed during TRUNCATE, it is also possible that the old table was renamed to #sql-ib*.ibd but the data dictionary will refer to the table using the original name. In MariaDB Server 10.3, RENAME inside InnoDB is transactional, and #sql-* tables will be dropped on startup. So, this new TRUNCATE will be fully crash-safe in 10.3. ha_mroonga::wrapper_truncate(): Pass table options to the underlying storage engine, now that ha_innobase::truncate() will need them. rpl_slave_state::truncate_state_table(): Before truncating mysql.gtid_slave_pos, evict any cached table handles from the table definition cache, so that there will be no stale references to the old table after truncating. == TRUNCATE TABLE == WL#6501 in MySQL 5.7 introduced separate log files for implementing atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB undo and redo log. Some convoluted logic was added to the InnoDB crash recovery, and some extra synchronization (including a redo log checkpoint) was introduced to make this work. This synchronization has caused performance problems and race conditions, and the extra log files cannot be copied or applied by external backup programs. In order to support crash-upgrade from MariaDB 10.2, we will keep the logic for parsing and applying the extra log files, but we will no longer generate those files in TRUNCATE TABLE. A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE (with full redo and undo logging and proper rollback). This will be implemented in MDEV-14717. ha_innobase::truncate(): Invoke RENAME, create(), delete_table(). Because RENAME cannot be fully rolled back before MariaDB 10.3 due to missing undo logging, add some explicit rename-back in case the operation fails. ha_innobase::delete(): Introduce a variant that takes sqlcom as a parameter. In TRUNCATE TABLE, we do not want to touch any FOREIGN KEY constraints. ha_innobase::create(): Add the parameters file_per_table, trx. In TRUNCATE, the new table must be created in the same transaction that renames the old table. create_table_info_t::create_table_info_t(): Add the parameters file_per_table, trx. row_drop_table_for_mysql(): Replace a bool parameter with sqlcom. row_drop_table_after_create_fail(): New function, wrapping row_drop_table_for_mysql(). dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(), fil_prepare_for_truncate(), fil_reinit_space_header_for_table(), row_truncate_table_for_mysql(), TruncateLogger, row_truncate_prepare(), row_truncate_rollback(), row_truncate_complete(), row_truncate_fts(), row_truncate_update_system_tables(), row_truncate_foreign_key_checks(), row_truncate_sanity_checks(): Remove. row_upd_check_references_constraints(): Remove a check for TRUNCATE, now that the table is no longer truncated in place. The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some race-condition like scenarios. The test innodb-innodb.truncate does not use any synchronization. We add a redo log subformat to indicate backup-friendly format. MariaDB 10.4 will remove support for the old TRUNCATE logging, so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve limitations. 
== Undo tablespace truncation == MySQL 5.7 implements undo tablespace truncation. It is only possible when innodb_undo_tablespaces is set to at least 2. The logging is implemented similar to the WL#6501 TRUNCATE, that is, using separate log files and a redo log checkpoint. We can simply implement undo tablespace truncation within a single mini-transaction that reinitializes the undo log tablespace file. Unfortunately, due to the redo log format of some operations, currently, the total redo log written by undo tablespace truncation will be more than the combined size of the truncated undo tablespace. It should be acceptable to have a little more than 1 megabyte of log in a single mini-transaction. This will be fixed in MDEV-17138 in MariaDB Server 10.4. recv_sys_t: Add truncated_undo_spaces[] to remember for which undo tablespaces a MLOG_FILE_CREATE2 record was seen. namespace undo: Remove some unnecessary declarations. fil_space_t::is_being_truncated: Document that this flag now only applies to undo tablespaces. Remove some references. fil_space_t::is_stopping(): Do not refer to is_being_truncated. This check is for tablespaces of tables. Potentially used tablespaces are never truncated any more. buf_dblwr_process(): Suppress the out-of-bounds warning for undo tablespaces. fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero page number (new size of the tablespace in pages) to inform crash recovery that the undo tablespace size has been reduced. fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2 can be written for undo tablespaces (without .ibd file suffix) for a nonzero page number. os_file_truncate(): Add the parameter allow_shrink=false so that undo tablespaces can actually be shrunk using this function. fil_name_parse(): For undo tablespace truncation, buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[]. recv_read_in_area(): Avoid reading pages for which no redo log records remain buffered, after recv_addr_trim() removed them. trx_rseg_header_create(): Add a FIXME comment that we could write much less redo log. trx_undo_truncate_tablespace(): Reinitialize the undo tablespace in a single mini-transaction, which will be flushed to the redo log before the file size is trimmed. recv_addr_trim(): Discard any redo logs for pages that were logged after the new end of a file, before the truncation LSN. If the rec_list becomes empty, reduce n_addrs. After removing any affected records, actually truncate the file. recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before applying any log records. The undo tablespace files must be open at this point. buf_flush_or_remove_pages(), buf_flush_dirty_pages(), buf_LRU_flush_or_remove_pages(): Add a parameter for specifying the number of the first page to flush or remove (default 0). trx_purge_initiate_truncate(): Remove the log checkpoints, the extra logging, and some unnecessary crash points. Merge the code from trx_undo_truncate_tablespace(). First, flush all to-be-discarded pages (beyond the new end of the file), then trim the space->size to make the page allocation deterministic. At the only remaining crash injection point, flush the redo log, so that the recovery can be tested.
MDEV-13328 ALTER TABLE…DISCARD TABLESPACE takes a lot of time

With a big buffer pool that contains many data pages, DISCARD TABLESPACE took a long time, because it would scan the entire buffer pool to remove any pages that belong to the tablespace, even when the table to discard is empty.

The minimum amount of work that DISCARD TABLESPACE must do is to remove the pages of the to-be-discarded table from buf_pool->flush_list, because any writes to the data file must be prevented before the file is deleted. If DISCARD TABLESPACE does not evict the pages from the buffer pool, then IMPORT TABLESPACE must do it, because we must prevent pre-DISCARD, not-yet-evicted pages from being mistaken for pages of the imported tablespace.

It would not be a useful fix to simply move the buffer pool scan to the IMPORT TABLESPACE step. What we can do is to actively evict those pages that could be mistaken for imported pages. In this way, when importing a small table into a big buffer pool, the import should still run relatively fast.

Import bypasses the buffer pool when reading pages for the adjustment phase. In the adjustment phase, if a page exists in the buffer pool, we could replace it with the page from the imported file. Unfortunately I did not get this to work properly, so instead we will simply evict any matching page from the buffer pool; a sketch of this follows below.

buf_page_get_gen(): Implement BUF_EVICT_IF_IN_POOL, a new mode where the requested page will be evicted if it is found. There must be no unwritten changes for the page.

buf_remove_t: Remove. Instead, use trx!=NULL to signify that a write to file is desired, and use a separate parameter bool drop_ahi.

buf_LRU_flush_or_remove_pages(), fil_delete_tablespace(): Replace buf_remove_t.

buf_LRU_remove_pages(), buf_LRU_remove_all_pages(): Remove.

PageConverter::m_mtr: A dummy mini-transaction buffer.

PageConverter::PageConverter(): Complete the member initialization list.

PageConverter::operator()(): Evict any 'shadow' pages from the buffer pool so that pre-existing (garbage) pages cannot be mistaken for pages that exist in the being-imported file.

row_discard_tablespace(): Remove a bogus comment that seems to refer to IMPORT TABLESPACE, not DISCARD TABLESPACE.
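A hedged sketch of the eviction idea on the import side: the adjustment phase asks the buffer pool to evict any stale copy of each page it is about to rewrite. The call below abbreviates buf_page_get_gen(), whose real parameter list is longer; treat it as pseudocode for the new BUF_EVICT_IF_IN_POOL mode.

/* Evict a possible 'shadow' copy of the page from the buffer pool;
this is a no-op if the page is not resident. No latch is taken and
no read is issued, so an absent page costs only a hash lookup. */
buf_page_get_gen(page_id_t(space_id, page_no), univ_page_size,
                 RW_NO_LATCH, NULL /* no guess */,
                 BUF_EVICT_IF_IN_POOL, __FILE__, __LINE__, &mtr);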
MDEV-12674 Innodb_row_lock_current_waits has overflow

There is a race condition related to the variable srv_stats.n_lock_wait_current_count, which is only incremented and decremented by the function lock_wait_suspend_thread(). The incrementing is protected by lock_sys->wait_mutex, but the decrementing does not appear to be protected by anything. This mismatch could allow the counter to be corrupted when a transactional InnoDB table or record lock wait is terminating at roughly the same time as the start of a wait on a (possibly different) lock.

ib_counter_t: Remove some unused methods. Prevent instantiation for N=1. Add an inc() method that takes a slot index as a parameter.

single_indexer_t: Remove.

simple_counter<typename Type, bool atomic=false>: A new counter wrapper. Optionally use atomic memory operations for modifying the counter. Aligned to the cache line size. A sketch of its shape follows below.

lsn_ctr_1_t, ulint_ctr_1_t, int64_ctr_1_t: Define as simple_counter<Type>. These counters are either only incremented (and we do not care about losing some increment operations), or the increment/decrement operations are protected by some mutex.

srv_stats_t::os_log_pending_writes: Document that the number is protected by log_sys->mutex.

srv_stats_t::n_lock_wait_current_count: Use simple_counter<ulint, true>, that is, atomic inc() and dec() operations.

lock_wait_suspend_thread(): Release the mutexes before incrementing the counters. Avoid acquiring the lock mutex if the lock wait has already been resolved. Atomically increment and decrement srv_stats.n_lock_wait_current_count.

row_insert_for_mysql(), row_update_for_mysql(), row_update_cascade_for_mysql(): Use the inc() method with trx->id as the slot index. This is a non-functional change, just using inc() instead of add(1).

buf_LRU_get_free_block(): Replace the method add(index, n) with inc(). There is no slot index in the simple_counter.
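The counter wrapper can be pictured roughly as below. This is a simplified sketch, assuming a CACHE_LINE_SIZE constant and a generic atomic_add() primitive (both stand-ins; the real class and its atomic operations live in the InnoDB headers and may differ in detail).

/* Sketch of simple_counter<Type, atomic>: one cache-line-aligned
counter, optionally updated with atomic memory operations. */
template <typename Type, bool atomic = false>
struct alignas(CACHE_LINE_SIZE) simple_counter {
        Type    m_counter;

        Type    inc() { return(add(1)); }
        Type    dec() { return(add(Type(~0))); }        /* i.e. add(-1) */

        Type    add(Type i)
        {
                if (atomic) {
                        /* stand-in for a my_atomic-style fetch-add */
                        return(atomic_add(&m_counter, i));
                } else {
                        return(m_counter += i);
                }
        }
};

srv_stats.n_lock_wait_current_count then instantiates the atomic variant, simple_counter<ulint, true>, so that the previously unprotected decrement in lock_wait_suspend_thread() can no longer corrupt the value.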
MDEV-19614 SET GLOBAL innodb_ deadlock due to LOCK_global_system_variables

The update callback functions for several settable global InnoDB variables are acquiring InnoDB latches while holding LOCK_global_system_variables.

On the other hand, some InnoDB code invokes THDVAR() while holding InnoDB latches. An example of this is thd_lock_wait_timeout(), which is called by lock_rec_enqueue_waiting(). In some cases, the intern_sys_var_ptr() that is invoked by THDVAR() may acquire LOCK_global_system_variables, via sync_dynamic_session_variables(). In lock_rec_enqueue_waiting(), we really must be holding some InnoDB latch while invoking THDVAR(). This implies that LOCK_global_system_variables must conceptually reside below any InnoDB latch in the latching order. That in turn implies that the various update callback functions must release LOCK_global_system_variables before acquiring any InnoDB mutexes or rw-locks, and reacquire LOCK_global_system_variables later. The validate functions are invoked while not holding LOCK_global_system_variables and thus do not need any changes. A sketch of the resulting callback pattern follows below.

The following statements are affected by this:

SET GLOBAL innodb_adaptive_hash_index = …;
SET GLOBAL innodb_cmp_per_index_enabled = 1;
SET GLOBAL innodb_old_blocks_pct = …;
SET GLOBAL innodb_fil_make_page_dirty_debug = …; -- debug builds only
SET GLOBAL innodb_buffer_pool_evict = uncompressed; -- debug builds only
SET GLOBAL innodb_purge_run_now = 1; -- debug builds only
SET GLOBAL innodb_purge_stop_now = 1; -- debug builds only
SET GLOBAL innodb_log_checkpoint_now = 1; -- debug builds only
SET GLOBAL innodb_buf_flush_list_now = 1; -- debug builds only
SET GLOBAL innodb_buffer_pool_dump_now = 1;
SET GLOBAL innodb_buffer_pool_load_now = 1;
SET GLOBAL innodb_buffer_pool_load_abort = 1;
SET GLOBAL innodb_status_output = …;
SET GLOBAL innodb_status_output_locks = …;
SET GLOBAL innodb_encryption_threads = …;
SET GLOBAL innodb_encryption_rotate_key_age = …;
SET GLOBAL innodb_encryption_rotation_iops = …;
SET GLOBAL innodb_encrypt_tables = …;
SET GLOBAL innodb_disallow_writes = …;

buf_LRU_old_ratio_update(): Correct the return type.
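The resulting pattern in each affected update callback looks roughly like this. It is a sketch: innodb_example_update() and its ulong variable are hypothetical stand-ins, while LOCK_global_system_variables and the mysql_mutex_* calls are the real server primitives.

/* Update callback: release LOCK_global_system_variables before taking
any InnoDB latch, and reacquire it before returning, so that the
latching order (LOCK_global_system_variables below all InnoDB latches)
is respected. */
static void innodb_example_update(THD*, st_mysql_sys_var*,
                                  void* var_ptr, const void* save)
{
        const ulong     new_value = *static_cast<const ulong*>(save);

        mysql_mutex_unlock(&LOCK_global_system_variables);

        /* ... acquire InnoDB mutexes/rw-locks and apply new_value ... */

        mysql_mutex_lock(&LOCK_global_system_variables);

        *static_cast<ulong*>(var_ptr) = new_value;
}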
/*****************************************************************************

Copyright (c) 1995, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2017, 2019, MariaDB Corporation.

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA

*****************************************************************************/

/**************************************************//**
@file buf/buf0lru.cc
The database buffer replacement algorithm

Created 11/5/1995 Heikki Tuuri
*******************************************************/

#include "buf0lru.h"
#include "ut0byte.h"
#include "ut0rnd.h"
#include "sync0rw.h"
#include "hash0hash.h"
#include "os0event.h"
#include "fil0fil.h"
#include "btr0btr.h"
#include "buf0buddy.h"
#include "buf0buf.h"
#include "buf0dblwr.h"
#include "buf0flu.h"
#include "buf0rea.h"
#include "btr0sea.h"
#include "ibuf0ibuf.h"
#include "os0file.h"
#include "page0zip.h"
#include "log0recv.h"
#include "srv0srv.h"
#include "srv0mon.h"

/** The number of blocks from the LRU_old pointer onward, including
the block pointed to, must be buf_pool->LRU_old_ratio/BUF_LRU_OLD_RATIO_DIV
of the whole LRU list length, except that the tolerance defined below
is allowed. Note that the tolerance must be small enough such that for
even the BUF_LRU_OLD_MIN_LEN long LRU list, the LRU_old pointer is not
allowed to point to either end of the LRU list. */
static const ulint BUF_LRU_OLD_TOLERANCE = 20;
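/* Illustrative arithmetic (not part of the original source): with the
default innodb_old_blocks_pct of 37, LRU_old_ratio/BUF_LRU_OLD_RATIO_DIV
is about 0.37, so in an LRU list of 1000 blocks the old sublist should
hold about 370 blocks; BUF_LRU_OLD_TOLERANCE = 20 lets the actual length
drift between roughly 350 and 390 blocks before buf_LRU_old_adjust_len()
has to move the LRU_old pointer. */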
/** The minimum amount of non-old blocks when the LRU_old list exists
(that is, when there are more than BUF_LRU_OLD_MIN_LEN blocks).
@see buf_LRU_old_adjust_len */
#define BUF_LRU_NON_OLD_MIN_LEN 5

/** When dropping the search hash index entries before deleting an ibd
file, we build a local array of pages belonging to that tablespace
in the buffer pool. Following is the size of that array.
We also release buf_pool->mutex after scanning this many pages of the
flush_list when dropping a table. This is to ensure that other threads
are not blocked for extended period of time when using very large
buffer pools. */
static const ulint BUF_LRU_DROP_SEARCH_SIZE = 1024;
/** We scan this many blocks when looking for a clean page to evict
during LRU eviction. */
static const ulint BUF_LRU_SEARCH_SCAN_THRESHOLD = 100;

/** If we switch on the InnoDB monitor because there are too few available
frames in the buffer pool, we set this to TRUE */
static bool buf_lru_switched_on_innodb_mon = false;
/** True if a diagnostic message about difficulty in finding free blocks
in the buffer pool has already been printed. */
static bool buf_lru_free_blocks_error_printed;

/******************************************************************//**
These statistics are not 'of' LRU but 'for' LRU. We keep count of I/O
and page_zip_decompress() operations. Based on the statistics,
buf_LRU_evict_from_unzip_LRU() decides if we want to evict from
unzip_LRU or the regular LRU. From unzip_LRU, we will only evict the
uncompressed frame (meaning we can evict dirty blocks as well). From
the regular LRU, we will evict the entire block (i.e.: both the
uncompressed and compressed data), which must be clean. */

/* @{ */

/** Number of intervals for which we keep the history of these stats.
Each interval is 1 second, defined by the rate at which
srv_error_monitor_thread() calls buf_LRU_stat_update(). */
static const ulint BUF_LRU_STAT_N_INTERVAL = 50;

/** Coefficient with which we multiply I/O operations to equate them
with page_zip_decompress() operations. */
static const ulint BUF_LRU_IO_TO_UNZIP_FACTOR = 50;

/** Sampled values of buf_LRU_stat_cur.
Not protected by any mutex. Updated by buf_LRU_stat_update(). */
static buf_LRU_stat_t buf_LRU_stat_arr[BUF_LRU_STAT_N_INTERVAL];

/** Cursor to buf_LRU_stat_arr[] that is updated in a round-robin fashion. */
static ulint buf_LRU_stat_arr_ind;

/** Current operation counters. Not protected by any mutex. Cleared
by buf_LRU_stat_update(). */
buf_LRU_stat_t buf_LRU_stat_cur;

/** Running sum of past values of buf_LRU_stat_cur.
Updated by buf_LRU_stat_update(). Not protected by any mutex. */
buf_LRU_stat_t buf_LRU_stat_sum;

/* @} */

/** @name Heuristics for detecting index scan @{ */
/** Move blocks to "new" LRU list only if the first access was at
least this many milliseconds ago. Not protected by any mutex or latch. */
uint buf_LRU_old_threshold_ms;
/* @} */

/******************************************************************//**
Takes a block out of the LRU list and page hash table.
If the block is compressed-only (BUF_BLOCK_ZIP_PAGE),
the object will be freed.

The caller must hold buf_pool->mutex, the buf_page_get_mutex() mutex
and the appropriate hash_lock. This function will release the
buf_page_get_mutex() and the hash_lock.

If a compressed page is freed other compressed pages may be relocated.
@retval true if BUF_BLOCK_FILE_PAGE was removed from page_hash. The
caller needs to free the page to the free list
@retval false if BUF_BLOCK_ZIP_PAGE was removed from page_hash. In
this case the block is already returned to the buddy allocator. */
static MY_ATTRIBUTE((warn_unused_result))
bool
buf_LRU_block_remove_hashed(
/*========================*/
        buf_page_t*     bpage,  /*!< in: block, must contain a file page and
                                be in a state where it can be freed; there
                                may or may not be a hash index to the page */
        bool            zip);   /*!< in: true if should remove also the
                                compressed page of an uncompressed page */

/******************************************************************//**
Puts a file page that has no hash index to the free list. */
static
void
buf_LRU_block_free_hashed_page(
/*===========================*/
        buf_block_t*    block); /*!< in: block, must contain a file page and
                                be in a state where it can be freed */

/******************************************************************//**
Increases LRU size in bytes by the page size (inline function). */
static inline
void
incr_LRU_size_in_bytes(
/*===================*/
        buf_page_t*     bpage,          /*!< in: control block */
        buf_pool_t*     buf_pool)       /*!< in: buffer pool instance */
{
        ut_ad(buf_pool_mutex_own(buf_pool));

        buf_pool->stat.LRU_bytes += bpage->physical_size();

        ut_ad(buf_pool->stat.LRU_bytes <= buf_pool->curr_pool_size);
}

/******************************************************************//**
Determines if the unzip_LRU list should be used for evicting a victim
instead of the general LRU list.
@return TRUE if should use unzip_LRU */
ibool
buf_LRU_evict_from_unzip_LRU(
/*=========================*/
        buf_pool_t*     buf_pool)
{
        ut_ad(buf_pool_mutex_own(buf_pool));

        /* If the unzip_LRU list is empty, we can only use the LRU. */
        if (UT_LIST_GET_LEN(buf_pool->unzip_LRU) == 0) {
                return(FALSE);
        }

        /* If unzip_LRU is at most 10% of the size of the LRU list,
        then use the LRU. This slack allows us to keep hot
        decompressed pages in the buffer pool. */
        if (UT_LIST_GET_LEN(buf_pool->unzip_LRU)
            <= UT_LIST_GET_LEN(buf_pool->LRU) / 10) {
                return(FALSE);
        }

        /* If eviction hasn't started yet, we assume by default
        that a workload is disk bound. */
        if (buf_pool->freed_page_clock == 0) {
                return(TRUE);
        }

        /* Calculate the average over past intervals, and add the values
        of the current interval. */
        ulint   io_avg = buf_LRU_stat_sum.io / BUF_LRU_STAT_N_INTERVAL
                + buf_LRU_stat_cur.io;

        ulint   unzip_avg = buf_LRU_stat_sum.unzip / BUF_LRU_STAT_N_INTERVAL
                + buf_LRU_stat_cur.unzip;

        /* Decide based on our formula. If the load is I/O bound
        (unzip_avg is smaller than the weighted io_avg), evict an
        uncompressed frame from unzip_LRU. Otherwise we assume that
        the load is CPU bound and evict from the regular LRU. */
        return(unzip_avg <= io_avg * BUF_LRU_IO_TO_UNZIP_FACTOR);
}
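/* Worked example (illustrative, not part of the original source): with
BUF_LRU_IO_TO_UNZIP_FACTOR = 50, io_avg = 10 and unzip_avg = 400 give
400 <= 10 * 50, so the load is treated as I/O bound and we evict an
uncompressed frame from unzip_LRU; with unzip_avg = 600 the inequality
fails and we evict a whole clean block from the regular LRU instead. */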
#ifdef BTR_CUR_HASH_ADAPT
/** Attempts to drop page hash index on a batch of pages belonging to a
particular space id.
@param[in]      space_id        space id
@param[in]      arr             array of page_no
@param[in]      count           number of entries in array */
static
void
buf_LRU_drop_page_hash_batch(ulint space_id, const ulint* arr, ulint count)
{
        ut_ad(count <= BUF_LRU_DROP_SEARCH_SIZE);

        for (const ulint* const end = arr + count; arr != end; ) {
                /* While our only caller
                buf_LRU_drop_page_hash_for_tablespace()
                is being executed for DROP TABLE or similar,
                the table cannot be evicted from the buffer pool. */
                btr_search_drop_page_hash_when_freed(
                        page_id_t(space_id, *arr++));
        }
}
/******************************************************************//**
When doing a DROP TABLE/DISCARD TABLESPACE we have to drop all page
hash index entries belonging to that table. This function tries to
do that in batches. Note that this is a 'best effort' attempt and does
not guarantee that ALL hash entries will be removed. */
static
void
buf_LRU_drop_page_hash_for_tablespace(
/*==================================*/
        buf_pool_t*     buf_pool,       /*!< in: buffer pool instance */
        ulint           id)             /*!< in: space id */
{
        ulint*  page_arr = static_cast<ulint*>(ut_malloc_nokey(
                sizeof(ulint) * BUF_LRU_DROP_SEARCH_SIZE));

        ulint   num_entries = 0;

        buf_pool_mutex_enter(buf_pool);

scan_again:
        for (buf_page_t* bpage = UT_LIST_GET_LAST(buf_pool->LRU);
             bpage != NULL;
             /* No op */) {

                buf_page_t*     prev_bpage = UT_LIST_GET_PREV(LRU, bpage);

                ut_a(buf_page_in_file(bpage));

                if (buf_page_get_state(bpage) != BUF_BLOCK_FILE_PAGE
                    || bpage->id.space() != id
                    || bpage->io_fix != BUF_IO_NONE) {
                        /* Compressed pages are never hashed.
                        Skip blocks of other tablespaces.
                        Skip I/O-fixed blocks (to be dealt with later). */
next_page:
                        bpage = prev_bpage;
                        continue;
                }

                buf_block_t*    block = reinterpret_cast<buf_block_t*>(bpage);

                mutex_enter(&block->mutex);

                /* This debug check uses a dirty read that could
                theoretically cause false positives while
                buf_pool_clear_hash_index() is executing.
                (Other conflicting access paths to the adaptive hash
                index should not be possible, because when a
                tablespace is being discarded or dropped, there must
                be no concurrent access to the contained tables.) */
                assert_block_ahi_valid(block);

                bool    skip = bpage->buf_fix_count > 0 || !block->index;

                mutex_exit(&block->mutex);

                if (skip) {
                        /* Skip this block, because there are
                        no adaptive hash index entries
                        pointing to it, or because we cannot
                        drop them due to the buffer-fix. */
                        goto next_page;
                }

                /* Store the page number so that we can drop the hash
                index in a batch later. */
                page_arr[num_entries] = bpage->id.page_no();
                ut_a(num_entries < BUF_LRU_DROP_SEARCH_SIZE);
                ++num_entries;

                if (num_entries < BUF_LRU_DROP_SEARCH_SIZE) {
                        goto next_page;
                }

                /* Array full. We release the buf_pool->mutex to obey
                the latching order. */
                buf_pool_mutex_exit(buf_pool);

                buf_LRU_drop_page_hash_batch(id, page_arr, num_entries);

                num_entries = 0;

                buf_pool_mutex_enter(buf_pool);

                /* Note that we released the buf_pool mutex above
                after reading the prev_bpage during processing of a
                page_hash_batch (i.e.: when the array was full).
                Because prev_bpage could belong to a compressed-only
                block, it may have been relocated, and thus the
                pointer cannot be trusted. Because bpage is of type
                buf_block_t, it is safe to dereference.

                bpage can change in the LRU list. This is OK because
                this function is a 'best effort' to drop as many
                search hash entries as possible and it does not
                guarantee that ALL such entries will be dropped. */

                /* If, however, bpage has been removed from LRU list
                to the free list then we should restart the scan.
                bpage->state is protected by buf_pool mutex. */
                if (bpage != NULL
                    && buf_page_get_state(bpage) != BUF_BLOCK_FILE_PAGE) {
                        goto scan_again;
                }
        }

        buf_pool_mutex_exit(buf_pool);

        /* Drop any remaining batch of search hashed pages. */
        buf_LRU_drop_page_hash_batch(id, page_arr, num_entries);
        ut_free(page_arr);
}

/** Try to drop the adaptive hash index for a tablespace.
@param[in,out]  table   table
@return whether anything was dropped */
bool buf_LRU_drop_page_hash_for_tablespace(dict_table_t* table)
{
        for (dict_index_t* index = dict_table_get_first_index(table);
             index != NULL;
             index = dict_table_get_next_index(index)) {
                if (btr_search_info_get_ref_count(btr_search_get_info(index),
                                                  index)) {
                        goto drop_ahi;
                }
        }

        return false;
drop_ahi:
        ulint   id = table->space_id;
        for (ulint i = 0; i < srv_buf_pool_instances; i++) {
                buf_LRU_drop_page_hash_for_tablespace(buf_pool_from_array(i),
                                                      id);
        }

        return true;
}
#endif /* BTR_CUR_HASH_ADAPT */

/******************************************************************//**
While flushing (or removing dirty) pages from a tablespace we don't
want to hog the CPU and resources. Release the buffer pool and block
mutex and try to force a context switch. Then reacquire the same mutexes.
The current page is "fixed" before the release of the mutexes and then
"unfixed" again once we have reacquired the mutexes. */
static
void
buf_flush_yield(
/*============*/
        buf_pool_t*     buf_pool,       /*!< in/out: buffer pool instance */
        buf_page_t*     bpage)          /*!< in/out: current page */
{
        BPageMutex*     block_mutex;

        ut_ad(buf_pool_mutex_own(buf_pool));
        ut_ad(buf_page_in_file(bpage));

        block_mutex = buf_page_get_mutex(bpage);

        mutex_enter(block_mutex);

        /* "Fix" the block so that the position cannot be
        changed after we release the buffer pool and
        block mutexes. */
        buf_page_set_sticky(bpage);

        /* Now it is safe to release the buf_pool->mutex. */
        buf_pool_mutex_exit(buf_pool);

        mutex_exit(block_mutex);

        /* Try and force a context switch. */
        os_thread_yield();

        buf_pool_mutex_enter(buf_pool);

        mutex_enter(block_mutex);

        /* "Unfix" the block now that we have both the
        buffer pool and block mutex again. */
        buf_page_unset_sticky(bpage);
        mutex_exit(block_mutex);
}

/******************************************************************//**
If we have hogged the resources for too long then release the buffer
pool and flush list mutex and do a thread yield. Set the current page
to "sticky" so that it is not relocated during the yield.
@return true if yielded */
static MY_ATTRIBUTE((warn_unused_result))
bool
buf_flush_try_yield(
/*================*/
        buf_pool_t*     buf_pool,       /*!< in/out: buffer pool instance */
        buf_page_t*     bpage,          /*!< in/out: bpage to remove */
        ulint           processed)      /*!< in: number of pages processed */
{
        /* Every BUF_LRU_DROP_SEARCH_SIZE iterations in the
        loop we release buf_pool->mutex to let other threads
        do their job but only if the block is not IO fixed. This
        ensures that the block stays in its position in the
        flush_list. */

        if (bpage != NULL
            && processed >= BUF_LRU_DROP_SEARCH_SIZE
            && buf_page_get_io_fix(bpage) == BUF_IO_NONE) {

                buf_flush_list_mutex_exit(buf_pool);

                /* Release the buffer pool and block mutex
                to give the other threads a go. */

                buf_flush_yield(buf_pool, bpage);

                buf_flush_list_mutex_enter(buf_pool);

                /* Should not have been removed from the flush
                list during the yield. However, this check is
                not sufficient to catch a remove -> add. */

                ut_ad(bpage->in_flush_list);

                return(true);
        }

        return(false);
}

/******************************************************************//**
Removes a single page from a given tablespace inside a specific
buffer pool instance.
@return true if page was removed. */
static MY_ATTRIBUTE((warn_unused_result))
bool
buf_flush_or_remove_page(
/*=====================*/
        buf_pool_t*     buf_pool,       /*!< in/out: buffer pool instance */
        buf_page_t*     bpage,          /*!< in/out: bpage to remove */
        bool            flush)          /*!< in: flush to disk if true but
                                        don't remove else remove without
                                        flushing to disk */
{
        ut_ad(buf_pool_mutex_own(buf_pool));
        ut_ad(buf_flush_list_mutex_own(buf_pool));

        /* bpage->space and bpage->io_fix are protected by
        buf_pool->mutex and block_mutex. It is safe to check
        them while holding buf_pool->mutex only. */

        if (buf_page_get_io_fix(bpage) != BUF_IO_NONE) {

                /* We cannot remove this page during this scan
                yet; maybe the system is currently reading it
                in, or flushing the modifications to the file */
                return(false);
        }

        BPageMutex*     block_mutex;
        bool            processed = false;

        block_mutex = buf_page_get_mutex(bpage);

        /* We have to release the flush_list_mutex to obey the
        latching order. We are however guaranteed that the page
        will stay in the flush_list and won't be relocated because
        buf_flush_remove() and buf_flush_relocate_on_flush_list()
        need buf_pool->mutex as well. */

        buf_flush_list_mutex_exit(buf_pool);

        mutex_enter(block_mutex);

        ut_ad(bpage->oldest_modification != 0);

        if (!flush) {

                buf_flush_remove(bpage);

                mutex_exit(block_mutex);

                processed = true;

        } else if (buf_flush_ready_for_flush(bpage, BUF_FLUSH_SINGLE_PAGE)) {

                /* The following call will release the buffer pool
                and block mutex. */
                processed = buf_flush_page(
                        buf_pool, bpage, BUF_FLUSH_SINGLE_PAGE, false);

                if (processed) {
                        buf_pool_mutex_enter(buf_pool);
                } else {
                        mutex_exit(block_mutex);
                }
        } else {
                mutex_exit(block_mutex);
        }

        buf_flush_list_mutex_enter(buf_pool);

        ut_ad(!mutex_own(block_mutex));
        ut_ad(buf_pool_mutex_own(buf_pool));

        return(processed);
}

/** Remove all dirty pages belonging to a given tablespace inside a specific
buffer pool instance when we are deleting the data file(s) of that
tablespace. The pages still remain a part of LRU and are evicted from
the list as they age towards the tail of the LRU.
@param[in,out]  buf_pool        buffer pool
@param[in]      id              tablespace identifier
@param[in]      observer        flush observer (to check for interrupt),
                                or NULL if the files should not be written to
@param[in]      first           first page to be flushed or evicted
@return whether all matching dirty pages were removed */
static MY_ATTRIBUTE((warn_unused_result))
bool
buf_flush_or_remove_pages(
        buf_pool_t*     buf_pool,
        ulint           id,
        FlushObserver*  observer,
        ulint           first)
{
        buf_page_t*     prev;
        buf_page_t*     bpage;
        ulint           processed = 0;

        buf_flush_list_mutex_enter(buf_pool);

rescan:
        bool    all_freed = true;

        for (bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
             bpage != NULL;
             bpage = prev) {

                ut_a(buf_page_in_file(bpage));

                /* Save the previous link because once we free the
                page we can't rely on the links. */

                prev = UT_LIST_GET_PREV(list, bpage);

                /* Flush the pages matching space id,
                or pages matching the flush observer. */
                if (observer && observer->is_partial_flush()) {
                        if (observer != bpage->flush_observer) {
                                /* Skip this block. */
                        } else if (!buf_flush_or_remove_page(
                                           buf_pool, bpage,
                                           !observer->is_interrupted())) {
                                all_freed = false;
                        } else if (!observer->is_interrupted()) {
                                /* The processing was successful. And during the
                                processing we have released the buf_pool mutex
                                when calling buf_page_flush(). We cannot trust
                                prev pointer. */
                                goto rescan;
                        }
                } else if (id != bpage->id.space()) {
                        /* Skip this block, because it is for a
                        different tablespace. */
                } else if (bpage->id.page_no() < first) {
                        /* Skip this block, because it is below the limit. */
                } else if (!buf_flush_or_remove_page(
                                   buf_pool, bpage, observer != NULL)) {

                        /* Remove was unsuccessful, we have to try again
                        by scanning the entire list from the end.
                        This also means that we never released the
                        buf_pool mutex. Therefore we can trust the prev
                        pointer.
                        buf_flush_or_remove_page() released the
                        flush list mutex but not the buf_pool mutex.
                        Therefore it is possible that a new page was
                        added to the flush list. For example, in case
                        where we are at the head of the flush list and
                        prev == NULL. That is OK because we have the
                        tablespace quiesced and no new pages for this
                        space-id should enter flush_list. This is
                        because the only callers of this function are
                        DROP TABLE and FLUSH TABLE FOR EXPORT.
                        We know that we'll have to do at least one more
                        scan but we don't break out of loop here and
                        try to do as much work as we can in this
                        iteration. */

                        all_freed = false;
                } else if (observer) {

                        /* The processing was successful. And during the
                        processing we have released the buf_pool mutex
                        when calling buf_page_flush(). We cannot trust
                        prev pointer. */
                        goto rescan;
                }

                ++processed;

                /* Yield if we have hogged the CPU and mutexes for too long. */
                if (buf_flush_try_yield(buf_pool, prev, processed)) {

                        /* Reset the batch size counter if we had to yield. */

                        processed = 0;
                }
  525. /* The check for trx is interrupted is expensive, we want
  526. to check every N iterations. */
  527. if (!processed && observer) {
  528. observer->check_interrupted();
  529. }
  530. }
  531. buf_flush_list_mutex_exit(buf_pool);
  532. return(all_freed);
  533. }
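
/* The backward scan above follows a general pattern for walking a
list whose mutex may be released while processing a node: the prev
pointer saved before processing is only trusted when the buf_pool
mutex was provably held throughout; otherwise the scan restarts from
the tail. A minimal sketch of the invariant (hypothetical names):

	rescan:
	for (node = UT_LIST_GET_LAST(list); node != NULL; node = prev) {
		prev = UT_LIST_GET_PREV(list, node);
		if (process_released_mutex(node)) {
			goto rescan;	// prev may be stale
		}
	}
*/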
/** Remove or flush all the dirty pages that belong to a given tablespace
inside a specific buffer pool instance. The pages will remain in the LRU
list and will be evicted from the LRU list as they age and move towards
the tail of the LRU list.
@param[in,out]	buf_pool	buffer pool
@param[in]	id		tablespace identifier
@param[in]	observer	flush observer,
				or NULL if the files should not be written to
@param[in]	first		first page to be flushed or evicted */
static
void
buf_flush_dirty_pages(
	buf_pool_t*	buf_pool,
	ulint		id,
	FlushObserver*	observer,
	ulint		first)
{
	for (;;) {
		buf_pool_mutex_enter(buf_pool);

		bool	freed = buf_flush_or_remove_pages(buf_pool, id,
							  observer, first);

		buf_pool_mutex_exit(buf_pool);

		ut_ad(buf_flush_validate(buf_pool));

		if (freed) {
			break;
		}

		os_thread_sleep(2000);

		ut_ad(buf_flush_validate(buf_pool));
	}

	ut_ad((observer && observer->is_interrupted())
	      || first
	      || buf_pool_get_dirty_pages_count(buf_pool, id, observer) == 0);
}

/** Empty the flush list for all pages belonging to a tablespace.
@param[in]	id		tablespace identifier
@param[in]	observer	flush observer,
				or NULL if nothing is to be written
@param[in]	first		first page to be flushed or evicted */
void buf_LRU_flush_or_remove_pages(ulint id, FlushObserver* observer,
				   ulint first)
{
	/* Pages in the system tablespace must never be discarded. */
	ut_ad(id || observer);

	for (ulint i = 0; i < srv_buf_pool_instances; i++) {
		buf_flush_dirty_pages(buf_pool_from_array(i), id, observer,
				      first);
	}

	if (observer && !observer->is_interrupted()) {
		/* Ensure that all asynchronous IO is completed. */
		os_aio_wait_until_no_pending_writes();
		fil_flush(id);
	}
}
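
/* Hedged usage sketch for buf_LRU_flush_or_remove_pages(); actual
call sites (e.g. DROP TABLE, FLUSH TABLES ... FOR EXPORT) may pass
different arguments:

	// Discard the dirty pages of a tablespace without writing them:
	buf_LRU_flush_or_remove_pages(space_id, NULL, 0);

	// Flush them instead, tracking progress through an observer:
	buf_LRU_flush_or_remove_pages(space_id, observer, 0);

Note the assertion above: for the system tablespace (id == 0) an
observer is mandatory, because its pages must never be discarded. */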
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
/********************************************************************//**
Insert a compressed block into buf_pool->zip_clean in the LRU order. */
void
buf_LRU_insert_zip_clean(
/*=====================*/
	buf_page_t*	bpage)	/*!< in: pointer to the block in question */
{
	buf_pool_t*	buf_pool = buf_pool_from_bpage(bpage);

	ut_ad(buf_pool_mutex_own(buf_pool));
	ut_ad(buf_page_get_state(bpage) == BUF_BLOCK_ZIP_PAGE);

	/* Find the first successor of bpage in the LRU list
	that is in the zip_clean list. */
	buf_page_t*	b = bpage;

	do {
		b = UT_LIST_GET_NEXT(LRU, b);
	} while (b && buf_page_get_state(b) != BUF_BLOCK_ZIP_PAGE);

	/* Insert bpage before b, i.e., after the predecessor of b. */
	if (b != NULL) {
		b = UT_LIST_GET_PREV(list, b);
	}

	if (b != NULL) {
		UT_LIST_INSERT_AFTER(buf_pool->zip_clean, b, bpage);
	} else {
		UT_LIST_ADD_FIRST(buf_pool->zip_clean, bpage);
	}
}
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */

/******************************************************************//**
Try to free an uncompressed page of a compressed block from the unzip
LRU list. The compressed page is preserved, and it need not be clean.
@return true if freed */
static
bool
buf_LRU_free_from_unzip_LRU_list(
/*=============================*/
	buf_pool_t*	buf_pool,	/*!< in: buffer pool instance */
	bool		scan_all)	/*!< in: scan whole LRU list
					if true, otherwise scan only
					srv_LRU_scan_depth / 2 blocks. */
{
	ut_ad(buf_pool_mutex_own(buf_pool));

	if (!buf_LRU_evict_from_unzip_LRU(buf_pool)) {
		return(false);
	}

	ulint	scanned = 0;
	bool	freed = false;

	for (buf_block_t* block = UT_LIST_GET_LAST(buf_pool->unzip_LRU);
	     block != NULL
	     && !freed
	     && (scan_all || scanned < srv_LRU_scan_depth);
	     ++scanned) {

		buf_block_t*	prev_block;

		prev_block = UT_LIST_GET_PREV(unzip_LRU, block);

		ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
		ut_ad(block->in_unzip_LRU_list);
		ut_ad(block->page.in_LRU_list);

		freed = buf_LRU_free_page(&block->page, false);

		block = prev_block;
	}

	if (scanned) {
		MONITOR_INC_VALUE_CUMULATIVE(
			MONITOR_LRU_UNZIP_SEARCH_SCANNED,
			MONITOR_LRU_UNZIP_SEARCH_SCANNED_NUM_CALL,
			MONITOR_LRU_UNZIP_SEARCH_SCANNED_PER_CALL,
			scanned);
	}

	return(freed);
}
/******************************************************************//**
Try to free a clean page from the common LRU list.
@return true if freed */
static
bool
buf_LRU_free_from_common_LRU_list(
/*==============================*/
	buf_pool_t*	buf_pool,	/*!< in: buffer pool instance */
	bool		scan_all)	/*!< in: scan whole LRU list
					if true, otherwise scan only
					up to BUF_LRU_SEARCH_SCAN_THRESHOLD */
{
	ut_ad(buf_pool_mutex_own(buf_pool));

	ulint	scanned = 0;
	bool	freed = false;

	for (buf_page_t* bpage = buf_pool->lru_scan_itr.start();
	     bpage != NULL
	     && !freed
	     && (scan_all || scanned < BUF_LRU_SEARCH_SCAN_THRESHOLD);
	     ++scanned, bpage = buf_pool->lru_scan_itr.get()) {

		buf_page_t*	prev = UT_LIST_GET_PREV(LRU, bpage);
		BPageMutex*	mutex = buf_page_get_mutex(bpage);

		buf_pool->lru_scan_itr.set(prev);

		mutex_enter(mutex);

		ut_ad(buf_page_in_file(bpage));
		ut_ad(bpage->in_LRU_list);

		unsigned	accessed = buf_page_is_accessed(bpage);

		if (buf_flush_ready_for_replace(bpage)) {
			mutex_exit(mutex);
			freed = buf_LRU_free_page(bpage, true);
		} else {
			mutex_exit(mutex);
		}

		if (freed && !accessed) {
			/* Keep track of pages that are evicted without
			ever being accessed. This gives us a measure of
			the effectiveness of readahead */
			++buf_pool->stat.n_ra_pages_evicted;
		}

		ut_ad(buf_pool_mutex_own(buf_pool));
		ut_ad(!mutex_own(mutex));
	}

	if (scanned) {
		MONITOR_INC_VALUE_CUMULATIVE(
			MONITOR_LRU_SEARCH_SCANNED,
			MONITOR_LRU_SEARCH_SCANNED_NUM_CALL,
			MONITOR_LRU_SEARCH_SCANNED_PER_CALL,
			scanned);
	}

	return(freed);
}

/******************************************************************//**
Try to free a replaceable block.
@return true if found and freed */
bool
buf_LRU_scan_and_free_block(
/*========================*/
	buf_pool_t*	buf_pool,	/*!< in: buffer pool instance */
	bool		scan_all)	/*!< in: scan whole LRU list
					if true, otherwise scan only
					BUF_LRU_SEARCH_SCAN_THRESHOLD
					blocks. */
{
	ut_ad(buf_pool_mutex_own(buf_pool));

	return(buf_LRU_free_from_unzip_LRU_list(buf_pool, scan_all)
	       || buf_LRU_free_from_common_LRU_list(buf_pool, scan_all));
}
/******************************************************************//**
Returns TRUE if less than 25 % of the buffer pool in any instance is
available. This can be used in heuristics to prevent huge transactions
eating up the whole buffer pool for their locks.
@return TRUE if less than 25 % of buffer pool left */
ibool
buf_LRU_buf_pool_running_out(void)
/*==============================*/
{
	ibool	ret = FALSE;

	for (ulint i = 0; i < srv_buf_pool_instances && !ret; i++) {
		buf_pool_t*	buf_pool;

		buf_pool = buf_pool_from_array(i);

		buf_pool_mutex_enter(buf_pool);

		if (!recv_recovery_is_on()
		    && UT_LIST_GET_LEN(buf_pool->free)
		       + UT_LIST_GET_LEN(buf_pool->LRU)
		       < ut_min(buf_pool->curr_size,
				buf_pool->old_size) / 4) {

			ret = TRUE;
		}

		buf_pool_mutex_exit(buf_pool);
	}

	return(ret);
}
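
/* Worked example of the heuristic above: with
buf_pool->curr_size == buf_pool->old_size == 65536 pages (1GiB of
16KiB pages), the instance counts as "running out" as soon as
free list length + LRU length drop below 65536 / 4 == 16384, i.e.
when more than 75% of the blocks are tied up in non-replaceable
objects. The numbers are illustrative only. */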
/******************************************************************//**
Returns a free block from the buf_pool. The block is taken off the
free list. If it is empty, returns NULL.
@return a free control block, or NULL if the buf_block->free list is empty */
buf_block_t*
buf_LRU_get_free_only(
/*==================*/
	buf_pool_t*	buf_pool)
{
	buf_block_t*	block;

	ut_ad(buf_pool_mutex_own(buf_pool));

	block = reinterpret_cast<buf_block_t*>(
		UT_LIST_GET_FIRST(buf_pool->free));

	while (block != NULL) {

		ut_ad(block->page.in_free_list);
		ut_d(block->page.in_free_list = FALSE);
		ut_ad(!block->page.in_flush_list);
		ut_ad(!block->page.in_LRU_list);
		ut_a(!buf_page_in_file(&block->page));

		UT_LIST_REMOVE(buf_pool->free, &block->page);

		if (buf_pool->curr_size >= buf_pool->old_size
		    || UT_LIST_GET_LEN(buf_pool->withdraw)
		       >= buf_pool->withdraw_target
		    || !buf_block_will_withdrawn(buf_pool, block)) {

			/* found valid free block */
			buf_page_mutex_enter(block);
			/* No adaptive hash index entries may point to
			a free block. */
			assert_block_ahi_empty(block);

			buf_block_set_state(block, BUF_BLOCK_READY_FOR_USE);
			UNIV_MEM_ALLOC(block->frame, srv_page_size);

			ut_ad(buf_pool_from_block(block) == buf_pool);

			buf_page_mutex_exit(block);
			break;
		}

		/* This should be withdrawn */
		UT_LIST_ADD_LAST(
			buf_pool->withdraw,
			&block->page);
		ut_d(block->in_withdraw_list = TRUE);

		block = reinterpret_cast<buf_block_t*>(
			UT_LIST_GET_FIRST(buf_pool->free));
	}

	return(block);
}
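
/* Sketch of the withdraw branch above (a reading aid, not new
behaviour): while the buffer pool is being shrunk
(curr_size < old_size) and the withdraw target has not been reached,
a free block whose frame lies in a chunk that is about to be freed
is parked on buf_pool->withdraw instead of being handed out:

	if (buf_block_will_withdrawn(buf_pool, block)) {
		UT_LIST_ADD_LAST(buf_pool->withdraw, &block->page);
		// continue scanning from the head of the free list
	}
*/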
/******************************************************************//**
Checks how much of buf_pool is occupied by non-data objects like
AHI, lock heaps etc. Depending on the size of non-data objects this
function will either assert or issue a warning and switch on the
status monitor. */
static
void
buf_LRU_check_size_of_non_data_objects(
/*===================================*/
	const buf_pool_t*	buf_pool)	/*!< in: buffer pool instance */
{
	ut_ad(buf_pool_mutex_own(buf_pool));

	if (!recv_recovery_is_on()
	    && buf_pool->curr_size == buf_pool->old_size
	    && UT_LIST_GET_LEN(buf_pool->free)
	       + UT_LIST_GET_LEN(buf_pool->LRU) < buf_pool->curr_size / 20) {

		ib::fatal() << "Over 95 percent of the buffer pool is"
			" occupied by lock heaps"
#ifdef BTR_CUR_HASH_ADAPT
			" or the adaptive hash index!"
#endif /* BTR_CUR_HASH_ADAPT */
			" Check that your transactions do not set too many"
			" row locks, or review if"
			" innodb_buffer_pool_size="
			<< (buf_pool->curr_size >> (20U - srv_page_size_shift))
			<< "M could be bigger.";
	} else if (!recv_recovery_is_on()
		   && buf_pool->curr_size == buf_pool->old_size
		   && (UT_LIST_GET_LEN(buf_pool->free)
		       + UT_LIST_GET_LEN(buf_pool->LRU))
		      < buf_pool->curr_size / 3) {

		if (!buf_lru_switched_on_innodb_mon) {

			/* Over 67 % of the buffer pool is occupied by lock
			heaps or the adaptive hash index. This may be a memory
			leak! */
			ib::warn() << "Over 67 percent of the buffer pool is"
				" occupied by lock heaps"
#ifdef BTR_CUR_HASH_ADAPT
				" or the adaptive hash index!"
#endif /* BTR_CUR_HASH_ADAPT */
				" Check that your transactions do not"
				" set too many row locks."
				" innodb_buffer_pool_size="
				<< (buf_pool->curr_size >>
				    (20U - srv_page_size_shift)) << "M."
				" Starting the InnoDB Monitor to print"
				" diagnostics.";

			buf_lru_switched_on_innodb_mon = true;
			srv_print_innodb_monitor = TRUE;
			srv_monitor_timer_schedule_now();
		}

	} else if (buf_lru_switched_on_innodb_mon) {

		/* Switch off the InnoDB Monitor; this is a simple way
		to stop the monitor if the situation becomes less urgent,
		but may also surprise users if the user also switched on the
		monitor! */
		buf_lru_switched_on_innodb_mon = false;
		srv_print_innodb_monitor = FALSE;
	}
}
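
/* Worked example of the MiB conversion used in the messages above:
curr_size >> (20U - srv_page_size_shift) turns a count of pages into
MiB. With the default 16KiB page size, srv_page_size_shift == 14, so
the shift is 20 - 14 == 6, and e.g. 65536 pages >> 6 == 1024, which
prints as innodb_buffer_pool_size=1024M. (Illustrative numbers.) */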
/******************************************************************//**
Returns a free block from the buf_pool. The block is taken off the
free list. If free list is empty, blocks are moved from the end of the
LRU list to the free list.
This function is called from a user thread when it needs a clean
block to read in a page. Note that we only ever get a block from
the free list. Even when we flush a page or find a page in LRU scan
we put it to free list to be used.
* iteration 0:
  * get a block from free list, success:done
  * if buf_pool->try_LRU_scan is set
    * scan LRU up to srv_LRU_scan_depth to find a clean block
    * the above will put the block on free list
    * success:retry the free list
  * flush one dirty page from tail of LRU to disk
    * the above will put the block on free list
    * success: retry the free list
* iteration 1:
  * same as iteration 0 except:
    * scan whole LRU list
    * scan LRU list even if buf_pool->try_LRU_scan is not set
* iteration > 1:
  * same as iteration 1 but sleep 10ms
@return the free control block, in state BUF_BLOCK_READY_FOR_USE */
buf_block_t*
buf_LRU_get_free_block(
/*===================*/
	buf_pool_t*	buf_pool)	/*!< in/out: buffer pool instance */
{
	buf_block_t*	block = NULL;
	bool		freed = false;
	ulint		n_iterations = 0;
	ulint		flush_failures = 0;

	MONITOR_INC(MONITOR_LRU_GET_FREE_SEARCH);
loop:
	buf_pool_mutex_enter(buf_pool);

	buf_LRU_check_size_of_non_data_objects(buf_pool);

	DBUG_EXECUTE_IF("ib_lru_force_no_free_page",
		if (!buf_lru_free_blocks_error_printed) {
			n_iterations = 21;
			goto not_found;});

	/* If there is a block in the free list, take it */
	block = buf_LRU_get_free_only(buf_pool);

	if (block != NULL) {

		buf_pool_mutex_exit(buf_pool);
		ut_ad(buf_pool_from_block(block) == buf_pool);
		memset(&block->page.zip, 0, sizeof block->page.zip);

		block->skip_flush_check = false;
		block->page.flush_observer = NULL;
		return(block);
	}

	MONITOR_INC( MONITOR_LRU_GET_FREE_LOOPS );

	freed = false;

	if (buf_pool->try_LRU_scan || n_iterations > 0) {

		/* If no block was in the free list, search from the
		end of the LRU list and try to free a block there.
		On the first iteration we scan only the tail of the
		LRU list; on later iterations we scan the whole
		LRU list. */

		freed = buf_LRU_scan_and_free_block(
			buf_pool, n_iterations > 0);

		if (!freed && n_iterations == 0) {
			/* Tell other threads that there is no point
			in scanning the LRU list. This flag is set to
			TRUE again when we flush a batch from this
			buffer pool. */
			buf_pool->try_LRU_scan = FALSE;

			/* Also tell the page_cleaner thread that
			there is work for it to do. */
			os_event_set(buf_flush_event);
		}
	}

#ifndef DBUG_OFF
not_found:
#endif

	buf_pool_mutex_exit(buf_pool);

	if (freed) {
		goto loop;
	}

	if (n_iterations > 20 && !buf_lru_free_blocks_error_printed
	    && srv_buf_pool_old_size == srv_buf_pool_size) {

		ib::warn() << "Difficult to find free blocks in the buffer pool"
			" (" << n_iterations << " search iterations)! "
			<< flush_failures << " failed attempts to"
			" flush a page!"
			" Consider increasing innodb_buffer_pool_size."
			" Pending flushes (fsync) log: "
			<< fil_n_pending_log_flushes
			<< "; buffer pool: "
			<< fil_n_pending_tablespace_flushes
			<< ". " << os_n_file_reads << " OS file reads, "
			<< os_n_file_writes << " OS file writes, "
			<< os_n_fsyncs
			<< " OS fsyncs.";

		buf_lru_free_blocks_error_printed = true;
	}

	/* If we have scanned the whole LRU and still are unable to
	find a free block then we should sleep here to let the
	page_cleaner do an LRU batch for us. */

	if (!srv_read_only_mode) {
		os_event_set(buf_flush_event);
	}

	if (n_iterations > 1) {

		MONITOR_INC( MONITOR_LRU_GET_FREE_WAITS );
		os_thread_sleep(10000);
	}

	/* No free block was found: try to flush the LRU list.
	This call will flush one page from the LRU and put it on the
	free list. That means that the free block is up for grabs for
	all user threads.

	TODO: A more elegant way would have been to return the freed
	up block to the caller here but the code that deals with
	removing the block from page_hash and LRU_list is fairly
	involved (particularly in case of compressed pages). We
	can do that in a separate patch sometime in future. */

	if (!buf_flush_single_page_from_LRU(buf_pool)) {
		MONITOR_INC(MONITOR_LRU_SINGLE_FLUSH_FAILURE_COUNT);
		++flush_failures;
	}

	srv_stats.buf_pool_wait_free.inc();

	n_iterations++;

	goto loop;
}
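
/* Minimal caller sketch for buf_LRU_get_free_block() (hypothetical;
real callers such as the page-allocation paths add more
bookkeeping). The function loops internally and never returns NULL:

	buf_block_t*	block = buf_LRU_get_free_block(buf_pool);
	ut_ad(buf_block_get_state(block) == BUF_BLOCK_READY_FOR_USE);
	// ... initialize the frame for the page being read in ...
*/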
/*******************************************************************//**
Moves the LRU_old pointer so that the length of the old blocks list
is inside the allowed limits. */
UNIV_INLINE
void
buf_LRU_old_adjust_len(
/*===================*/
	buf_pool_t*	buf_pool)	/*!< in: buffer pool instance */
{
	ulint	old_len;
	ulint	new_len;

	ut_a(buf_pool->LRU_old);
	ut_ad(buf_pool_mutex_own(buf_pool));
	ut_ad(buf_pool->LRU_old_ratio >= BUF_LRU_OLD_RATIO_MIN);
	ut_ad(buf_pool->LRU_old_ratio <= BUF_LRU_OLD_RATIO_MAX);
	compile_time_assert(BUF_LRU_OLD_RATIO_MIN * BUF_LRU_OLD_MIN_LEN
			    > BUF_LRU_OLD_RATIO_DIV
			    * (BUF_LRU_OLD_TOLERANCE + 5));
	compile_time_assert(BUF_LRU_NON_OLD_MIN_LEN < BUF_LRU_OLD_MIN_LEN);

#ifdef UNIV_LRU_DEBUG
	/* buf_pool->LRU_old must be the first item in the LRU list
	whose "old" flag is set. */
	ut_a(buf_pool->LRU_old->old);
	ut_a(!UT_LIST_GET_PREV(LRU, buf_pool->LRU_old)
	     || !UT_LIST_GET_PREV(LRU, buf_pool->LRU_old)->old);
	ut_a(!UT_LIST_GET_NEXT(LRU, buf_pool->LRU_old)
	     || UT_LIST_GET_NEXT(LRU, buf_pool->LRU_old)->old);
#endif /* UNIV_LRU_DEBUG */

	old_len = buf_pool->LRU_old_len;
	new_len = ut_min(UT_LIST_GET_LEN(buf_pool->LRU)
			 * buf_pool->LRU_old_ratio / BUF_LRU_OLD_RATIO_DIV,
			 UT_LIST_GET_LEN(buf_pool->LRU)
			 - (BUF_LRU_OLD_TOLERANCE
			    + BUF_LRU_NON_OLD_MIN_LEN));

	for (;;) {
		buf_page_t*	LRU_old = buf_pool->LRU_old;

		ut_a(LRU_old);
		ut_ad(LRU_old->in_LRU_list);
#ifdef UNIV_LRU_DEBUG
		ut_a(LRU_old->old);
#endif /* UNIV_LRU_DEBUG */

		/* Update the LRU_old pointer if necessary */

		if (old_len + BUF_LRU_OLD_TOLERANCE < new_len) {

			buf_pool->LRU_old = LRU_old = UT_LIST_GET_PREV(
				LRU, LRU_old);
#ifdef UNIV_LRU_DEBUG
			ut_a(!LRU_old->old);
#endif /* UNIV_LRU_DEBUG */
			old_len = ++buf_pool->LRU_old_len;
			buf_page_set_old(LRU_old, TRUE);

		} else if (old_len > new_len + BUF_LRU_OLD_TOLERANCE) {

			buf_pool->LRU_old = UT_LIST_GET_NEXT(LRU, LRU_old);
			old_len = --buf_pool->LRU_old_len;
			buf_page_set_old(LRU_old, FALSE);
		} else {
			return;
		}
	}
}
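
/* Worked example for the target length computed above, assuming
BUF_LRU_OLD_RATIO_DIV == 1024 and the default
innodb_old_blocks_pct=37 (so LRU_old_ratio is about 378): with an
LRU list of 10000 blocks,

	new_len = min(10000 * 378 / 1024,
		      10000 - (BUF_LRU_OLD_TOLERANCE
			       + BUF_LRU_NON_OLD_MIN_LEN))
		~= 3691 blocks,

and the loop then walks LRU_old one block at a time until
LRU_old_len is within BUF_LRU_OLD_TOLERANCE of that target.
(The constants are assumptions for illustration.) */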
/*******************************************************************//**
Initializes the old blocks pointer in the LRU list. This function should be
called when the LRU list grows to BUF_LRU_OLD_MIN_LEN length. */
static
void
buf_LRU_old_init(
/*=============*/
	buf_pool_t*	buf_pool)
{
	ut_ad(buf_pool_mutex_own(buf_pool));
	ut_a(UT_LIST_GET_LEN(buf_pool->LRU) == BUF_LRU_OLD_MIN_LEN);

	/* We first initialize all blocks in the LRU list as old and then use
	the adjust function to move the LRU_old pointer to the right
	position */

	for (buf_page_t* bpage = UT_LIST_GET_LAST(buf_pool->LRU);
	     bpage != NULL;
	     bpage = UT_LIST_GET_PREV(LRU, bpage)) {

		ut_ad(bpage->in_LRU_list);
		ut_ad(buf_page_in_file(bpage));

		/* This loop temporarily violates the
		assertions of buf_page_set_old(). */
		bpage->old = TRUE;
	}

	buf_pool->LRU_old = UT_LIST_GET_FIRST(buf_pool->LRU);
	buf_pool->LRU_old_len = UT_LIST_GET_LEN(buf_pool->LRU);

	buf_LRU_old_adjust_len(buf_pool);
}

/******************************************************************//**
Remove a block from the unzip_LRU list if it belonged to the list. */
static
void
buf_unzip_LRU_remove_block_if_needed(
/*=================================*/
	buf_page_t*	bpage)	/*!< in/out: control block */
{
	buf_pool_t*	buf_pool = buf_pool_from_bpage(bpage);

	ut_ad(buf_page_in_file(bpage));
	ut_ad(buf_pool_mutex_own(buf_pool));

	if (buf_page_belongs_to_unzip_LRU(bpage)) {
		buf_block_t*	block = reinterpret_cast<buf_block_t*>(bpage);

		ut_ad(block->in_unzip_LRU_list);
		ut_d(block->in_unzip_LRU_list = FALSE);

		UT_LIST_REMOVE(buf_pool->unzip_LRU, block);
	}
}

/******************************************************************//**
Adjust LRU hazard pointers if needed. */
void
buf_LRU_adjust_hp(
/*==============*/
	buf_pool_t*		buf_pool,/*!< in: buffer pool instance */
	const buf_page_t*	bpage)	/*!< in: control block */
{
	buf_pool->lru_hp.adjust(bpage);
	buf_pool->lru_scan_itr.adjust(bpage);
	buf_pool->single_scan_itr.adjust(bpage);
}

/******************************************************************//**
Removes a block from the LRU list. */
UNIV_INLINE
void
buf_LRU_remove_block(
/*=================*/
	buf_page_t*	bpage)	/*!< in: control block */
{
	buf_pool_t*	buf_pool = buf_pool_from_bpage(bpage);

	ut_ad(buf_pool_mutex_own(buf_pool));

	ut_a(buf_page_in_file(bpage));

	ut_ad(bpage->in_LRU_list);

	/* Important that we adjust the hazard pointers before removing
	bpage from the LRU list. */
	buf_LRU_adjust_hp(buf_pool, bpage);

	/* If the LRU_old pointer is defined and points to just this block,
	move it backward one step */

	if (bpage == buf_pool->LRU_old) {

		/* Below: the previous block is guaranteed to exist,
		because the LRU_old pointer is only allowed to differ
		by BUF_LRU_OLD_TOLERANCE from strict
		buf_pool->LRU_old_ratio/BUF_LRU_OLD_RATIO_DIV of the LRU
		list length. */
		buf_page_t*	prev_bpage = UT_LIST_GET_PREV(LRU, bpage);

		ut_a(prev_bpage);
#ifdef UNIV_LRU_DEBUG
		ut_a(!prev_bpage->old);
#endif /* UNIV_LRU_DEBUG */
		buf_pool->LRU_old = prev_bpage;
		buf_page_set_old(prev_bpage, TRUE);

		buf_pool->LRU_old_len++;
	}

	/* Remove the block from the LRU list */
	UT_LIST_REMOVE(buf_pool->LRU, bpage);
	ut_d(bpage->in_LRU_list = FALSE);

	buf_pool->stat.LRU_bytes -= bpage->physical_size();

	buf_unzip_LRU_remove_block_if_needed(bpage);

	/* If the LRU list is so short that LRU_old is not defined,
	clear the "old" flags and return */
	if (UT_LIST_GET_LEN(buf_pool->LRU) < BUF_LRU_OLD_MIN_LEN) {

		for (buf_page_t* bpage = UT_LIST_GET_FIRST(buf_pool->LRU);
		     bpage != NULL;
		     bpage = UT_LIST_GET_NEXT(LRU, bpage)) {

			/* This loop temporarily violates the
			assertions of buf_page_set_old(). */
			bpage->old = FALSE;
		}

		buf_pool->LRU_old = NULL;
		buf_pool->LRU_old_len = 0;

		return;
	}

	ut_ad(buf_pool->LRU_old);

	/* Update the LRU_old_len field if necessary */
	if (buf_page_is_old(bpage)) {

		buf_pool->LRU_old_len--;
	}

	/* Adjust the length of the old block list if necessary */
	buf_LRU_old_adjust_len(buf_pool);
}
/******************************************************************//**
Adds a block to the LRU list of decompressed zip pages. */
void
buf_unzip_LRU_add_block(
/*====================*/
	buf_block_t*	block,	/*!< in: control block */
	ibool		old)	/*!< in: TRUE if should be put to the end
				of the list, else put to the start */
{
	buf_pool_t*	buf_pool = buf_pool_from_block(block);

	ut_ad(buf_pool_mutex_own(buf_pool));

	ut_a(buf_page_belongs_to_unzip_LRU(&block->page));

	ut_ad(!block->in_unzip_LRU_list);
	ut_d(block->in_unzip_LRU_list = TRUE);

	if (old) {
		UT_LIST_ADD_LAST(buf_pool->unzip_LRU, block);
	} else {
		UT_LIST_ADD_FIRST(buf_pool->unzip_LRU, block);
	}
}

/******************************************************************//**
Adds a block to the LRU list. The page_size must already be set when
this function is invoked, so that the correct page size can be read
from the buffer page when adding the block to the LRU. */
UNIV_INLINE
void
buf_LRU_add_block_low(
/*==================*/
	buf_page_t*	bpage,	/*!< in: control block */
	ibool		old)	/*!< in: TRUE if should be put to the old blocks
				in the LRU list, else put to the start; if the
				LRU list is very short, the block is added to
				the start, regardless of this parameter */
{
	buf_pool_t*	buf_pool = buf_pool_from_bpage(bpage);

	ut_ad(buf_pool_mutex_own(buf_pool));

	ut_a(buf_page_in_file(bpage));
	ut_ad(!bpage->in_LRU_list);

	if (!old || (UT_LIST_GET_LEN(buf_pool->LRU) < BUF_LRU_OLD_MIN_LEN)) {

		UT_LIST_ADD_FIRST(buf_pool->LRU, bpage);

		bpage->freed_page_clock = buf_pool->freed_page_clock;
	} else {
#ifdef UNIV_LRU_DEBUG
		/* buf_pool->LRU_old must be the first item in the LRU list
		whose "old" flag is set. */
		ut_a(buf_pool->LRU_old->old);
		ut_a(!UT_LIST_GET_PREV(LRU, buf_pool->LRU_old)
		     || !UT_LIST_GET_PREV(LRU, buf_pool->LRU_old)->old);
		ut_a(!UT_LIST_GET_NEXT(LRU, buf_pool->LRU_old)
		     || UT_LIST_GET_NEXT(LRU, buf_pool->LRU_old)->old);
#endif /* UNIV_LRU_DEBUG */
		UT_LIST_INSERT_AFTER(buf_pool->LRU, buf_pool->LRU_old,
				     bpage);

		buf_pool->LRU_old_len++;
	}

	ut_d(bpage->in_LRU_list = TRUE);

	incr_LRU_size_in_bytes(bpage, buf_pool);

	if (UT_LIST_GET_LEN(buf_pool->LRU) > BUF_LRU_OLD_MIN_LEN) {

		ut_ad(buf_pool->LRU_old);

		/* Adjust the length of the old block list if necessary */

		buf_page_set_old(bpage, old);
		buf_LRU_old_adjust_len(buf_pool);

	} else if (UT_LIST_GET_LEN(buf_pool->LRU) == BUF_LRU_OLD_MIN_LEN) {

		/* The LRU list is now long enough for LRU_old to become
		defined: init it */

		buf_LRU_old_init(buf_pool);
	} else {
		buf_page_set_old(bpage, buf_pool->LRU_old != NULL);
	}

	/* If this is a zipped block with decompressed frame as well
	then put it on the unzip_LRU list */
	if (buf_page_belongs_to_unzip_LRU(bpage)) {
		buf_unzip_LRU_add_block((buf_block_t*) bpage, old);
	}
}

/******************************************************************//**
Adds a block to the LRU list. The page_size must already be set when
this function is invoked, so that the correct page size can be read
from the buffer page when adding the block to the LRU. */
void
buf_LRU_add_block(
/*==============*/
	buf_page_t*	bpage,	/*!< in: control block */
	ibool		old)	/*!< in: TRUE if should be put to the old
				blocks in the LRU list, else put to the start;
				if the LRU list is very short, the block is
				added to the start, regardless of this
				parameter */
{
	buf_LRU_add_block_low(bpage, old);
}
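
/* Illustration of the midpoint insertion implemented above
(hypothetical snippet): a page brought in by read-ahead or a linear
scan is added with old=TRUE, so it enters the LRU at
buf_pool->LRU_old (the "midpoint") and a single table scan can only
churn the old sublist, never the young one:

	buf_LRU_add_block(bpage, TRUE);		// insert at the midpoint
	buf_LRU_add_block(other_bpage, FALSE);	// insert at the LRU head
*/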
/******************************************************************//**
Moves a block to the start of the LRU list. */
void
buf_LRU_make_block_young(
/*=====================*/
	buf_page_t*	bpage)	/*!< in: control block */
{
	buf_pool_t*	buf_pool = buf_pool_from_bpage(bpage);

	ut_ad(buf_pool_mutex_own(buf_pool));

	if (bpage->old) {
		buf_pool->stat.n_pages_made_young++;
	}

	buf_LRU_remove_block(bpage);

	buf_LRU_add_block_low(bpage, FALSE);
}

/******************************************************************//**
Try to free a block. If bpage is a descriptor of a compressed-only
page, the descriptor object will be freed as well.
NOTE: If this function returns true, it will temporarily
release buf_pool->mutex. Furthermore, the page frame will no longer be
accessible via bpage.
The caller must hold buf_pool->mutex and must not hold any
buf_page_get_mutex() when calling this function.
@return true if freed, false otherwise. */
bool
buf_LRU_free_page(
/*===============*/
	buf_page_t*	bpage,	/*!< in: block to be freed */
	bool		zip)	/*!< in: true if should remove also the
				compressed page of an uncompressed page */
{
	buf_page_t*	b = NULL;
	buf_pool_t*	buf_pool = buf_pool_from_bpage(bpage);

	rw_lock_t*	hash_lock = buf_page_hash_lock_get(buf_pool, bpage->id);

	BPageMutex*	block_mutex = buf_page_get_mutex(bpage);

	ut_ad(buf_pool_mutex_own(buf_pool));
	ut_ad(buf_page_in_file(bpage));
	ut_ad(bpage->in_LRU_list);

	rw_lock_x_lock(hash_lock);
	mutex_enter(block_mutex);

	if (!buf_page_can_relocate(bpage)) {

		/* Do not free buffer fixed and I/O-fixed blocks. */
		goto func_exit;
	}

	if (zip || !bpage->zip.data) {
		/* This would completely free the block. */
		/* Do not completely free dirty blocks. */

		if (bpage->oldest_modification) {
			goto func_exit;
		}
	} else if (bpage->oldest_modification > 0
		   && buf_page_get_state(bpage) != BUF_BLOCK_FILE_PAGE) {

		ut_ad(buf_page_get_state(bpage) == BUF_BLOCK_ZIP_DIRTY);

func_exit:
		rw_lock_x_unlock(hash_lock);
		mutex_exit(block_mutex);
		return(false);

	} else if (buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE) {
		b = buf_page_alloc_descriptor();
		ut_a(b);
		new (b) buf_page_t(*bpage);
	}

	ut_ad(buf_pool_mutex_own(buf_pool));
	ut_ad(buf_page_in_file(bpage));
	ut_ad(bpage->in_LRU_list);
	ut_ad(!bpage->in_flush_list == !bpage->oldest_modification);

	DBUG_PRINT("ib_buf", ("free page %u:%u",
			      bpage->id.space(), bpage->id.page_no()));

	ut_ad(rw_lock_own(hash_lock, RW_LOCK_X));
	ut_ad(buf_page_can_relocate(bpage));

	if (!buf_LRU_block_remove_hashed(bpage, zip)) {
		return(true);
	}

	/* buf_LRU_block_remove_hashed() releases the hash_lock */
	ut_ad(!rw_lock_own_flagged(hash_lock,
				   RW_LOCK_FLAG_X | RW_LOCK_FLAG_S));

	/* We have just freed a BUF_BLOCK_FILE_PAGE. If b != NULL
	then it was a compressed page with an uncompressed frame and
	we are interested in freeing only the uncompressed frame.
	Therefore we have to reinsert the compressed page descriptor
	into the LRU and page_hash (and possibly flush_list).
	if b == NULL then it was a regular page that has been freed */

	if (b != NULL) {
		buf_page_t*	prev_b	= UT_LIST_GET_PREV(LRU, b);

		rw_lock_x_lock(hash_lock);

		mutex_enter(block_mutex);

		ut_a(!buf_page_hash_get_low(buf_pool, b->id));

		b->state = b->oldest_modification
			? BUF_BLOCK_ZIP_DIRTY
			: BUF_BLOCK_ZIP_PAGE;

		ut_ad(b->zip_size());

		UNIV_MEM_DESC(b->zip.data, b->zip_size());

		/* The fields in_page_hash and in_LRU_list of
		the to-be-freed block descriptor should have
		been cleared in
		buf_LRU_block_remove_hashed(), which
		invokes buf_LRU_remove_block(). */
		ut_ad(!bpage->in_page_hash);
		ut_ad(!bpage->in_LRU_list);

		/* bpage->state was BUF_BLOCK_FILE_PAGE because
		b != NULL. The type cast below is thus valid. */
		ut_ad(!((buf_block_t*) bpage)->in_unzip_LRU_list);

		/* The fields of bpage were copied to b before
		buf_LRU_block_remove_hashed() was invoked. */
		ut_ad(!b->in_zip_hash);
		ut_ad(b->in_page_hash);
		ut_ad(b->in_LRU_list);

		HASH_INSERT(buf_page_t, hash, buf_pool->page_hash,
			    b->id.fold(), b);

		/* Insert b where bpage was in the LRU list. */
		if (prev_b != NULL) {
			ulint	lru_len;

			ut_ad(prev_b->in_LRU_list);
			ut_ad(buf_page_in_file(prev_b));

			UT_LIST_INSERT_AFTER(buf_pool->LRU, prev_b, b);

			incr_LRU_size_in_bytes(b, buf_pool);

			if (buf_page_is_old(b)) {
				buf_pool->LRU_old_len++;
				if (buf_pool->LRU_old
				    == UT_LIST_GET_NEXT(LRU, b)) {

					buf_pool->LRU_old = b;
				}
			}

			lru_len = UT_LIST_GET_LEN(buf_pool->LRU);

			if (lru_len > BUF_LRU_OLD_MIN_LEN) {
				ut_ad(buf_pool->LRU_old);
				/* Adjust the length of the
				old block list if necessary */
				buf_LRU_old_adjust_len(buf_pool);
			} else if (lru_len == BUF_LRU_OLD_MIN_LEN) {
				/* The LRU list is now long
				enough for LRU_old to become
				defined: init it */
				buf_LRU_old_init(buf_pool);
			}
#ifdef UNIV_LRU_DEBUG
			/* Check that the "old" flag is consistent
			in the block and its neighbours. */
			buf_page_set_old(b, buf_page_is_old(b));
#endif /* UNIV_LRU_DEBUG */
		} else {
			ut_d(b->in_LRU_list = FALSE);
			buf_LRU_add_block_low(b, buf_page_is_old(b));
		}

		if (b->state == BUF_BLOCK_ZIP_PAGE) {
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
			buf_LRU_insert_zip_clean(b);
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
		} else {
			/* Relocate on buf_pool->flush_list. */
			buf_flush_relocate_on_flush_list(bpage, b);
		}

		bpage->zip.data = NULL;

		page_zip_set_size(&bpage->zip, 0);

		mutex_exit(block_mutex);

		/* Prevent buf_page_get_gen() from
		decompressing the block while we release
		buf_pool->mutex and block_mutex. */
		block_mutex = buf_page_get_mutex(b);

		mutex_enter(block_mutex);

		buf_page_set_sticky(b);

		mutex_exit(block_mutex);

		rw_lock_x_unlock(hash_lock);
	}

	buf_pool_mutex_exit(buf_pool);

	/* Remove possible adaptive hash index on the page.
	The page was declared uninitialized by
	buf_LRU_block_remove_hashed(). We need to flag
	the contents of the page valid (which it still is) in
	order to avoid bogus Valgrind warnings.*/

	UNIV_MEM_VALID(((buf_block_t*) bpage)->frame,
		       srv_page_size);
	btr_search_drop_page_hash_index((buf_block_t*) bpage);
	UNIV_MEM_INVALID(((buf_block_t*) bpage)->frame,
			 srv_page_size);

	if (b != NULL) {

		/* Compute and stamp the compressed page
		checksum while not holding any mutex. The
		block is already half-freed
		(BUF_BLOCK_REMOVE_HASH) and removed from
		buf_pool->page_hash, thus inaccessible by any
		other thread. */

		ut_ad(b->zip_size());

		const uint32_t	checksum = page_zip_calc_checksum(
			b->zip.data,
			b->zip_size(),
			static_cast<srv_checksum_algorithm_t>(
				srv_checksum_algorithm));

		mach_write_to_4(b->zip.data + FIL_PAGE_SPACE_OR_CHKSUM,
				checksum);
	}

	buf_pool_mutex_enter(buf_pool);

	if (b != NULL) {
		mutex_enter(block_mutex);

		buf_page_unset_sticky(b);

		mutex_exit(block_mutex);
	}

	buf_LRU_block_free_hashed_page((buf_block_t*) bpage);

	return(true);
}
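
/* Summary of the cases buf_LRU_free_page() handles above (a reading
aid, not new behaviour):

	zip==true, or no compressed copy exists:
		the whole block is freed; dirty blocks are refused.
	zip==false and a compressed copy exists (BUF_BLOCK_FILE_PAGE):
		only the uncompressed frame is freed; the descriptor b
		re-enters page_hash and the LRU (and the flush_list if
		dirty) as BUF_BLOCK_ZIP_PAGE or BUF_BLOCK_ZIP_DIRTY,
		and the compressed page checksum is stamped while no
		mutex is held.
	dirty compressed-only page (BUF_BLOCK_ZIP_DIRTY):
		cannot be freed; the function returns false. */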
/******************************************************************//**
Puts a block back to the free list. */
void
buf_LRU_block_free_non_file_page(
/*=============================*/
	buf_block_t*	block)	/*!< in: block, must not contain a file page */
{
	void*		data;
	buf_pool_t*	buf_pool = buf_pool_from_block(block);

	ut_ad(buf_pool_mutex_own(buf_pool));
	ut_ad(buf_page_mutex_own(block));

	switch (buf_block_get_state(block)) {
	case BUF_BLOCK_MEMORY:
	case BUF_BLOCK_READY_FOR_USE:
		break;
	default:
		ut_error;
	}

	assert_block_ahi_empty(block);
	ut_ad(!block->page.in_free_list);
	ut_ad(!block->page.in_flush_list);
	ut_ad(!block->page.in_LRU_list);

	buf_block_set_state(block, BUF_BLOCK_NOT_USED);

	UNIV_MEM_ALLOC(block->frame, srv_page_size);
#ifdef UNIV_DEBUG
	/* Wipe contents of page to reveal possible stale pointers to it */
	memset(block->frame, '\0', srv_page_size);
#else
	/* Wipe page_no and space_id */
	memset(block->frame + FIL_PAGE_OFFSET, 0xfe, 4);
	memset(block->frame + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID, 0xfe, 4);
#endif /* UNIV_DEBUG */
	data = block->page.zip.data;

	if (data != NULL) {
		block->page.zip.data = NULL;
		buf_page_mutex_exit(block);
		buf_pool_mutex_exit_forbid(buf_pool);

		ut_ad(block->zip_size());

		buf_buddy_free(buf_pool, data, block->zip_size());

		buf_pool_mutex_exit_allow(buf_pool);
		buf_page_mutex_enter(block);

		page_zip_set_size(&block->page.zip, 0);
	}

	if (buf_pool->curr_size < buf_pool->old_size
	    && UT_LIST_GET_LEN(buf_pool->withdraw) < buf_pool->withdraw_target
	    && buf_block_will_withdrawn(buf_pool, block)) {
		/* This should be withdrawn */
		UT_LIST_ADD_LAST(
			buf_pool->withdraw,
			&block->page);
		ut_d(block->in_withdraw_list = TRUE);
	} else {
		UT_LIST_ADD_FIRST(buf_pool->free, &block->page);
		ut_d(block->page.in_free_list = TRUE);
	}

	UNIV_MEM_FREE(block->frame, srv_page_size);
}
/******************************************************************//**
Takes a block out of the LRU list and page hash table.
If the block is compressed-only (BUF_BLOCK_ZIP_PAGE),
the object will be freed.
The caller must hold buf_pool->mutex, the buf_page_get_mutex() mutex
and the appropriate hash_lock. This function will release the
buf_page_get_mutex() and the hash_lock.
If a compressed page is freed other compressed pages may be relocated.
@retval true if BUF_BLOCK_FILE_PAGE was removed from page_hash. The
caller needs to free the page to the free list
@retval false if BUF_BLOCK_ZIP_PAGE was removed from page_hash. In
this case the block is already returned to the buddy allocator. */
static
bool
buf_LRU_block_remove_hashed(
/*========================*/
	buf_page_t*	bpage,	/*!< in: block, must contain a file page and
				be in a state where it can be freed; there
				may or may not be a hash index to the page */
	bool		zip)	/*!< in: true if should remove also the
				compressed page of an uncompressed page */
{
	const buf_page_t*	hashed_bpage;
	buf_pool_t*		buf_pool = buf_pool_from_bpage(bpage);
	rw_lock_t*		hash_lock;

	ut_ad(buf_pool_mutex_own(buf_pool));
	ut_ad(mutex_own(buf_page_get_mutex(bpage)));

	hash_lock = buf_page_hash_lock_get(buf_pool, bpage->id);

	ut_ad(rw_lock_own(hash_lock, RW_LOCK_X));

	ut_a(buf_page_get_io_fix(bpage) == BUF_IO_NONE);
	ut_a(bpage->buf_fix_count == 0);

	buf_LRU_remove_block(bpage);

	buf_pool->freed_page_clock += 1;

	switch (buf_page_get_state(bpage)) {
	case BUF_BLOCK_FILE_PAGE:
		UNIV_MEM_ASSERT_W(bpage, sizeof(buf_block_t));
		UNIV_MEM_ASSERT_W(((buf_block_t*) bpage)->frame,
				  srv_page_size);
		buf_block_modify_clock_inc((buf_block_t*) bpage);
		if (bpage->zip.data) {
			const page_t*	page = ((buf_block_t*) bpage)->frame;

			ut_a(!zip || bpage->oldest_modification == 0);
			ut_ad(bpage->zip_size());

			switch (fil_page_get_type(page)) {
			case FIL_PAGE_TYPE_ALLOCATED:
			case FIL_PAGE_INODE:
			case FIL_PAGE_IBUF_BITMAP:
			case FIL_PAGE_TYPE_FSP_HDR:
			case FIL_PAGE_TYPE_XDES:
				/* These are essentially uncompressed pages. */
				if (!zip) {
					/* InnoDB writes the data to the
					uncompressed page frame. Copy it
					to the compressed page, which will
					be preserved. */
					memcpy(bpage->zip.data, page,
					       bpage->zip_size());
				}
				break;
			case FIL_PAGE_TYPE_ZBLOB:
			case FIL_PAGE_TYPE_ZBLOB2:
				break;
			case FIL_PAGE_INDEX:
			case FIL_PAGE_RTREE:
#if defined UNIV_ZIP_DEBUG && defined BTR_CUR_HASH_ADAPT
				ut_a(page_zip_validate(
					     &bpage->zip, page,
					     ((buf_block_t*) bpage)->index));
#endif /* UNIV_ZIP_DEBUG && BTR_CUR_HASH_ADAPT */
				break;
			default:
				ib::error() << "The compressed page to be"
					" evicted seems corrupt:";
				ut_print_buf(stderr, page, srv_page_size);

				ib::error() << "Possibly older version of"
					" the page:";

				ut_print_buf(stderr, bpage->zip.data,
					     bpage->zip_size());
				putc('\n', stderr);
				ut_error;
			}

			break;
		}
		/* fall through */
	case BUF_BLOCK_ZIP_PAGE:
		ut_a(bpage->oldest_modification == 0);
		UNIV_MEM_ASSERT_W(bpage->zip.data, bpage->zip_size());
		break;
	case BUF_BLOCK_POOL_WATCH:
	case BUF_BLOCK_ZIP_DIRTY:
	case BUF_BLOCK_NOT_USED:
	case BUF_BLOCK_READY_FOR_USE:
	case BUF_BLOCK_MEMORY:
	case BUF_BLOCK_REMOVE_HASH:
		ut_error;
		break;
	}

	hashed_bpage = buf_page_hash_get_low(buf_pool, bpage->id);
	if (bpage != hashed_bpage) {
		ib::error() << "Page " << bpage->id
			<< " not found in the hash table";
		ib::error()
#ifdef UNIV_DEBUG
			<< "in_page_hash:" << bpage->in_page_hash
			<< " in_zip_hash:" << bpage->in_zip_hash
			<< " in_flush_list:" << bpage->in_flush_list
			<< " in_LRU_list:" << bpage->in_LRU_list
#endif
			<< " zip.data:" << bpage->zip.data
			<< " zip_size:" << bpage->zip_size()
			<< " page_state:" << buf_page_get_state(bpage);

		if (hashed_bpage) {

			ib::error() << "In hash table we find block "
				<< hashed_bpage << " of " << hashed_bpage->id
				<< " which is not " << bpage;
		}

#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
		mutex_exit(buf_page_get_mutex(bpage));
		rw_lock_x_unlock(hash_lock);
		buf_pool_mutex_exit(buf_pool);
		buf_print();
		buf_LRU_print();
		buf_validate();
		buf_LRU_validate();
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
		ut_error;
	}

	ut_ad(!bpage->in_zip_hash);
	ut_ad(bpage->in_page_hash);
	ut_d(bpage->in_page_hash = FALSE);

	HASH_DELETE(buf_page_t, hash, buf_pool->page_hash, bpage->id.fold(),
		    bpage);

	switch (buf_page_get_state(bpage)) {
	case BUF_BLOCK_ZIP_PAGE:
		ut_ad(!bpage->in_free_list);
		ut_ad(!bpage->in_flush_list);
		ut_ad(!bpage->in_LRU_list);
		ut_a(bpage->zip.data);
		ut_a(bpage->zip.ssize);

#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
		UT_LIST_REMOVE(buf_pool->zip_clean, bpage);
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */

		mutex_exit(&buf_pool->zip_mutex);
		rw_lock_x_unlock(hash_lock);

		buf_pool_mutex_exit_forbid(buf_pool);

		buf_buddy_free(buf_pool, bpage->zip.data, bpage->zip_size());

		buf_pool_mutex_exit_allow(buf_pool);

		buf_page_free_descriptor(bpage);
		return(false);

	case BUF_BLOCK_FILE_PAGE:
		memset(((buf_block_t*) bpage)->frame
		       + FIL_PAGE_OFFSET, 0xff, 4);
		memset(((buf_block_t*) bpage)->frame
		       + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID, 0xff, 4);
		UNIV_MEM_INVALID(((buf_block_t*) bpage)->frame,
				 srv_page_size);
		buf_page_set_state(bpage, BUF_BLOCK_REMOVE_HASH);

		/* Question: If we release bpage and hash mutex here
		then what protects us against:
		1) Some other thread buffer fixing this page
		2) Some other thread trying to read this page and
		not finding it in buffer pool attempting to read it
		from the disk.
		Answer:
		1) Cannot happen because the page is no longer in the
		page_hash. Only possibility is when while invalidating
		a tablespace we buffer fix the prev_page in LRU to
		avoid relocation during the scan. But that is not
		possible because we are holding buf_pool mutex.
		2) Not possible because in buf_page_init_for_read()
		we do a look up of page_hash while holding buf_pool
		mutex and since we are holding buf_pool mutex here
		and by the time we'll release it in the caller we'd
		have inserted the compressed only descriptor in the
		page_hash. */
		rw_lock_x_unlock(hash_lock);
		mutex_exit(&((buf_block_t*) bpage)->mutex);

		if (zip && bpage->zip.data) {
			/* Free the compressed page. */
			void*	data = bpage->zip.data;
			bpage->zip.data = NULL;

			ut_ad(!bpage->in_free_list);
			ut_ad(!bpage->in_flush_list);
			ut_ad(!bpage->in_LRU_list);
			buf_pool_mutex_exit_forbid(buf_pool);

			buf_buddy_free(buf_pool, data, bpage->zip_size());

			buf_pool_mutex_exit_allow(buf_pool);

			page_zip_set_size(&bpage->zip, 0);
		}

		return(true);

	case BUF_BLOCK_POOL_WATCH:
	case BUF_BLOCK_ZIP_DIRTY:
	case BUF_BLOCK_NOT_USED:
	case BUF_BLOCK_READY_FOR_USE:
	case BUF_BLOCK_MEMORY:
	case BUF_BLOCK_REMOVE_HASH:
		break;
	}

	ut_error;
	return(false);
}
/******************************************************************//**
Puts a file page which has no hash index to the free list. */
static
void
buf_LRU_block_free_hashed_page(
/*===========================*/
	buf_block_t*	block)	/*!< in: block, must contain a file page and
				be in a state where it can be freed */
{
	buf_pool_t*	buf_pool = buf_pool_from_block(block);

	ut_ad(buf_pool_mutex_own(buf_pool));

	buf_page_mutex_enter(block);

	if (buf_pool->flush_rbt == NULL) {
		block->page.id
			= page_id_t(ULINT32_UNDEFINED, ULINT32_UNDEFINED);
	}

	buf_block_set_state(block, BUF_BLOCK_MEMORY);

	buf_LRU_block_free_non_file_page(block);

	buf_page_mutex_exit(block);
}

/** Remove one page from LRU list and put it to free list.
@param[in,out]	bpage		block, must contain a file page and be in
				a freeable state; there may or may not be a
				hash index to the page
@param[in]	old_page_id	page number before bpage->id was invalidated */
void buf_LRU_free_one_page(buf_page_t* bpage, page_id_t old_page_id)
{
	buf_pool_t*	buf_pool = buf_pool_from_bpage(bpage);
	rw_lock_t*	hash_lock = buf_page_hash_lock_get(buf_pool,
							   old_page_id);
	BPageMutex*	block_mutex = buf_page_get_mutex(bpage);

	ut_ad(buf_pool_mutex_own(buf_pool));

	rw_lock_x_lock(hash_lock);

	while (bpage->buf_fix_count > 0) {
		/* Wait for other threads to release the fix count
		before releasing the bpage from LRU list. */
	}

	mutex_enter(block_mutex);

	bpage->id = old_page_id;

	if (buf_LRU_block_remove_hashed(bpage, true)) {
		buf_LRU_block_free_hashed_page((buf_block_t*) bpage);
	}

	/* buf_LRU_block_remove_hashed() releases hash_lock and block_mutex */
	ut_ad(!rw_lock_own_flagged(hash_lock,
				   RW_LOCK_FLAG_X | RW_LOCK_FLAG_S));
	ut_ad(!mutex_own(block_mutex));
}
/**********************************************************************//**
Updates buf_pool->LRU_old_ratio for one buffer pool instance.
@return updated old_pct */
static
uint
buf_LRU_old_ratio_update_instance(
/*==============================*/
	buf_pool_t*	buf_pool,/*!< in: buffer pool instance */
	uint		old_pct,/*!< in: Reserve this percentage of
				the buffer pool for "old" blocks. */
	bool		adjust)	/*!< in: true=adjust the LRU list;
				false=just assign buf_pool->LRU_old_ratio
				during the initialization of InnoDB */
{
	uint	ratio;

	ratio = old_pct * BUF_LRU_OLD_RATIO_DIV / 100;
	if (ratio < BUF_LRU_OLD_RATIO_MIN) {
		ratio = BUF_LRU_OLD_RATIO_MIN;
	} else if (ratio > BUF_LRU_OLD_RATIO_MAX) {
		ratio = BUF_LRU_OLD_RATIO_MAX;
	}

	if (adjust) {
		buf_pool_mutex_enter(buf_pool);

		if (ratio != buf_pool->LRU_old_ratio) {
			buf_pool->LRU_old_ratio = ratio;

			if (UT_LIST_GET_LEN(buf_pool->LRU)
			    >= BUF_LRU_OLD_MIN_LEN) {

				buf_LRU_old_adjust_len(buf_pool);
			}
		}

		buf_pool_mutex_exit(buf_pool);
	} else {
		buf_pool->LRU_old_ratio = ratio;
	}
	/* the reverse of
	ratio = old_pct * BUF_LRU_OLD_RATIO_DIV / 100 */
	return((uint) (ratio * 100 / (double) BUF_LRU_OLD_RATIO_DIV + 0.5));
}
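
/* Round-trip example for the conversion above, assuming
BUF_LRU_OLD_RATIO_DIV == 1024: old_pct=37 gives
ratio = 37 * 1024 / 100 == 378 (integer division), and the return
value maps it back as (uint)(378 * 100 / 1024.0 + 0.5) == 37, so the
percentage reported to the user survives the internal
representation. (The constant is an assumption for illustration.) */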
  1785. /**********************************************************************//**
  1786. Updates buf_pool->LRU_old_ratio.
  1787. @return updated old_pct */
  1788. uint
  1789. buf_LRU_old_ratio_update(
  1790. /*=====================*/
  1791. uint old_pct,/*!< in: Reserve this percentage of
  1792. the buffer pool for "old" blocks. */
  1793. bool adjust) /*!< in: true=adjust the LRU list;
  1794. false=just assign buf_pool->LRU_old_ratio
  1795. during the initialization of InnoDB */
  1796. {
  1797. uint new_ratio = 0;
  1798. for (ulint i = 0; i < srv_buf_pool_instances; i++) {
  1799. buf_pool_t* buf_pool;
  1800. buf_pool = buf_pool_from_array(i);
  1801. new_ratio = buf_LRU_old_ratio_update_instance(
  1802. buf_pool, old_pct, adjust);
  1803. }
  1804. return(new_ratio);
  1805. }
/********************************************************************//**
Update the historical stats that we are collecting for LRU eviction
policy at the end of each interval. */
void
buf_LRU_stat_update(void)
/*=====================*/
{
	buf_LRU_stat_t*	item;
	buf_pool_t*	buf_pool;
	bool		evict_started = false;
	buf_LRU_stat_t	cur_stat;

	/* If we haven't started eviction yet then don't update stats. */
	for (ulint i = 0; i < srv_buf_pool_instances; i++) {

		buf_pool = buf_pool_from_array(i);

		if (buf_pool->freed_page_clock != 0) {
			evict_started = true;
			break;
		}
	}

	if (!evict_started) {
		goto func_exit;
	}

	/* Update the index. */
	item = &buf_LRU_stat_arr[buf_LRU_stat_arr_ind];
	buf_LRU_stat_arr_ind++;
	buf_LRU_stat_arr_ind %= BUF_LRU_STAT_N_INTERVAL;

	/* Add the current value and subtract the obsolete entry.
	Since buf_LRU_stat_cur is not protected by any mutex,
	it can change between the addition to buf_LRU_stat_sum
	and the copy to item. Copy it to a local variable first,
	so that the same value is used for both buf_LRU_stat_sum
	and item. */
	cur_stat = buf_LRU_stat_cur;

	buf_LRU_stat_sum.io += cur_stat.io - item->io;
	buf_LRU_stat_sum.unzip += cur_stat.unzip - item->unzip;

	/* Put current entry in the array. */
	memcpy(item, &cur_stat, sizeof *item);

func_exit:
	/* Clear the current entry. */
	memset(&buf_LRU_stat_cur, 0, sizeof buf_LRU_stat_cur);
}
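/* The function above maintains a moving sum over the last
BUF_LRU_STAT_N_INTERVAL intervals with O(1) work per update: add the
newest sample, subtract the sample that falls out of the window, and
overwrite the slot. A minimal self-contained sketch of the same
technique (illustrative names, not part of InnoDB):

	enum { WINDOW = 4 };
	static int	arr[WINDOW];	(per-interval samples)
	static int	sum;		(sum of the last WINDOW samples)
	static unsigned	ind;		(next slot to overwrite)

	static void window_update(int cur)
	{
		sum += cur - arr[ind];	(add newest, drop oldest)
		arr[ind] = cur;
		ind = (ind + 1) % WINDOW;
	}
*/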
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
/**********************************************************************//**
Validates the LRU list for one buffer pool instance. */
static
void
buf_LRU_validate_instance(
/*======================*/
	buf_pool_t*	buf_pool)
{
	ulint		old_len;
	ulint		new_len;

	buf_pool_mutex_enter(buf_pool);

	if (UT_LIST_GET_LEN(buf_pool->LRU) >= BUF_LRU_OLD_MIN_LEN) {

		ut_a(buf_pool->LRU_old);
		old_len = buf_pool->LRU_old_len;

		new_len = ut_min(UT_LIST_GET_LEN(buf_pool->LRU)
				 * buf_pool->LRU_old_ratio
				 / BUF_LRU_OLD_RATIO_DIV,
				 UT_LIST_GET_LEN(buf_pool->LRU)
				 - (BUF_LRU_OLD_TOLERANCE
				    + BUF_LRU_NON_OLD_MIN_LEN));

		ut_a(old_len >= new_len - BUF_LRU_OLD_TOLERANCE);
		ut_a(old_len <= new_len + BUF_LRU_OLD_TOLERANCE);
	}

	CheckInLRUList::validate(buf_pool);

	old_len = 0;

	for (buf_page_t* bpage = UT_LIST_GET_FIRST(buf_pool->LRU);
	     bpage != NULL;
	     bpage = UT_LIST_GET_NEXT(LRU, bpage)) {

		switch (buf_page_get_state(bpage)) {
		case BUF_BLOCK_POOL_WATCH:
		case BUF_BLOCK_NOT_USED:
		case BUF_BLOCK_READY_FOR_USE:
		case BUF_BLOCK_MEMORY:
		case BUF_BLOCK_REMOVE_HASH:
			/* These states must never occur in the LRU list. */
			ut_error;
			break;
		case BUF_BLOCK_FILE_PAGE:
			ut_ad(((buf_block_t*) bpage)->in_unzip_LRU_list
			      == buf_page_belongs_to_unzip_LRU(bpage));
			/* fall through */
		case BUF_BLOCK_ZIP_PAGE:
		case BUF_BLOCK_ZIP_DIRTY:
			break;
		}

		if (buf_page_is_old(bpage)) {
			const buf_page_t*	prev
				= UT_LIST_GET_PREV(LRU, bpage);
			const buf_page_t*	next
				= UT_LIST_GET_NEXT(LRU, bpage);

			if (!old_len++) {
				ut_a(buf_pool->LRU_old == bpage);
			} else {
				ut_a(!prev || buf_page_is_old(prev));
			}

			ut_a(!next || buf_page_is_old(next));
		}
	}

	ut_a(buf_pool->LRU_old_len == old_len);

	CheckInFreeList::validate(buf_pool);

	for (buf_page_t* bpage = UT_LIST_GET_FIRST(buf_pool->free);
	     bpage != NULL;
	     bpage = UT_LIST_GET_NEXT(list, bpage)) {

		ut_a(buf_page_get_state(bpage) == BUF_BLOCK_NOT_USED);
	}

	CheckUnzipLRUAndLRUList::validate(buf_pool);

	for (buf_block_t* block = UT_LIST_GET_FIRST(buf_pool->unzip_LRU);
	     block != NULL;
	     block = UT_LIST_GET_NEXT(unzip_LRU, block)) {

		ut_ad(block->in_unzip_LRU_list);
		ut_ad(block->page.in_LRU_list);
		ut_a(buf_page_belongs_to_unzip_LRU(&block->page));
	}

	buf_pool_mutex_exit(buf_pool);
}
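/* A numeric illustration of the assertion window in
buf_LRU_validate_instance() above (a sketch assuming the usual
constants BUF_LRU_OLD_RATIO_DIV == 1024, BUF_LRU_OLD_TOLERANCE == 20
and BUF_LRU_NON_OLD_MIN_LEN == 5):

	LRU length = 2000, LRU_old_ratio = 378 (about 37%)
	target     = 2000 * 378 / 1024 = 738
	cap        = 2000 - (20 + 5)   = 1975
	new_len    = ut_min(738, 1975) = 738

so buf_pool->LRU_old_len must lie within [718, 758]. The tolerance
lets buf_LRU_old_adjust_len() avoid moving the LRU_old pointer on
every single insertion or removal. */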
/**********************************************************************//**
Validates the LRU list.
@return TRUE */
ibool
buf_LRU_validate(void)
/*==================*/
{
	for (ulint i = 0; i < srv_buf_pool_instances; i++) {
		buf_pool_t*	buf_pool;

		buf_pool = buf_pool_from_array(i);
		buf_LRU_validate_instance(buf_pool);
	}

	return(TRUE);
}
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
#if defined UNIV_DEBUG_PRINT || defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
/**********************************************************************//**
Prints the LRU list for one buffer pool instance. */
static
void
buf_LRU_print_instance(
/*===================*/
	buf_pool_t*	buf_pool)
{
	buf_pool_mutex_enter(buf_pool);

	for (const buf_page_t* bpage = UT_LIST_GET_FIRST(buf_pool->LRU);
	     bpage != NULL;
	     bpage = UT_LIST_GET_NEXT(LRU, bpage)) {

		mutex_enter(buf_page_get_mutex(bpage));

		fprintf(stderr, "BLOCK space %u page %u ",
			bpage->id.space(), bpage->id.page_no());

		if (buf_page_is_old(bpage)) {
			fputs("old ", stderr);
		}

		if (bpage->buf_fix_count) {
			fprintf(stderr, "buf_fix count %u ",
				uint32_t(bpage->buf_fix_count));
		}

		if (buf_page_get_io_fix(bpage)) {
			fprintf(stderr, "io_fix %d ",
				buf_page_get_io_fix(bpage));
		}

		if (bpage->oldest_modification) {
			fputs("modif. ", stderr);
		}

		switch (buf_page_get_state(bpage)) {
		const byte*	frame;
		case BUF_BLOCK_FILE_PAGE:
			frame = buf_block_get_frame((buf_block_t*) bpage);
			fprintf(stderr, "\ntype %u index id " IB_ID_FMT "\n",
				fil_page_get_type(frame),
				btr_page_get_index_id(frame));
			break;
		case BUF_BLOCK_ZIP_PAGE:
			frame = bpage->zip.data;
			fprintf(stderr, "\ntype %u size " ULINTPF
				" index id " IB_ID_FMT "\n",
				fil_page_get_type(frame),
				bpage->zip_size(),
				btr_page_get_index_id(frame));
			break;

		default:
			fprintf(stderr, "\n!state %d!\n",
				buf_page_get_state(bpage));
			break;
		}

		mutex_exit(buf_page_get_mutex(bpage));
	}

	buf_pool_mutex_exit(buf_pool);
}
/**********************************************************************//**
Prints the LRU list. */
void
buf_LRU_print(void)
/*===============*/
{
	for (ulint i = 0; i < srv_buf_pool_instances; i++) {
		buf_pool_t*	buf_pool;

		buf_pool = buf_pool_from_array(i);
		buf_LRU_print_instance(buf_pool);
	}
}
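/* buf_LRU_print() is a debugging aid with no regular caller; the
usual way to reach it (an assumed workflow, not a documented
interface) is from a debugger attached to a debug build:

	(gdb) call buf_LRU_print()

This dumps every page of every LRU list to stderr, one line of flags
per page, in the per-page format printed above. */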
#endif /* UNIV_DEBUG_PRINT || UNIV_DEBUG || UNIV_BUF_DEBUG */