MDEV-17794 Do not assign persistent ID for temporary tables

InnoDB in MySQL 5.7 introduced two new parameters to the function dict_hdr_get_new_id(), to allow redo logging to be disabled when assigning identifiers to temporary tables or during the backup-unfriendly TRUNCATE TABLE that was replaced in MariaDB by MDEV-13564.

Now that MariaDB 10.4.0 removed the crash recovery code for the backup-unfriendly TRUNCATE, we can revert dict_hdr_get_new_id() to be used only for persistent data structures.

dict_table_assign_new_id(): Remove. This was a simple 2-line function that was called from a few places.

dict_table_open_on_id_low(): Declare in the only file where it is called.

dict_sys_t::temp_id_hash: A separate lookup table for temporary tables. Table names will be in the common dict_sys_t::table_hash.

dict_sys_t::get_temporary_table_id(): Assign a temporary table ID.

dict_sys_t::get_table(): Look up a persistent table.

dict_sys_t::get_temporary_table(): Look up a temporary table.

dict_sys_t::temp_table_id: The sequence of temporary table identifiers. Starts from DICT_HDR_FIRST_ID, so that we can continue to simply compare dict_table_t::id to a few constants for the persistent hard-coded data dictionary tables.

undo_node_t::state: Distinguish temporary and persistent tables.

lock_check_dict_lock(), lock_get_table_id(): Assert that there cannot be locks on temporary tables.

row_rec_to_index_entry_impl(): Assert that there cannot be metadata records on temporary tables.

row_undo_ins_parse_undo_rec(): Distinguish temporary and persistent tables. Move some assertions from the only caller. Return whether the table was found.

row_undo_ins(): Add some assertions.

row_undo_mod_clust(), row_undo_mod(): Do not assign node->state. Let row_undo() do that.

row_undo_mod_parse_undo_rec(): Distinguish temporary and persistent tables. Move some assertions from the only caller. Return whether the table was found.

row_undo_try_truncate(): Renamed and simplified from trx_roll_try_truncate().

row_undo_rec_get(): Replaces trx_roll_pop_top_rec_of_trx() and trx_roll_pop_top_rec(). Fetch an undo log record, and assign undo->state accordingly.

trx_undo_truncate_end(): Acquire the rseg->mutex only for the minimum required duration, and release it between mini-transactions.
7 years ago
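The split between persistent and temporary table identifiers described in MDEV-17794 can be pictured with a minimal sketch. Everything here is a hypothetical stand-in rather than the InnoDB code: the `Dictionary` and `Table` types, the use of `std::unordered_map`, and the value chosen for `FIRST_ID` are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; these are not the real InnoDB types.
using table_id_t = std::uint64_t;

// Assumed value, for illustration only: smaller identifiers are treated as
// reserved for the hard-coded data dictionary tables.
constexpr table_id_t FIRST_ID = 10;

struct Table {
  table_id_t id;
  std::string name;
  bool temporary;
};

class Dictionary {
 public:
  // Temporary table IDs come from an in-memory counter, so nothing needs to
  // be written to a persistent header or to the redo log.
  table_id_t get_temporary_table_id() { return temp_table_id_++; }

  // Register a table in the ID lookup table that matches its kind.
  void add(const Table& t) {
    Map& by_id = t.temporary ? temp_id_hash_ : table_id_hash_;
    const bool inserted = by_id.emplace(t.id, t).second;
    assert(inserted);
    (void)inserted;
  }

  // Look up a persistent table by ID; nullptr if not found.
  const Table* get_table(table_id_t id) const {
    return find(table_id_hash_, id);
  }

  // Look up a temporary table by ID; nullptr if not found.
  const Table* get_temporary_table(table_id_t id) const {
    return find(temp_id_hash_, id);
  }

 private:
  using Map = std::unordered_map<table_id_t, Table>;

  static const Table* find(const Map& m, table_id_t id) {
    const auto it = m.find(id);
    return it == m.end() ? nullptr : &it->second;
  }

  table_id_t temp_table_id_ = FIRST_ID;
  Map table_id_hash_;  // persistent tables, keyed by ID
  Map temp_id_hash_;   // temporary tables, keyed by ID
};
```

Because the two kinds of tables live in separate ID lookup tables, the same numeric ID can refer to one persistent and one temporary table without ambiguity, which is why the caller has to know which lookup to use.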
MDEV-14407 Assertion failure during rollback

Rollback attempted to dereference DB_ROLL_PTR=0, which cannot possibly be a valid undo log pointer. A safer canonical value would be roll_ptr_t(1) << ROLL_PTR_INSERT_FLAG_POS which is what was chosen in MDEV-12288, corresponding to reset_trx_id.

No deterministic test case for the bug was found. The simplest test cases may be related to MDEV-11415, which suppresses undo logging for ALGORITHM=COPY operations. In those operations, in the spirit of MDEV-12288, we should actually have written reset_trx_id instead of using the transaction identifier of the current transaction (and a bogus value of DB_ROLL_PTR=0). However, thanks to MySQL Bug#28432 which I had fixed in MySQL 5.6.8 as part of WL#6255, access to the rebuilt table by earlier-started transactions should actually have been refused with ER_TABLE_DEF_CHANGED.

reset_trx_id: Move the definition to data0type.cc and the declaration to data0type.h.

btr_cur_ins_lock_and_undo(): When undo logging is disabled, use the safe value that corresponds to reset_trx_id.

btr_cur_optimistic_insert(): Validate the DB_TRX_ID,DB_ROLL_PTR before inserting into a clustered index leaf page.

ins_node_t::sys_buf[]: Replaces row_id_buf and trx_id_buf and some heap usage.

row_ins_alloc_sys_fields(): Init ins_node_t::sys_buf[] to reset_trx_id.

row_ins_buf(): Only if undo logging is enabled, copy trx->id to node->sys_buf. Otherwise, rely on the initialization in row_ins_alloc_sys_fields().

row_purge_reset_trx_id(): Invoke mlog_write_string() with reset_trx_id directly. (No functional change.)

trx_undo_page_report_modify(): Assert that the DB_ROLL_PTR is not 0.

trx_undo_get_undo_rec_low(): Assert that the roll_ptr is valid before trying to dereference it.

dict_index_t::is_primary(): Check if the index is the primary key.

PageConverter::adjust_cluster_record(): Fix MDEV-15249 Crash in MVCC read after IMPORT TABLESPACE by resetting the system fields to reset_trx_id instead of writing the current transaction ID (which will be committed at the end of the IMPORT TABLESPACE) and DB_ROLL_PTR=0. This can partially be viewed as a follow-up fix of MDEV-12288, because IMPORT should already then have written DB_TRX_ID=0 and DB_ROLL_PTR=1<<55 to prevent unnecessary DB_TRX_ID lookups in subsequent accesses to the table.
8 years ago
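To make the canonical value in MDEV-14407 concrete, here is a small sketch of how DB_TRX_ID=0 and DB_ROLL_PTR=1<<55 would look as a byte string, assuming the usual 6-byte DB_TRX_ID and 7-byte DB_ROLL_PTR stored in big-endian order. The helper names (`write_be`, `make_sys_fields`, `reset_fields`) are hypothetical and are not InnoDB functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t TRX_ID_LEN = 6;     // DB_TRX_ID occupies 6 bytes
constexpr std::size_t ROLL_PTR_LEN = 7;   // DB_ROLL_PTR occupies 7 bytes
constexpr unsigned INSERT_FLAG_POS = 55;  // most significant of the 56 bits

using SysFields = std::array<std::uint8_t, TRX_ID_LEN + ROLL_PTR_LEN>;

// Store the low `len` bytes of `value`, most significant byte first.
inline void write_be(std::uint8_t* dst, std::uint64_t value, std::size_t len) {
  assert(len >= 1 && len <= 8);
  for (std::size_t i = 0; i < len; i++) {
    dst[i] = static_cast<std::uint8_t>(value >> (8 * (len - 1 - i)));
  }
}

// Build the DB_TRX_ID,DB_ROLL_PTR byte string for a clustered index record.
inline SysFields make_sys_fields(std::uint64_t trx_id, std::uint64_t roll_ptr) {
  SysFields buf{};
  write_be(buf.data(), trx_id, TRX_ID_LEN);
  write_be(buf.data() + TRX_ID_LEN, roll_ptr, ROLL_PTR_LEN);
  return buf;
}

// The "reset" value: DB_TRX_ID=0 and only the insert flag set in DB_ROLL_PTR.
inline SysFields reset_fields() {
  return make_sys_fields(0, std::uint64_t{1} << INSERT_FLAG_POS);
}
```

Under these assumptions the result is six zero bytes followed by 0x80 and six more zero bytes. Unlike DB_ROLL_PTR=0, this value has the insert flag set, so a reader can recognize it instead of trying to follow it as an undo log pointer.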
MDEV-13103 Deal with page_compressed page corruption

fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller to detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines.

fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter.

buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access.

buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers.

buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot().

buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing.

buf_page_io_complete(), buf_dblwr_process(): Check more failures.

fil_space_encrypt(): Simplify the debug checks.

fil_space_t::printed_compression_failure: Remove.

fil_get_compression_alg_name(): Remove.

fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables.

fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused).

AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
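The idea of reserving a temporary buffer slot through atomic accessors instead of a mutex, as described for buf_tmp_buffer_t::reserved above, can be sketched as follows. This is an illustration using `std::atomic`, which is an assumption; the `TmpSlot` class and `reserve_slot()` helper are hypothetical names, not the InnoDB implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical slot type; the flag is private and only reachable through
// acquire() and release().
class TmpSlot {
 public:
  // Try to reserve the slot. Returns true only for the one caller that
  // flipped the flag from false to true.
  bool acquire() {
    return !reserved_.exchange(true, std::memory_order_acquire);
  }

  // Give the slot back so that another caller can reserve it.
  void release() { reserved_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> reserved_{false};
};

// Scan the slots without holding any mutex; return the first free slot,
// or nullptr if every slot is currently reserved.
inline TmpSlot* reserve_slot(std::vector<TmpSlot>& slots) {
  for (TmpSlot& slot : slots) {
    if (slot.acquire()) {
      return &slot;
    }
  }
  return nullptr;
}
```

The atomic exchange makes the "test and set" a single step, so two threads scanning the slots concurrently cannot both reserve the same one.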
MDEV-12266: Change dict_table_t::space to fil_space_t*

InnoDB always keeps all tablespaces in the fil_system cache. The fil_system.LRU is only for closing file handles; the fil_space_t and fil_node_t for all data files will remain in main memory. Between startup and shutdown, they can only be created and removed by DDL statements. Therefore, we can let dict_table_t::space point directly to the fil_space_t.

dict_table_t::space_id: A numeric tablespace ID for the corner cases where we do not have a tablespace. The most prominent examples are ALTER TABLE...DISCARD TABLESPACE or a missing or corrupted file.

There are a few functional differences; most notably:
(1) DROP TABLE will delete matching .ibd and .cfg files, even if they were not attached to the data dictionary.
(2) Some error messages will report file names instead of numeric IDs.

There still are many functions that use numeric tablespace IDs instead of fil_space_t*, and many functions could be converted to fil_space_t member functions. Also, Tablespace and Datafile should be merged with fil_space_t and fil_node_t. page_id_t and buf_page_get_gen() could use fil_space_t& instead of a numeric ID, and after moving to a single buffer pool (MDEV-15058), buf_pool_t::page_hash could be moved to fil_space_t::page_hash.

FilSpace: Remove. Only a few calls to fil_space_acquire() will remain, and gradually they should be removed.

mtr_t::set_named_space_id(ulint): Renamed from set_named_space(), to prevent accidental calls to this slower function. Very few callers remain.

fseg_create(), fsp_reserve_free_extents(): Take fil_space_t* as a parameter instead of a space_id.

fil_space_t::rename(): Wrapper for fil_rename_tablespace_check(), fil_name_write_rename(), fil_rename_tablespace(). Mariabackup passes the parameter log=false; InnoDB passes log=true.

dict_mem_table_create(): Take fil_space_t* instead of space_id as parameter.

dict_process_sys_tables_rec_and_mtr_commit(): Replace the parameter 'status' with 'bool cached'.

dict_get_and_save_data_dir_path(): Avoid copying the fil_node_t::name.

fil_ibd_open(): Return the tablespace.

fil_space_t::set_imported(): Replaces fil_space_set_imported().

truncate_t: Change many member function parameters to fil_space_t*, and remove page_size parameters.

row_truncate_prepare(): Merge to its only caller.

row_drop_table_from_cache(): Assert that the table is persistent.

dict_create_sys_indexes_tuple(): Write SYS_INDEXES.SPACE=FIL_NULL if the tablespace has been discarded.

row_import_update_discarded_flag(): Remove a constant parameter.
8 years ago
MDEV-11623 MariaDB 10.1 fails to start datadir created with MariaDB 10.0/MySQL 5.6 using innodb-page-size!=16K

The storage format of FSP_SPACE_FLAGS was accidentally broken already in MariaDB 10.1.0. This fix is bringing the format in line with other MySQL and MariaDB release series.

Please refer to the comments that were added to fsp0fsp.h for details.

This is an INCOMPATIBLE CHANGE that affects users of page_compression and non-default innodb_page_size. Upgrading to this release will correct the flags in the data files. If you want to downgrade to earlier MariaDB 10.1.x, please refer to the test innodb.101_compatibility for how to reset the FSP_SPACE_FLAGS in the files.

NOTE: MariaDB 10.1.0 to 10.1.20 can misinterpret uncompressed data files with innodb_page_size=4k or 64k as compressed innodb_page_size=16k files, and then probably fail when trying to access the pages. See the comments in the function fsp_flags_convert_from_101() for detailed analysis.

Move PAGE_COMPRESSION to FSP_SPACE_FLAGS bit position 16. In this way, compressed innodb_page_size=16k tablespaces will not be mistaken for uncompressed ones by MariaDB 10.1.0 to 10.1.20.

Derive PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR from the dict_table_t::flags when the table is available, in fil_space_for_table_exists_in_mem() or fil_open_single_table_tablespace(). During crash recovery, fil_load_single_table_tablespace() will use innodb_compression_level for the PAGE_COMPRESSION_LEVEL.

FSP_FLAGS_MEM_MASK: A bitmap of the memory-only fil_space_t::flags that are not to be written to FSP_SPACE_FLAGS. Currently, these will include PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR.

Introduce the macro FSP_FLAGS_PAGE_SSIZE(). We only support one innodb_page_size for the whole instance.

When creating a dummy tablespace for the redo log, use fil_space_t::flags=0. The flags are never written to the redo log files.

Remove many FSP_FLAGS_SET_ macros.

dict_tf_verify_flags(): Remove. This is basically only duplicating the logic of dict_tf_to_fsp_flags(), used in a debug assertion.

fil_space_t::mark: Remove. This flag was not used for anything.

fil_space_for_table_exists_in_mem(): Remove the unnecessary parameter mark_space, and add a parameter for table flags. Check that fil_space_t::flags match the table flags, and adjust the (memory-only) flags based on the table flags.

fil_node_open_file(): Remove some redundant or unreachable conditions, do not use stderr for output, and avoid unnecessary server aborts.

fil_user_tablespace_restore_page(): Convert the flags, so that the correct page_size will be used when restoring a page from the doublewrite buffer.

fil_space_get_page_compressed(), fsp_flags_is_page_compressed(): Remove. It suffices to have fil_space_is_page_compressed().

FSP_FLAGS_WIDTH_DATA_DIR, FSP_FLAGS_WIDTH_PAGE_COMPRESSION_LEVEL, FSP_FLAGS_WIDTH_ATOMIC_WRITES: Remove, because these flags do not exist in the FSP_SPACE_FLAGS but only in memory.

fsp_flags_try_adjust(): New function, to adjust the FSP_SPACE_FLAGS in page 0. Called by fil_open_single_table_tablespace(), fil_space_for_table_exists_in_mem(), innobase_start_or_create_for_mysql() except if --innodb-read-only is active.

fsp_flags_is_valid(ulint): Reimplement from scratch, with accurate comments. Do not display any details of detected inconsistencies, because the output could be confusing when dealing with MariaDB 10.1.x data files.

fsp_flags_convert_from_101(ulint): Convert flags from buggy MariaDB 10.1.x format, or return ULINT_UNDEFINED if the flags cannot be in MariaDB 10.1.x format.

fsp_flags_match(): Check the flags when probing files. Implemented based on fsp_flags_is_valid() and fsp_flags_convert_from_101().

dict_check_tablespaces_and_store_max_id(): Do not access the page after committing the mini-transaction.

IMPORT TABLESPACE fixes:

AbstractCallback::init(): Convert the flags.

FetchIndexRootPages::operator(): Check that the tablespace flags match the table flags. Do not attempt to convert tablespace flags to table flags, because the conversion would necessarily be lossy.

PageConverter::update_header(): Write back the correct flags. This takes care of the flags in IMPORT TABLESPACE.
9 years ago
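The role of a mask such as FSP_FLAGS_MEM_MASK in MDEV-11623 can be illustrated with a short sketch. Only the position of PAGE_COMPRESSION (bit 16) comes from the commit message; the positions of the memory-only bits and all names below are assumptions chosen for the example, so the real layout in fsp0fsp.h should be consulted for the actual values.

```cpp
#include <cassert>
#include <cstdint>

// Bit 16 is stated in the commit message above.
constexpr std::uint32_t PAGE_COMPRESSION = 1U << 16;

// Assumed layout, for illustration only: every bit from MEM_SHIFT upwards
// exists only in memory and is never written to the data file.
constexpr unsigned MEM_SHIFT = 25;
constexpr std::uint32_t MEM_MASK = ~0U << MEM_SHIFT;
constexpr std::uint32_t MEM_DATA_DIR = 1U << MEM_SHIFT;

// Hypothetical helper: place a 4-bit compression level in the memory-only
// part of the flags.
constexpr std::uint32_t mem_compression_level(unsigned level) {
  return (level & 15U) << (MEM_SHIFT + 1);
}

// Strip the memory-only bits before the flags are written to page 0.
constexpr std::uint32_t flags_for_disk(std::uint32_t flags) {
  return flags & ~MEM_MASK;
}

// Two flag words describe the same file format if they agree once the
// memory-only bits are ignored.
constexpr bool same_on_disk(std::uint32_t a, std::uint32_t b) {
  return flags_for_disk(a) == flags_for_disk(b);
}
```

With such a mask, settings like the compression level can differ between the in-memory flags and the file without the file being considered mismatched, while a difference in a persistent bit such as PAGE_COMPRESSION is still detected.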
MDEV-12026: Implement innodb_checksum_algorithm=full_crc32

MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages.

Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums.

We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, no matter what the parameter innodb_checksum_algorithm is assigned to.

For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32.

ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them.

The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page.

In the full_crc32 format, there no longer are separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that are written to the file.

We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644.

This is joint work with Marko Mäkelä.
7 years ago
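As a rough illustration of the full_crc32 idea in MDEV-12026, the sketch below computes a CRC-32C over everything except the last four bytes of a page and stores the result in those four bytes. The bitwise CRC routine is a plain software version, not the hardware-accelerated code a server would use, and the big-endian storage order and the function names are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise software CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
inline std::uint32_t crc32c(const std::uint8_t* data, std::size_t len) {
  std::uint32_t crc = 0xFFFFFFFFU;
  for (std::size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++) {
      // Conditionally XOR the polynomial when the low bit is set.
      crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
    }
  }
  return ~crc;
}

constexpr std::size_t CHECKSUM_LEN = 4;  // checksum sits at the end of the page

// Compute the checksum over the whole page except its last 4 bytes and
// store it there (big-endian byte order is an assumption of this sketch).
inline void page_store_checksum(std::vector<std::uint8_t>& page) {
  assert(page.size() > CHECKSUM_LEN);
  const std::size_t payload = page.size() - CHECKSUM_LEN;
  const std::uint32_t crc = crc32c(page.data(), payload);
  for (std::size_t i = 0; i < CHECKSUM_LEN; i++) {
    page[payload + i] = static_cast<std::uint8_t>(crc >> (24 - 8 * i));
  }
}

// Recompute the checksum and compare it with the stored value.
inline bool page_checksum_ok(const std::vector<std::uint8_t>& page) {
  if (page.size() <= CHECKSUM_LEN) {
    return false;
  }
  const std::size_t payload = page.size() - CHECKSUM_LEN;
  std::uint32_t stored = 0;
  for (std::size_t i = 0; i < CHECKSUM_LEN; i++) {
    stored = (stored << 8) | page[payload + i];
  }
  return stored == crc32c(page.data(), payload);
}
```

Because a single checksum covers the bytes exactly as they are written to the file, it can be validated on the raw page image, which is what allows the check to happen before any decryption or decompression.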
MDEV-11623 MariaDB 10.1 fails to start datadir created with MariaDB 10.0/MySQL 5.6 using innodb-page-size!=16K The storage format of FSP_SPACE_FLAGS was accidentally broken already in MariaDB 10.1.0. This fix is bringing the format in line with other MySQL and MariaDB release series. Please refer to the comments that were added to fsp0fsp.h for details. This is an INCOMPATIBLE CHANGE that affects users of page_compression and non-default innodb_page_size. Upgrading to this release will correct the flags in the data files. If you want to downgrade to earlier MariaDB 10.1.x, please refer to the test innodb.101_compatibility how to reset the FSP_SPACE_FLAGS in the files. NOTE: MariaDB 10.1.0 to 10.1.20 can misinterpret uncompressed data files with innodb_page_size=4k or 64k as compressed innodb_page_size=16k files, and then probably fail when trying to access the pages. See the comments in the function fsp_flags_convert_from_101() for detailed analysis. Move PAGE_COMPRESSION to FSP_SPACE_FLAGS bit position 16. In this way, compressed innodb_page_size=16k tablespaces will not be mistaken for uncompressed ones by MariaDB 10.1.0 to 10.1.20. Derive PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR from the dict_table_t::flags when the table is available, in fil_space_for_table_exists_in_mem() or fil_open_single_table_tablespace(). During crash recovery, fil_load_single_table_tablespace() will use innodb_compression_level for the PAGE_COMPRESSION_LEVEL. FSP_FLAGS_MEM_MASK: A bitmap of the memory-only fil_space_t::flags that are not to be written to FSP_SPACE_FLAGS. Currently, these will include PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR. Introduce the macro FSP_FLAGS_PAGE_SSIZE(). We only support one innodb_page_size for the whole instance. When creating a dummy tablespace for the redo log, use fil_space_t::flags=0. The flags are never written to the redo log files. Remove many FSP_FLAGS_SET_ macros. dict_tf_verify_flags(): Remove. 
This is basically only duplicating the logic of dict_tf_to_fsp_flags(), used in a debug assertion. fil_space_t::mark: Remove. This flag was not used for anything. fil_space_for_table_exists_in_mem(): Remove the unnecessary parameter mark_space, and add a parameter for table flags. Check that fil_space_t::flags match the table flags, and adjust the (memory-only) flags based on the table flags. fil_node_open_file(): Remove some redundant or unreachable conditions, do not use stderr for output, and avoid unnecessary server aborts. fil_user_tablespace_restore_page(): Convert the flags, so that the correct page_size will be used when restoring a page from the doublewrite buffer. fil_space_get_page_compressed(), fsp_flags_is_page_compressed(): Remove. It suffices to have fil_space_is_page_compressed(). FSP_FLAGS_WIDTH_DATA_DIR, FSP_FLAGS_WIDTH_PAGE_COMPRESSION_LEVEL, FSP_FLAGS_WIDTH_ATOMIC_WRITES: Remove, because these flags do not exist in the FSP_SPACE_FLAGS but only in memory. fsp_flags_try_adjust(): New function, to adjust the FSP_SPACE_FLAGS in page 0. Called by fil_open_single_table_tablespace(), fil_space_for_table_exists_in_mem(), innobase_start_or_create_for_mysql() except if --innodb-read-only is active. fsp_flags_is_valid(ulint): Reimplement from the scratch, with accurate comments. Do not display any details of detected inconsistencies, because the output could be confusing when dealing with MariaDB 10.1.x data files. fsp_flags_convert_from_101(ulint): Convert flags from buggy MariaDB 10.1.x format, or return ULINT_UNDEFINED if the flags cannot be in MariaDB 10.1.x format. fsp_flags_match(): Check the flags when probing files. Implemented based on fsp_flags_is_valid() and fsp_flags_convert_from_101(). dict_check_tablespaces_and_store_max_id(): Do not access the page after committing the mini-transaction. IMPORT TABLESPACE fixes: AbstractCallback::init(): Convert the flags. FetchIndexRootPages::operator(): Check that the tablespace flags match the table flags. 
Do not attempt to convert tablespace flags to table flags, because the conversion would necessarily be lossy. PageConverter::update_header(): Write back the correct flags. This takes care of the flags in IMPORT TABLESPACE.
9 years ago
MDEV-11623 MariaDB 10.1 fails to start datadir created with MariaDB 10.0/MySQL 5.6 using innodb-page-size!=16K The storage format of FSP_SPACE_FLAGS was accidentally broken already in MariaDB 10.1.0. This fix is bringing the format in line with other MySQL and MariaDB release series. Please refer to the comments that were added to fsp0fsp.h for details. This is an INCOMPATIBLE CHANGE that affects users of page_compression and non-default innodb_page_size. Upgrading to this release will correct the flags in the data files. If you want to downgrade to earlier MariaDB 10.1.x, please refer to the test innodb.101_compatibility how to reset the FSP_SPACE_FLAGS in the files. NOTE: MariaDB 10.1.0 to 10.1.20 can misinterpret uncompressed data files with innodb_page_size=4k or 64k as compressed innodb_page_size=16k files, and then probably fail when trying to access the pages. See the comments in the function fsp_flags_convert_from_101() for detailed analysis. Move PAGE_COMPRESSION to FSP_SPACE_FLAGS bit position 16. In this way, compressed innodb_page_size=16k tablespaces will not be mistaken for uncompressed ones by MariaDB 10.1.0 to 10.1.20. Derive PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR from the dict_table_t::flags when the table is available, in fil_space_for_table_exists_in_mem() or fil_open_single_table_tablespace(). During crash recovery, fil_load_single_table_tablespace() will use innodb_compression_level for the PAGE_COMPRESSION_LEVEL. FSP_FLAGS_MEM_MASK: A bitmap of the memory-only fil_space_t::flags that are not to be written to FSP_SPACE_FLAGS. Currently, these will include PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR. Introduce the macro FSP_FLAGS_PAGE_SSIZE(). We only support one innodb_page_size for the whole instance. When creating a dummy tablespace for the redo log, use fil_space_t::flags=0. The flags are never written to the redo log files. Remove many FSP_FLAGS_SET_ macros. dict_tf_verify_flags(): Remove. 
This is basically only duplicating the logic of dict_tf_to_fsp_flags(), used in a debug assertion. fil_space_t::mark: Remove. This flag was not used for anything. fil_space_for_table_exists_in_mem(): Remove the unnecessary parameter mark_space, and add a parameter for table flags. Check that fil_space_t::flags match the table flags, and adjust the (memory-only) flags based on the table flags. fil_node_open_file(): Remove some redundant or unreachable conditions, do not use stderr for output, and avoid unnecessary server aborts. fil_user_tablespace_restore_page(): Convert the flags, so that the correct page_size will be used when restoring a page from the doublewrite buffer. fil_space_get_page_compressed(), fsp_flags_is_page_compressed(): Remove. It suffices to have fil_space_is_page_compressed(). FSP_FLAGS_WIDTH_DATA_DIR, FSP_FLAGS_WIDTH_PAGE_COMPRESSION_LEVEL, FSP_FLAGS_WIDTH_ATOMIC_WRITES: Remove, because these flags do not exist in the FSP_SPACE_FLAGS but only in memory. fsp_flags_try_adjust(): New function, to adjust the FSP_SPACE_FLAGS in page 0. Called by fil_open_single_table_tablespace(), fil_space_for_table_exists_in_mem(), innobase_start_or_create_for_mysql() except if --innodb-read-only is active. fsp_flags_is_valid(ulint): Reimplement from the scratch, with accurate comments. Do not display any details of detected inconsistencies, because the output could be confusing when dealing with MariaDB 10.1.x data files. fsp_flags_convert_from_101(ulint): Convert flags from buggy MariaDB 10.1.x format, or return ULINT_UNDEFINED if the flags cannot be in MariaDB 10.1.x format. fsp_flags_match(): Check the flags when probing files. Implemented based on fsp_flags_is_valid() and fsp_flags_convert_from_101(). dict_check_tablespaces_and_store_max_id(): Do not access the page after committing the mini-transaction. IMPORT TABLESPACE fixes: AbstractCallback::init(): Convert the flags. FetchIndexRootPages::operator(): Check that the tablespace flags match the table flags. 
Do not attempt to convert tablespace flags to table flags, because the conversion would necessarily be lossy. PageConverter::update_header(): Write back the correct flags. This takes care of the flags in IMPORT TABLESPACE.
9 years ago
MDEV-12266: Change dict_table_t::space to fil_space_t* InnoDB always keeps all tablespaces in the fil_system cache. The fil_system.LRU is only for closing file handles; the fil_space_t and fil_node_t for all data files will remain in main memory. Between startup and shutdown, they can only be created and removed by DDL statements. Therefore, we can let dict_table_t::space point directly to the fil_space_t. dict_table_t::space_id: A numeric tablespace ID for the corner cases where we do not have a tablespace. The most prominent examples are ALTER TABLE...DISCARD TABLESPACE or a missing or corrupted file. There are a few functional differences; most notably: (1) DROP TABLE will delete matching .ibd and .cfg files, even if they were not attached to the data dictionary. (2) Some error messages will report file names instead of numeric IDs. There still are many functions that use numeric tablespace IDs instead of fil_space_t*, and many functions could be converted to fil_space_t member functions. Also, Tablespace and Datafile should be merged with fil_space_t and fil_node_t. page_id_t and buf_page_get_gen() could use fil_space_t& instead of a numeric ID, and after moving to a single buffer pool (MDEV-15058), buf_pool_t::page_hash could be moved to fil_space_t::page_hash. FilSpace: Remove. Only a few calls to fil_space_acquire() will remain, and gradually they should be removed. mtr_t::set_named_space_id(ulint): Renamed from set_named_space(), to prevent accidental calls to this slower function. Very few callers remain. fseg_create(), fsp_reserve_free_extents(): Take fil_space_t* as a parameter instead of a space_id. fil_space_t::rename(): Wrapper for fil_rename_tablespace_check(), fil_name_write_rename(), fil_rename_tablespace(). Mariabackup passes the parameter log=false; InnoDB passes log=true. dict_mem_table_create(): Take fil_space_t* instead of space_id as parameter. dict_process_sys_tables_rec_and_mtr_commit(): Replace the parameter 'status' with 'bool cached'.
dict_get_and_save_data_dir_path(): Avoid copying the fil_node_t::name. fil_ibd_open(): Return the tablespace. fil_space_t::set_imported(): Replaces fil_space_set_imported(). truncate_t: Change many member function parameters to fil_space_t*, and remove page_size parameters. row_truncate_prepare(): Merge to its only caller. row_drop_table_from_cache(): Assert that the table is persistent. dict_create_sys_indexes_tuple(): Write SYS_INDEXES.SPACE=FIL_NULL if the tablespace has been discarded. row_import_update_discarded_flag(): Remove a constant parameter.
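The relationship between dict_table_t::space and dict_table_t::space_id described in MDEV-12266 can be illustrated with a minimal sketch. The types space_sketch and table_sketch and their members are hypothetical stand-ins, not the real fil_space_t or dict_table_t; the only idea taken from the commit message is that the table holds a direct pointer when a tablespace exists and falls back to a numeric ID when it was discarded or is missing.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-in for fil_space_t.
struct space_sketch {
    uint32_t id;
    std::string name;
};

// Hypothetical, simplified stand-in for dict_table_t.
struct table_sketch {
    space_sketch* space = nullptr; // direct pointer; nullptr if discarded or missing
    uint32_t space_id = 0;         // numeric ID kept for the corner cases

    // Attach a tablespace; the numeric ID must agree with the object.
    void attach(space_sketch& s) {
        assert(s.id == space_id);
        space = &s;
    }

    // Simulates ALTER TABLE...DISCARD TABLESPACE: only the ID survives.
    void discard() { space = nullptr; }

    // Report a file name when available, otherwise the numeric ID.
    std::string describe() const {
        return space ? space->name
                     : "tablespace " + std::to_string(space_id);
    }
};
```

With such a layout, code that needs the tablespace follows the pointer without a lookup by ID, and only the paths that handle a discarded or missing file need to consult space_id.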
8 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller to detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
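The mutex-free slot reservation mentioned for buf_tmp_buffer_t::reserved and buf_pool_reserve_tmp_slot() can be sketched roughly as follows. This is an illustration only, assuming a fixed array of slots and std::atomic; the names tmp_slot_sketch and reserve_slot are hypothetical and do not reproduce the actual InnoDB code.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for buf_tmp_buffer_t with a private 'reserved' flag.
class tmp_slot_sketch {
    std::atomic<bool> reserved{false};
public:
    // Try to reserve the slot; returns true only for the caller that
    // flipped the flag from false to true.
    bool acquire() {
        return !reserved.exchange(true, std::memory_order_acquire);
    }
    // Give the slot back; it must currently be reserved.
    void release() {
        assert(reserved.load(std::memory_order_relaxed));
        reserved.store(false, std::memory_order_release);
    }
};

// Scan the slots without holding any mutex and return the first one
// that could be reserved, or nullptr if all are busy.
template <std::size_t N>
tmp_slot_sketch* reserve_slot(std::array<tmp_slot_sketch, N>& slots) {
    for (tmp_slot_sketch& slot : slots) {
        if (slot.acquire()) {
            return &slot;
        }
    }
    return nullptr;
}
```

Because the exchange is a single atomic read-modify-write, two threads racing for the same slot cannot both see it as free, which is what makes a separate mutex unnecessary in this sketch.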
7 years ago
MDEV-11623 MariaDB 10.1 fails to start datadir created with MariaDB 10.0/MySQL 5.6 using innodb-page-size!=16K The storage format of FSP_SPACE_FLAGS was accidentally broken already in MariaDB 10.1.0. This fix is bringing the format in line with other MySQL and MariaDB release series. Please refer to the comments that were added to fsp0fsp.h for details. This is an INCOMPATIBLE CHANGE that affects users of page_compression and non-default innodb_page_size. Upgrading to this release will correct the flags in the data files. If you want to downgrade to earlier MariaDB 10.1.x, please refer to the test innodb.101_compatibility for how to reset the FSP_SPACE_FLAGS in the files. NOTE: MariaDB 10.1.0 to 10.1.20 can misinterpret uncompressed data files with innodb_page_size=4k or 64k as compressed innodb_page_size=16k files, and then probably fail when trying to access the pages. See the comments in the function fsp_flags_convert_from_101() for detailed analysis. Move PAGE_COMPRESSION to FSP_SPACE_FLAGS bit position 16. In this way, compressed innodb_page_size=16k tablespaces will not be mistaken for uncompressed ones by MariaDB 10.1.0 to 10.1.20. Derive PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR from the dict_table_t::flags when the table is available, in fil_space_for_table_exists_in_mem() or fil_open_single_table_tablespace(). During crash recovery, fil_load_single_table_tablespace() will use innodb_compression_level for the PAGE_COMPRESSION_LEVEL. FSP_FLAGS_MEM_MASK: A bitmap of the memory-only fil_space_t::flags that are not to be written to FSP_SPACE_FLAGS. Currently, these will include PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR. Introduce the macro FSP_FLAGS_PAGE_SSIZE(). We only support one innodb_page_size for the whole instance. When creating a dummy tablespace for the redo log, use fil_space_t::flags=0. The flags are never written to the redo log files. Remove many FSP_FLAGS_SET_ macros. dict_tf_verify_flags(): Remove.
This is basically only duplicating the logic of dict_tf_to_fsp_flags(), used in a debug assertion. fil_space_t::mark: Remove. This flag was not used for anything. fil_space_for_table_exists_in_mem(): Remove the unnecessary parameter mark_space, and add a parameter for table flags. Check that fil_space_t::flags match the table flags, and adjust the (memory-only) flags based on the table flags. fil_node_open_file(): Remove some redundant or unreachable conditions, do not use stderr for output, and avoid unnecessary server aborts. fil_user_tablespace_restore_page(): Convert the flags, so that the correct page_size will be used when restoring a page from the doublewrite buffer. fil_space_get_page_compressed(), fsp_flags_is_page_compressed(): Remove. It suffices to have fil_space_is_page_compressed(). FSP_FLAGS_WIDTH_DATA_DIR, FSP_FLAGS_WIDTH_PAGE_COMPRESSION_LEVEL, FSP_FLAGS_WIDTH_ATOMIC_WRITES: Remove, because these flags do not exist in the FSP_SPACE_FLAGS but only in memory. fsp_flags_try_adjust(): New function, to adjust the FSP_SPACE_FLAGS in page 0. Called by fil_open_single_table_tablespace(), fil_space_for_table_exists_in_mem(), innobase_start_or_create_for_mysql() except if --innodb-read-only is active. fsp_flags_is_valid(ulint): Reimplement from scratch, with accurate comments. Do not display any details of detected inconsistencies, because the output could be confusing when dealing with MariaDB 10.1.x data files. fsp_flags_convert_from_101(ulint): Convert flags from buggy MariaDB 10.1.x format, or return ULINT_UNDEFINED if the flags cannot be in MariaDB 10.1.x format. fsp_flags_match(): Check the flags when probing files. Implemented based on fsp_flags_is_valid() and fsp_flags_convert_from_101(). dict_check_tablespaces_and_store_max_id(): Do not access the page after committing the mini-transaction. IMPORT TABLESPACE fixes: AbstractCallback::init(): Convert the flags. FetchIndexRootPages::operator(): Check that the tablespace flags match the table flags.
Do not attempt to convert tablespace flags to table flags, because the conversion would necessarily be lossy. PageConverter::update_header(): Write back the correct flags. This takes care of the flags in IMPORT TABLESPACE.
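The role of FSP_FLAGS_MEM_MASK, a bitmap of memory-only flags that must not reach FSP_SPACE_FLAGS on disk, can be shown with a small sketch. Only the PAGE_COMPRESSION bit position 16 comes from the commit message; the value of MEM_MASK_SKETCH and both helper functions are assumptions made for illustration and do not reproduce the definitions in fsp0fsp.h.

```cpp
#include <cassert>
#include <cstdint>

// Bit 16 for PAGE_COMPRESSION is stated in the commit message.
constexpr uint32_t PAGE_COMPRESSION_SKETCH = 1U << 16;

// Assumption for illustration only: memory-only flags (such as
// PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES, DATA_DIR) occupy the top bits.
constexpr uint32_t MEM_MASK_SKETCH = 0xFE000000U;

// Flags as they would be written to page 0: memory-only bits removed.
constexpr uint32_t flags_for_disk(uint32_t mem_flags) {
    return mem_flags & ~MEM_MASK_SKETCH;
}

// Two flag words describe the same file format if they differ at most
// in the memory-only bits.
constexpr bool flags_match_on_disk(uint32_t a, uint32_t b) {
    return ((a ^ b) & ~MEM_MASK_SKETCH) == 0;
}

static_assert(flags_for_disk(MEM_MASK_SKETCH) == 0,
              "memory-only bits never reach the data file");
```

Comparing flags only outside the mask is what would let an in-memory tablespace carry extra per-table settings while still matching the flags stored in the file.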
9 years ago
MDEV-12026: Implement innodb_checksum_algorithm=full_crc32 MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages. Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums. We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, no matter what the parameter innodb_checksum_algorithm is assigned to. For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32. ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them. The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page. 
In the full_crc32 format, there no longer are separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that are written to the file. We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644. This is joint work with Marko Mäkelä.
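The idea of a single CRC-32C over the entire page, excluding only the trailing bytes that hold the checksum, can be sketched as below. This is a minimal bitwise illustration, not the InnoDB implementation: the function names are hypothetical, the 4-byte big-endian storage at the end of the page is an assumption, and a production build would use a table-driven or hardware-accelerated CRC-32C.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
uint32_t crc32c(const uint8_t* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFU;
    for (std::size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
        }
    }
    return ~crc;
}

// Hypothetical helper: checksum everything except the last 4 bytes and
// store the result there (big-endian byte order is an assumption).
void page_store_checksum(std::vector<uint8_t>& page) {
    assert(page.size() > 4);
    const std::size_t n = page.size() - 4;
    const uint32_t c = crc32c(page.data(), n);
    for (std::size_t i = 0; i < 4; i++) {
        page[n + i] = static_cast<uint8_t>(c >> (8 * (3 - i)));
    }
}

// Hypothetical helper: recompute the checksum and compare it with the
// stored value, without interpreting the payload in any way.
bool page_checksum_ok(const std::vector<uint8_t>& page) {
    assert(page.size() > 4);
    const std::size_t n = page.size() - 4;
    uint32_t stored = 0;
    for (std::size_t i = 0; i < 4; i++) {
        stored = (stored << 8) | page[n + i];
    }
    return stored == crc32c(page.data(), n);
}
```

Since the validation touches only the bytes as written to the file, it works the same way on an encrypted payload, which mirrors the point that the checksum can be verified without decrypting the page.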
7 years ago
MDEV-14407 Assertion failure during rollback Rollback attempted to dereference DB_ROLL_PTR=0, which cannot possibly be a valid undo log pointer. A safer canonical value would be roll_ptr_t(1) << ROLL_PTR_INSERT_FLAG_POS which is what was chosen in MDEV-12288, corresponding to reset_trx_id. No deterministic test case for the bug was found. The simplest test cases may be related to MDEV-11415, which suppresses undo logging for ALGORITHM=COPY operations. In those operations, in the spirit of MDEV-12288, we should actually have written reset_trx_id instead of using the transaction identifier of the current transaction (and a bogus value of DB_ROLL_PTR=0). However, thanks to MySQL Bug#28432 which I had fixed in MySQL 5.6.8 as part of WL#6255, access to the rebuilt table by earlier-started transactions should actually have been refused with ER_TABLE_DEF_CHANGED. reset_trx_id: Move the definition to data0type.cc and the declaration to data0type.h. btr_cur_ins_lock_and_undo(): When undo logging is disabled, use the safe value that corresponds to reset_trx_id. btr_cur_optimistic_insert(): Validate the DB_TRX_ID,DB_ROLL_PTR before inserting into a clustered index leaf page. ins_node_t::sys_buf[]: Replaces row_id_buf and trx_id_buf and some heap usage. row_ins_alloc_sys_fields(): Init ins_node_t::sys_buf[] to reset_trx_id. row_ins_buf(): Only if undo logging is enabled, copy trx->id to node->sys_buf. Otherwise, rely on the initialization in row_ins_alloc_sys_fields(). row_purge_reset_trx_id(): Invoke mlog_write_string() with reset_trx_id directly. (No functional change.) trx_undo_page_report_modify(): Assert that the DB_ROLL_PTR is not 0. trx_undo_get_undo_rec_low(): Assert that the roll_ptr is valid before trying to dereference it. dict_index_t::is_primary(): Check if the index is the primary key. 
PageConverter::adjust_cluster_record(): Fix MDEV-15249 Crash in MVCC read after IMPORT TABLESPACE by resetting the system fields to reset_trx_id instead of writing the current transaction ID (which will be committed at the end of the IMPORT TABLESPACE) and DB_ROLL_PTR=0. This can partially be viewed as a follow-up fix of MDEV-12288, because IMPORT should already then have written DB_TRX_ID=0 and DB_ROLL_PTR=1<<55 to prevent unnecessary DB_TRX_ID lookups in subsequent accesses to the table.
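The canonical value behind reset_trx_id, DB_TRX_ID=0 followed by DB_ROLL_PTR=1<<55, can be written out byte by byte. The sketch below assumes the usual InnoDB field widths of 6 bytes for DB_TRX_ID and 7 bytes for DB_ROLL_PTR with big-endian storage; the constant and function names are hypothetical and are not the definitions from data0type.cc.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t TRX_ID_LEN_SKETCH = 6;   // assumed width of DB_TRX_ID
constexpr std::size_t ROLL_PTR_LEN_SKETCH = 7; // assumed width of DB_ROLL_PTR
constexpr unsigned INSERT_FLAG_POS_SKETCH = 55; // bit position named in the text

// Build the 13 bytes DB_TRX_ID=0, DB_ROLL_PTR=1<<55.
std::array<uint8_t, TRX_ID_LEN_SKETCH + ROLL_PTR_LEN_SKETCH>
make_reset_sys_fields() {
    std::array<uint8_t, TRX_ID_LEN_SKETCH + ROLL_PTR_LEN_SKETCH> buf{};
    const uint64_t roll_ptr = uint64_t{1} << INSERT_FLAG_POS_SKETCH;
    assert(roll_ptr != 0); // 0 can never be a valid undo log pointer
    for (std::size_t i = 0; i < ROLL_PTR_LEN_SKETCH; i++) {
        // Most significant of the 7 bytes first.
        buf[TRX_ID_LEN_SKETCH + i] = static_cast<uint8_t>(
            roll_ptr >> (8 * (ROLL_PTR_LEN_SKETCH - 1 - i)));
    }
    return buf;
}

// True if the 'insert' flag is set, meaning there is no older version
// of the record to look up in the undo log.
constexpr bool roll_ptr_is_insert(uint64_t roll_ptr) {
    return ((roll_ptr >> INSERT_FLAG_POS_SKETCH) & 1U) != 0;
}
```

Under these assumptions the buffer is six zero bytes, then 0x80, then six more zero bytes, so a reader can recognize the record as a fresh insert instead of trying to follow a null pointer.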
8 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. 
(If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple_t::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). 
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). 
innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. 
Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is written. btr_index_rec_validate(): Account for instant ADD COLUMN. row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). 
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them. rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields. btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'. row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t. rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'. REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. rec_comp_status_t: An enum of the status bit values. rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
8 years ago
MDEV-15662 Instant DROP COLUMN or changing the order of columns

Allow ADD COLUMN anywhere in a table, not only adding as the last column. Allow instant DROP COLUMN and instant changing the order of columns.

The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible.

Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of mblob is indicated by setting the delete-mark flag in the metadata record.

The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position.

Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields.

For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob.
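The metadata BLOB contents described above (a field count followed by one descriptor per clustered index field) can be sketched as follows. This is only an illustration: the FieldInfo structure, the 4-byte big-endian count, the 2-byte descriptors and the bit assignments are all assumptions made for this example, not the on-disk format used by dict_table_t::serialise_columns().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory descriptor of one clustered index field.
struct FieldInfo {
    bool dropped;        // was the column instantly dropped?
    bool not_null;       // dropped columns only: NOT NULL flag
    uint16_t fixed_len;  // dropped columns only: fixed length, 0 if variable
    bool big_len;        // dropped variable-length columns: max length > 255
    uint16_t position;   // surviving columns only: column position
};

// Assumed bit layout of the 2-byte descriptor (illustrative only).
constexpr uint16_t DROPPED = 1U << 15;
constexpr uint16_t NOT_NULL = 1U << 14;
constexpr uint16_t BIG = 1U << 13;
constexpr uint16_t LEN_MASK = 0x1fff;
constexpr uint16_t POS_MASK = 0x7fff;

// Write the number of fields, then one 2-byte descriptor per field.
std::vector<uint8_t> serialise(const std::vector<FieldInfo>& fields) {
    std::vector<uint8_t> out;
    const uint32_t n = static_cast<uint32_t>(fields.size());
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(n >> shift));
    }
    for (const FieldInfo& f : fields) {
        unsigned w;
        if (f.dropped) {
            w = DROPPED | (f.not_null ? NOT_NULL : 0U)
                | (f.big_len ? BIG : 0U) | (f.fixed_len & LEN_MASK);
        } else {
            w = f.position & POS_MASK;
        }
        out.push_back(static_cast<uint8_t>(w >> 8));
        out.push_back(static_cast<uint8_t>(w));
    }
    return out;
}

// Parse the buffer back; returns false if the size does not match the count.
bool deserialise(const std::vector<uint8_t>& in, std::vector<FieldInfo>& fields) {
    if (in.size() < 4) return false;
    const uint32_t n = uint32_t(in[0]) << 24 | uint32_t(in[1]) << 16
        | uint32_t(in[2]) << 8 | uint32_t(in[3]);
    if (in.size() != 4 + std::size_t(n) * 2) return false;
    fields.clear();
    for (std::size_t i = 0; i < n; i++) {
        const unsigned w = unsigned(in[4 + 2 * i]) << 8 | in[5 + 2 * i];
        FieldInfo f{};
        f.dropped = (w & DROPPED) != 0;
        if (f.dropped) {
            f.not_null = (w & NOT_NULL) != 0;
            f.big_len = (w & BIG) != 0;
            f.fixed_len = static_cast<uint16_t>(w & LEN_MASK);
        } else {
            f.position = static_cast<uint16_t>(w & POS_MASK);
        }
        fields.push_back(f);
    }
    return true;
}
```

Keeping one descriptor per field, including dropped ones, is what would let a reader reconstruct the record layout without consulting SYS_COLUMNS for columns that no longer exist there.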
If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from the wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags.

Limitations (future work):

MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
MDEV-17468 Avoid table rebuild on operations on generated columns
MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large

btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure.

dict_col_t::DROPPED: Magic value for dict_col_t::ind.

dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN".

dict_col_t::is_added(): Renamed from dict_col_t::is_instant().

dtype_t::metadata_blob_init(): Initialize the mblob data type.

dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record.
dict_table_t::instant: Metadata about dropped or reordered columns. dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called. dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well. dict_table_t::find(): Find an old column based on a new column number. dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob. dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns(). dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns(). dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position. ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.) instant_alter_column_possible(): Add a parameter for InnoDB table, to check for additional conditions, such as the maximum number of index fields. ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns. innobase_build_col_map(): Skip added virtual columns. prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later. commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try(). innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). 
If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table. innodb_update_cols(): Renamed from innodb_update_n_cols(). innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error. innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col(). innobase_instant_add_col(): Replace the parameter dfield with type. innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL. innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols(). innobase_rename_column_try(): Skip dropped columns. commit_cache_norebuild(): Update table->fts->doc_col. dict_mem_table_col_rename_low(): Skip dropped columns. trx_undo_rec_get_partial_row(): Skip dropped columns. trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly. trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as delete-mark and an insert. row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata. row_undo_mod_clust_low(): On metadata rollback, roll back the root page too. row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records. 
row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback). row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column. row_upd_clust_rec_by_insert_inherit_func(): On the update of PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value. btr_index_rec_validate(): Recognize the metadata record. btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata. btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to generic ALTER TABLE format. btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it. btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains. btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways. btr_cur_trim(): Handle the mblob record. 
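The choice of dummy value for an instantly dropped column, as described for row_build_index_entry_low() above, might look like the following sketch. The DroppedCol structure and the dummy_value() helper are hypothetical names introduced for this example, and std::nullopt stands in for SQL NULL.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical summary of what is remembered about a dropped column.
struct DroppedCol {
    bool not_null;       // the column was declared NOT NULL
    unsigned fixed_len;  // fixed length in bytes, 0 for variable-length
};

// Returns std::nullopt for NULL, otherwise the bytes to store in the field.
std::optional<std::string> dummy_value(const DroppedCol& col) {
    if (!col.not_null) {
        return std::nullopt;  // nullable column: store NULL
    }
    // NOT NULL column: fixed_len NUL bytes; for a variable-length column
    // fixed_len is 0, which yields the empty string.
    return std::string(col.fixed_len, '\0');
}
```

The point of the three cases is to keep new records as small as the dropped column's type permits while still producing a record that parses under the old field layout.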
row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record. btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records. btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size. btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE. REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only. REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns. rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record. rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold. rec_is_add_metadata(): Check if a record is MDEV-11369 ADD COLUMN metadata record (not more generic instant ALTER TABLE). rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet. rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple. rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. 
With mblob=true, process a record with a metadata BLOB. rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field. rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record. row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple. row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted. row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column. Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this. Thanks to Matthias Leich for testing.
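The distinction between the two kinds of metadata record can be sketched as a test on the info and status bits of a leaf-page record. The numeric values below are assumptions chosen for illustration (the real definitions live in the InnoDB record headers), and classify_leaf_record() is a hypothetical helper, not one of the functions named above.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit values, for illustration only.
constexpr uint8_t REC_INFO_MIN_REC_FLAG = 0x10;
constexpr uint8_t REC_INFO_DELETED_FLAG = 0x20;
constexpr uint8_t REC_STATUS_COLUMNS_ADDED = 4;
constexpr uint8_t REC_STATUS_MASK = 0x07;

// MDEV-11369 metadata record: minimum-record flag plus COLUMNS_ADDED status.
constexpr uint8_t REC_INFO_DEFAULT_ROW_ADD =
    REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED;
// Generic ALTER TABLE metadata record: additionally delete-marked,
// which signals the presence of the metadata BLOB.
constexpr uint8_t REC_INFO_DEFAULT_ROW_ALTER =
    REC_INFO_DEFAULT_ROW_ADD | REC_INFO_DELETED_FLAG;

constexpr uint8_t BITS_MASK =
    REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG | REC_STATUS_MASK;

enum class MetadataKind { None, Add, Alter };

// Only meaningful for leaf-page records: on node pointer records the
// delete-mark flag may contain garbage.
constexpr MetadataKind classify_leaf_record(uint8_t bits) {
    return (bits & BITS_MASK) == REC_INFO_DEFAULT_ROW_ADD
        ? MetadataKind::Add
        : (bits & BITS_MASK) == REC_INFO_DEFAULT_ROW_ALTER
        ? MetadataKind::Alter
        : MetadataKind::None;
}
```

A delete-marked user record with the COLUMNS_ADDED status is classified as None here, because it lacks the minimum-record flag.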
7 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB

For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables.

This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation.

ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.

Some usability limitations will be addressed in subsequent work:

MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE

The format of the clustered index (PRIMARY KEY) is changed as follows:

(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX.

(2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked.

(3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the corresponding column values of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple_t::trim() implements it on insert.

(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.)

We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC.

This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test.
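Point (3) above, where a record may store fewer fields than the index defines and the missing suffix is read from the 'default row', can be illustrated with a simplified model. The Index structure, the use of std::string for field values, and the helper names trim_suffix() and field_value() are assumptions for this sketch; they are not the signatures of dtuple_t::trim() or btr_cur_trim().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Field = std::string;  // simplified field value (NULL is not modelled)

// Hypothetical model of a clustered index after instant ADD COLUMN.
struct Index {
    std::size_t n_core_fields;       // fields that existed before instant ADD
    std::vector<Field> default_row;  // one entry per field; core entries ignored
    std::size_t n_fields() const { return default_row.size(); }
};

// Drop trailing instantly added fields that equal the 'default row' value.
// The record never shrinks below n_core_fields.
void trim_suffix(const Index& index, std::vector<Field>& rec) {
    assert(rec.size() >= index.n_core_fields);
    assert(rec.size() <= index.n_fields());
    while (rec.size() > index.n_core_fields
           && rec.back() == index.default_row[rec.size() - 1]) {
        rec.pop_back();
    }
}

// Read field i of a possibly shortened record, falling back to the
// 'default row' for fields that are not physically stored.
const Field& field_value(const Index& index, const std::vector<Field>& rec,
                         std::size_t i) {
    assert(i < index.n_fields());
    return i < rec.size() ? rec[i] : index.default_row[i];
}
```

Only a suffix can be omitted: a field that matches the 'default row' but is followed by a non-default field must still be stored, because the record header encodes just a field count.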
The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). 
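The two magic length values and the len_is_stored() check mentioned above can be sketched as follows. The concrete numeric values are assumptions for illustration; only the idea that both constants are sentinels outside the range of real lengths comes from the text.

```cpp
#include <cassert>
#include <cstdint>

// Assumed sentinel values; the actual definitions may differ.
constexpr uint32_t UNIV_SQL_NULL = 0xFFFFFFFFu;           // field is SQL NULL
constexpr uint32_t UNIV_SQL_DEFAULT = UNIV_SQL_NULL - 1;  // field not stored;
                                                          // use 'default row'

// True if len is an actual byte length rather than one of the sentinels.
constexpr bool len_is_stored(uint32_t len) {
    return len != UNIV_SQL_NULL && len != UNIV_SQL_DEFAULT;
}

static_assert(len_is_stored(0), "an empty string has a stored length");
static_assert(!len_is_stored(UNIV_SQL_NULL), "NULL has no stored length");
static_assert(!len_is_stored(UNIV_SQL_DEFAULT), "default has no stored length");
```

Code that sums or copies field lengths would call such a check first, so that a sentinel is never treated as a byte count.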
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). 
innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. 
Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is written. btr_index_rec_validate(): Account for instant ADD COLUMN. row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). 
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them. rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields. btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'. row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t. rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'. REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. rec_comp_status_t: An enum of the status bit values. rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
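The repurposing of the 16-bit PAGE_DIRECTION field described in this commit message, with a 13-bit PAGE_INSTANT value sharing the word with the 3 direction bits, can be sketched with plain bit operations. The helper names below are hypothetical and operate on an in-memory 16-bit value rather than on a page frame, and placing PAGE_INSTANT in the upper 13 bits is an assumption that follows from the direction using the least significant 3 bits.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned DIRECTION_BITS = 3;
constexpr unsigned DIRECTION_MASK = (1U << DIRECTION_BITS) - 1;  // 0x7
constexpr unsigned INSTANT_MAX = (1U << 13) - 1;                 // 8191

// Extract the 13-bit 'core field count' from the shared 16-bit word.
inline uint16_t get_instant(uint16_t word) {
    return static_cast<uint16_t>(word >> DIRECTION_BITS);
}

// Extract the 3-bit direction from the shared 16-bit word.
inline uint16_t get_direction(uint16_t word) {
    return static_cast<uint16_t>(word & DIRECTION_MASK);
}

// Replace the 13-bit value while preserving the direction bits.
inline uint16_t set_instant(uint16_t word, unsigned n_core_fields) {
    assert(n_core_fields <= INSTANT_MAX);
    return static_cast<uint16_t>((n_core_fields << DIRECTION_BITS)
                                 | (word & DIRECTION_MASK));
}

// Replace the direction while preserving the 13-bit value.
inline uint16_t set_direction(uint16_t word, unsigned direction) {
    assert(direction <= DIRECTION_MASK);
    return static_cast<uint16_t>((word & ~DIRECTION_MASK) | direction);
}
```

Because each setter masks out only its own bits, updating the insert direction on a root page does not disturb the stored number of core fields, and vice versa.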
innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. 
Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written. btr_index_rec_validate(): Account for instant ADD COLUMN row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). 
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them. rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields. btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'. row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t. rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'. REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. rec_comp_status_t: An enum of the status bit values. rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
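The rule in item (3) of the message above, that a leaf record may store anywhere between n_core_fields and n_fields fields, can be illustrated with a toy model. The following is only a sketch under stated assumptions: ToyIndex, Fields, expand() and trim() are hypothetical names invented for this illustration, fields are plain strings, and none of this is the InnoDB implementation (which works on dtuple_t and physical records, and must also handle NULL values, latching and undo logging).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model (not InnoDB code): a clustered index record is a list of values.
using Fields = std::vector<std::string>;

struct ToyIndex {
    // Number of fields that existed before any instant ADD COLUMN.
    std::size_t n_core_fields;
    // One value per field of the current definition (n_fields entries).
    // The entries for the core fields are ignored, as in the message above.
    Fields default_row;
};

// Reading: expand a stored record to the full n_fields by taking the
// missing trailing fields from the 'default row'.
Fields expand(const ToyIndex& index, const Fields& stored) {
    assert(stored.size() >= index.n_core_fields);
    assert(stored.size() <= index.default_row.size());
    Fields full(stored);
    for (std::size_t i = stored.size(); i < index.default_row.size(); ++i) {
        full.push_back(index.default_row[i]);
    }
    return full;
}

// Writing: drop trailing instantly added fields that equal the
// 'default row' value, never going below n_core_fields.
Fields trim(const ToyIndex& index, Fields full) {
    assert(full.size() == index.default_row.size());
    while (full.size() > index.n_core_fields
           && full.back() == index.default_row[full.size() - 1]) {
        full.pop_back();
    }
    return full;
}
```

With n_core_fields = 2 and a 'default row' of {"", "", "x", "7"}, a stored record {"pk", "a"} expands to {"pk", "a", "x", "7"}, and trimming that full record gives back the two-field form. Trimming stops at the first trailing field that differs from the 'default row', so {"pk", "a", "y", "7"} is stored as {"pk", "a", "y"}.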
8 years ago
MDEV-14407 Assertion failure during rollback

Rollback attempted to dereference DB_ROLL_PTR=0, which cannot possibly be a valid undo log pointer. A safer canonical value would be roll_ptr_t(1) << ROLL_PTR_INSERT_FLAG_POS which is what was chosen in MDEV-12288, corresponding to reset_trx_id.

No deterministic test case for the bug was found. The simplest test cases may be related to MDEV-11415, which suppresses undo logging for ALGORITHM=COPY operations. In those operations, in the spirit of MDEV-12288, we should actually have written reset_trx_id instead of using the transaction identifier of the current transaction (and a bogus value of DB_ROLL_PTR=0). However, thanks to MySQL Bug#28432 which I had fixed in MySQL 5.6.8 as part of WL#6255, access to the rebuilt table by earlier-started transactions should actually have been refused with ER_TABLE_DEF_CHANGED.

reset_trx_id: Move the definition to data0type.cc and the declaration to data0type.h.

btr_cur_ins_lock_and_undo(): When undo logging is disabled, use the safe value that corresponds to reset_trx_id.

btr_cur_optimistic_insert(): Validate the DB_TRX_ID,DB_ROLL_PTR before inserting into a clustered index leaf page.

ins_node_t::sys_buf[]: Replaces row_id_buf and trx_id_buf and some heap usage.

row_ins_alloc_sys_fields(): Init ins_node_t::sys_buf[] to reset_trx_id.

row_ins_buf(): Only if undo logging is enabled, copy trx->id to node->sys_buf. Otherwise, rely on the initialization in row_ins_alloc_sys_fields().

row_purge_reset_trx_id(): Invoke mlog_write_string() with reset_trx_id directly. (No functional change.)

trx_undo_page_report_modify(): Assert that the DB_ROLL_PTR is not 0.

trx_undo_get_undo_rec_low(): Assert that the roll_ptr is valid before trying to dereference it.

dict_index_t::is_primary(): Check if the index is the primary key.

PageConverter::adjust_cluster_record(): Fix MDEV-15249 Crash in MVCC read after IMPORT TABLESPACE by resetting the system fields to reset_trx_id instead of writing the current transaction ID (which will be committed at the end of the IMPORT TABLESPACE) and DB_ROLL_PTR=0. This can partially be viewed as a follow-up fix of MDEV-12288, because IMPORT should already then have written DB_TRX_ID=0 and DB_ROLL_PTR=1<<55 to prevent unnecessary DB_TRX_ID lookups in subsequent accesses to the table.
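The canonical value described above (DB_TRX_ID=0 together with DB_ROLL_PTR=1<<55) can be written out byte by byte. The sketch below is an illustration, not the definition in data0type.cc: it assumes the usual 6-byte DB_TRX_ID followed by a 7-byte DB_ROLL_PTR stored most significant byte first, and SysFields, reset_roll_ptr() and make_reset_sys_fields() are hypothetical helper names.

```cpp
#include <cstddef>
#include <cstdint>

using roll_ptr_t = std::uint64_t;

// Lengths of the two system columns in a clustered index record
// (assumed here: 6 bytes for DB_TRX_ID, 7 bytes for DB_ROLL_PTR).
constexpr std::size_t DATA_TRX_ID_LEN = 6;
constexpr std::size_t DATA_ROLL_PTR_LEN = 7;

// Position of the "insert" flag, the top bit of the 56-bit DB_ROLL_PTR.
// The message above gives the resulting value as 1<<55.
constexpr unsigned ROLL_PTR_INSERT_FLAG_POS = 55;

// The safe canonical DB_ROLL_PTR: only the insert flag is set.
constexpr roll_ptr_t reset_roll_ptr() {
    return roll_ptr_t{1} << ROLL_PTR_INSERT_FLAG_POS;
}

// Hypothetical container for the 13 bytes DB_TRX_ID,DB_ROLL_PTR.
struct SysFields {
    std::uint8_t bytes[DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN];
};

// Build the byte image: DB_TRX_ID = 0, DB_ROLL_PTR = 1<<55 (big-endian).
constexpr SysFields make_reset_sys_fields() {
    SysFields f{};  // zero-initialized, so DB_TRX_ID is already 0
    const roll_ptr_t r = reset_roll_ptr();
    for (std::size_t i = 0; i < DATA_ROLL_PTR_LEN; ++i) {
        // Most significant of the 7 bytes comes first.
        f.bytes[DATA_TRX_ID_LEN + i] = static_cast<std::uint8_t>(
            r >> (8 * (DATA_ROLL_PTR_LEN - 1 - i)));
    }
    return f;
}
```

Under these assumptions the image is six zero bytes, then 0x80, then six more zero bytes. Unlike DB_ROLL_PTR=0, the value is nonzero, so an assertion of the kind added to trx_undo_page_report_modify() can tell it apart from an uninitialized pointer.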
8 years ago
MDEV-11623 MariaDB 10.1 fails to start datadir created with MariaDB 10.0/MySQL 5.6 using innodb-page-size!=16K

The storage format of FSP_SPACE_FLAGS was accidentally broken already in MariaDB 10.1.0. This fix is bringing the format in line with other MySQL and MariaDB release series.

Please refer to the comments that were added to fsp0fsp.h for details.

This is an INCOMPATIBLE CHANGE that affects users of page_compression and non-default innodb_page_size. Upgrading to this release will correct the flags in the data files. If you want to downgrade to earlier MariaDB 10.1.x, please refer to the test innodb.101_compatibility for how to reset the FSP_SPACE_FLAGS in the files.

NOTE: MariaDB 10.1.0 to 10.1.20 can misinterpret uncompressed data files with innodb_page_size=4k or 64k as compressed innodb_page_size=16k files, and then probably fail when trying to access the pages. See the comments in the function fsp_flags_convert_from_101() for detailed analysis.

Move PAGE_COMPRESSION to FSP_SPACE_FLAGS bit position 16. In this way, compressed innodb_page_size=16k tablespaces will not be mistaken for uncompressed ones by MariaDB 10.1.0 to 10.1.20.

Derive PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR from the dict_table_t::flags when the table is available, in fil_space_for_table_exists_in_mem() or fil_open_single_table_tablespace(). During crash recovery, fil_load_single_table_tablespace() will use innodb_compression_level for the PAGE_COMPRESSION_LEVEL.

FSP_FLAGS_MEM_MASK: A bitmap of the memory-only fil_space_t::flags that are not to be written to FSP_SPACE_FLAGS. Currently, these will include PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES and DATA_DIR.

Introduce the macro FSP_FLAGS_PAGE_SSIZE(). We only support one innodb_page_size for the whole instance.

When creating a dummy tablespace for the redo log, use fil_space_t::flags=0. The flags are never written to the redo log files.

Remove many FSP_FLAGS_SET_ macros.

dict_tf_verify_flags(): Remove. This is basically only duplicating the logic of dict_tf_to_fsp_flags(), used in a debug assertion.

fil_space_t::mark: Remove. This flag was not used for anything.

fil_space_for_table_exists_in_mem(): Remove the unnecessary parameter mark_space, and add a parameter for table flags. Check that fil_space_t::flags match the table flags, and adjust the (memory-only) flags based on the table flags.

fil_node_open_file(): Remove some redundant or unreachable conditions, do not use stderr for output, and avoid unnecessary server aborts.

fil_user_tablespace_restore_page(): Convert the flags, so that the correct page_size will be used when restoring a page from the doublewrite buffer.

fil_space_get_page_compressed(), fsp_flags_is_page_compressed(): Remove. It suffices to have fil_space_is_page_compressed().

FSP_FLAGS_WIDTH_DATA_DIR, FSP_FLAGS_WIDTH_PAGE_COMPRESSION_LEVEL, FSP_FLAGS_WIDTH_ATOMIC_WRITES: Remove, because these flags do not exist in the FSP_SPACE_FLAGS but only in memory.

fsp_flags_try_adjust(): New function, to adjust the FSP_SPACE_FLAGS in page 0. Called by fil_open_single_table_tablespace(), fil_space_for_table_exists_in_mem(), innobase_start_or_create_for_mysql() except if --innodb-read-only is active.

fsp_flags_is_valid(ulint): Reimplement from scratch, with accurate comments. Do not display any details of detected inconsistencies, because the output could be confusing when dealing with MariaDB 10.1.x data files.

fsp_flags_convert_from_101(ulint): Convert flags from buggy MariaDB 10.1.x format, or return ULINT_UNDEFINED if the flags cannot be in MariaDB 10.1.x format.

fsp_flags_match(): Check the flags when probing files. Implemented based on fsp_flags_is_valid() and fsp_flags_convert_from_101().

dict_check_tablespaces_and_store_max_id(): Do not access the page after committing the mini-transaction.

IMPORT TABLESPACE fixes:

AbstractCallback::init(): Convert the flags.

FetchIndexRootPages::operator(): Check that the tablespace flags match the table flags. Do not attempt to convert tablespace flags to table flags, because the conversion would necessarily be lossy.

PageConverter::update_header(): Write back the correct flags. This takes care of the flags in IMPORT TABLESPACE.
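The split between persistent FSP_SPACE_FLAGS bits and memory-only fil_space_t::flags bits amounts to masking before writing. The sketch below is an assumption-laden illustration: only bit position 16 for PAGE_COMPRESSION comes from the message above. The starting position of the memory-only bits (TOY_MEM_FLAGS_START = 25) and all identifiers are hypothetical; the comments in fsp0fsp.h give the real layout.

```cpp
#include <cstdint>

// From the message above: PAGE_COMPRESSION is bit 16 of FSP_SPACE_FLAGS.
constexpr unsigned POS_PAGE_COMPRESSION = 16;

// Assumption for this sketch only: the memory-only flags
// (PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES, DATA_DIR) occupy bit 25 and up.
constexpr unsigned TOY_MEM_FLAGS_START = 25;

// Bitmap of the flags that exist only in memory and are never written.
constexpr std::uint32_t TOY_MEM_MASK =
    ~std::uint32_t{0} << TOY_MEM_FLAGS_START;

// Test the persistent PAGE_COMPRESSION bit.
constexpr bool is_page_compressed(std::uint32_t flags) {
    return ((flags >> POS_PAGE_COMPRESSION) & 1U) != 0;
}

// Strip the memory-only bits before writing the flags to page 0.
constexpr std::uint32_t flags_for_disk(std::uint32_t mem_flags) {
    return mem_flags & ~TOY_MEM_MASK;
}

// Two in-memory flag words describe the same file format if they agree
// on everything except the memory-only bits.
constexpr bool same_persistent_flags(std::uint32_t a, std::uint32_t b) {
    return flags_for_disk(a) == flags_for_disk(b);
}
```

In this model a tablespace opened with a different compression level or DATA_DIR setting still compares equal on its persistent flags, which is the property that deriving those attributes from dict_table_t::flags relies on.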
9 years ago
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). 
innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. 
Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written. btr_index_rec_validate(): Account for instant ADD COLUMN row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). 
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them. rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields. btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'. row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t. rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'. REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. rec_comp_status_t: An enum of the status bit values. rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
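The trimming rule described above for btr_cur_trim() and dtuple_t::trim() can be illustrated with a minimal sketch. It is not the InnoDB implementation: the Fields type and the functions trimmed_field_count() and field_value() are hypothetical names, and fields are modelled as plain strings. The sketch assumes only what the message states: trailing fields that match the 'default row' need not be stored, the stored field count never drops below the number of 'core' fields, and missing fields are read back from the 'default row'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a clustered index tuple: one string per field.
using Fields = std::vector<std::string>;

// Return how many leading fields of `entry` have to be stored.
// Trailing fields that equal the 'default row' are omitted, but the
// count never drops below n_core_fields.
std::size_t trimmed_field_count(const Fields& entry,
                                const Fields& default_row,
                                std::size_t n_core_fields)
{
    assert(entry.size() == default_row.size());
    assert(n_core_fields <= entry.size());

    std::size_t n = entry.size();
    while (n > n_core_fields && entry[n - 1] == default_row[n - 1]) {
        --n;
    }
    return n;
}

// Read field i of a stored record that may hold fewer fields than the
// index definition; missing fields come from the 'default row'.
const std::string& field_value(const Fields& stored,
                               const Fields& default_row,
                               std::size_t i)
{
    assert(i < default_row.size());
    return i < stored.size() ? stored[i] : default_row[i];
}
```

With two 'core' fields and two instantly added columns, a tuple whose added columns both equal the 'default row' would be stored with two fields only, and reading the third or fourth field would fall back to the 'default row'.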
8 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller to detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer was always srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
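The idea of reserving a temporary buffer slot through atomic acquire() and release() accessors, without taking a mutex, might look roughly like the following sketch. The names TmpSlot and reserve_slot() are hypothetical, and the memory orderings are an assumption; the message only says that the accessors use atomic memory access.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical slot with a private 'reserved' flag, in the spirit of
// buf_tmp_buffer_t::reserved as described above.
class TmpSlot {
    std::atomic<bool> reserved{false};

public:
    // Try to reserve the slot; returns true only for the caller that
    // flipped the flag from false to true.
    bool acquire()
    {
        return !reserved.exchange(true, std::memory_order_acquire);
    }

    // Make the slot available again.
    void release()
    {
        reserved.store(false, std::memory_order_release);
    }
};

// Scan the slots and return the first one that could be reserved,
// or nullptr if all of them are in use. No mutex is involved.
template <std::size_t N>
TmpSlot* reserve_slot(std::array<TmpSlot, N>& slots)
{
    for (TmpSlot& slot : slots) {
        if (slot.acquire()) {
            return &slot;
        }
    }
    return nullptr;
}
```

Because exchange() is a single atomic read-modify-write, two threads racing for the same slot cannot both see it as free, which is what makes the mutex unnecessary in this simplified model.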
7 years ago
MDEV-13328 ALTER TABLE…DISCARD TABLESPACE takes a lot of time With a big buffer pool that contains many data pages, DISCARD TABLESPACE took a long time, because it would scan the entire buffer pool to remove any pages that belong to the tablespace. With a large buffer pool, this would take a lot of time, especially when the table-to-discard is empty. The minimum amount of work that DISCARD TABLESPACE must do is to remove the pages of the to-be-discarded table from the buf_pool->flush_list because any writes to the data file must be prevented before the file is deleted. If DISCARD TABLESPACE does not evict the pages from the buffer pool, then IMPORT TABLESPACE must do it, because we must prevent pre-DISCARD, not-yet-evicted pages from being mistaken for pages of the imported tablespace. It would not be a useful fix to simply move the buffer pool scan to the IMPORT TABLESPACE step. What we can do is to actively evict those pages that could be mistaken for imported pages. In this way, when importing a small table into a big buffer pool, the import should still run relatively fast. Import is bypassing the buffer pool when reading pages for the adjustment phase. In the adjustment phase, if a page exists in the buffer pool, we could replace it with the page from the imported file. Unfortunately I did not get this to work properly, so instead we will simply evict any matching page from the buffer pool. buf_page_get_gen(): Implement BUF_EVICT_IF_IN_POOL, a new mode where the requested page will be evicted if it is found. There must be no unwritten changes for the page. buf_remove_t: Remove. Instead, use trx!=NULL to signify that a write to file is desired, and use a separate parameter bool drop_ahi. buf_LRU_flush_or_remove_pages(), fil_delete_tablespace(): Replace buf_remove_t. buf_LRU_remove_pages(), buf_LRU_remove_all_pages(): Remove. PageConverter::m_mtr: A dummy mini-transaction buffer PageConverter::PageConverter(): Complete the member initialization list. 
PageConverter::operator()(): Evict any 'shadow' pages from the buffer pool so that pre-existing (garbage) pages cannot be mistaken for pages that exist in the being-imported file. row_discard_tablespace(): Remove a bogus comment that seems to refer to IMPORT TABLESPACE, not DISCARD TABLESPACE.
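As a rough illustration of the BUF_EVICT_IF_IN_POOL idea, the sketch below models a buffer pool as a hash map and adds a fetch mode that evicts the requested page when it is present. Everything here (ToyBufferPool, ToyPage, FetchMode) is a hypothetical toy model and not the InnoDB buffer pool; it only mirrors the two stated properties: the page is evicted if it is found, and it must not carry unwritten changes.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical fetch modes; EvictIfInPool mirrors BUF_EVICT_IF_IN_POOL.
enum class FetchMode { Get, EvictIfInPool };

struct ToyPage {
    bool dirty;  // true if the page has unwritten changes
};

class ToyBufferPool {
    std::unordered_map<std::uint64_t, ToyPage> pages_;

    // Combine tablespace ID and page number into one lookup key.
    static std::uint64_t key(std::uint32_t space, std::uint32_t page_no)
    {
        return (std::uint64_t{space} << 32) | page_no;
    }

public:
    void load(std::uint32_t space, std::uint32_t page_no, bool dirty = false)
    {
        pages_[key(space, page_no)] = ToyPage{dirty};
    }

    bool contains(std::uint32_t space, std::uint32_t page_no) const
    {
        return pages_.count(key(space, page_no)) != 0;
    }

    // In EvictIfInPool mode the page is dropped if present and nullptr
    // is returned; the caller reads the page from the file instead.
    ToyPage* get(std::uint32_t space, std::uint32_t page_no, FetchMode mode)
    {
        auto it = pages_.find(key(space, page_no));
        if (it == pages_.end()) {
            return nullptr;
        }
        if (mode == FetchMode::EvictIfInPool) {
            assert(!it->second.dirty);  // no unwritten changes allowed
            pages_.erase(it);
            return nullptr;
        }
        return &it->second;
    }
};
```

In this model an import would call get() in EvictIfInPool mode for each page it reads from the file, so the cost is proportional to the size of the imported file rather than to the size of the buffer pool.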
8 years ago
MDEV-12026: Implement innodb_checksum_algorithm=full_crc32 MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages. Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums. We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, no matter what the parameter innodb_checksum_algorithm is assigned to. For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32. ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them. The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page. 
In the full_crc32 format, there are no longer separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that are written to the file. We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644. This is joint work with Marko Mäkelä.
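A checksum over the entire page contents, excluding the trailing bytes where the checksum itself is stored, can be sketched as follows. This is an illustration only: the function names are hypothetical, the CRC-32C routine is a plain bitwise version rather than an optimized one, and storing the value as 4 big-endian bytes at the very end of the page is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
std::uint32_t crc32c(const unsigned char* data, std::size_t len)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            // Shift right; XOR in the polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Checksum of everything except the last 4 bytes of the page.
std::uint32_t page_full_crc32(const unsigned char* page, std::size_t size)
{
    return crc32c(page, size - 4);
}

// Store the checksum in the last 4 bytes (big-endian is assumed here).
void page_stamp(unsigned char* page, std::size_t size)
{
    const std::uint32_t crc = page_full_crc32(page, size);
    page[size - 4] = static_cast<unsigned char>(crc >> 24);
    page[size - 3] = static_cast<unsigned char>(crc >> 16);
    page[size - 2] = static_cast<unsigned char>(crc >> 8);
    page[size - 1] = static_cast<unsigned char>(crc);
}

// Recompute the checksum and compare it with the stored value.
bool page_verify(const unsigned char* page, std::size_t size)
{
    const std::uint32_t stored =
        (std::uint32_t{page[size - 4]} << 24) |
        (std::uint32_t{page[size - 3]} << 16) |
        (std::uint32_t{page[size - 2]} << 8) |
        std::uint32_t{page[size - 1]};
    return stored == page_full_crc32(page, size);
}
```

Because no byte other than the checksum field is skipped, a change anywhere else in the page makes page_verify() fail, and the check needs neither decryption nor decompression since it runs on the bytes as written to the file.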
7 years ago
9 years ago
MDEV-12253: Buffer pool blocks are accessed after they have been freed The problem was that bpage was referenced after it had already been freed from the LRU. Fixed by adding a new variable encrypted that is passed down to buf_page_check_corrupt() and used in buf_page_get_gen() to stop processing the page read. This patch should also address the following test failures and bugs: MDEV-12419: IMPORT should not look up tablespace in PageConverter::validate(). This is now removed. MDEV-10099: encryption.innodb_onlinealter_encryption fails sporadically in buildbot MDEV-11420: encryption.innodb_encryption-page-compression failed in buildbot MDEV-11222: encryption.encrypt_and_grep failed in buildbot on P8 Removed dict_table_t::is_encrypted and dict_table_t::ibd_file_missing and replaced these with dict_table_t::file_unreadable. Table ibd file is missing if fil_get_space(space_id) returns NULL and encrypted if not. Removed dict_table_t::is_corrupted field. Ported the FilSpace class from 10.2 and used it in buf_page_check_corrupt(), buf_page_decrypt_after_read(), buf_page_encrypt_before_write(), buf_dblwr_process(), buf_read_page(), dict_stats_save_defrag_stats(). Added test cases for when an encrypted page could be read while doing redo log crash recovery. Also added a test case for row compressed blobs. btr_cur_open_at_index_side_func(), btr_cur_open_at_rnd_pos_func(): Avoid referencing a block that is NULL. buf_page_get_zip(): Issue an error if the page read fails. buf_page_get_gen(): Use dberr_t for error detection and do not reference bpage after we have freed it. buf_mark_space_corrupt(): Remove bpage from the LRU also when it is encrypted. buf_page_check_corrupt(): @return DB_SUCCESS if page has been read and is not corrupted, DB_PAGE_CORRUPTED if page based on checksum check is corrupted, DB_DECRYPTION_FAILED if page post encryption checksum matches but after decryption normal page checksum does not match. In read case only DB_SUCCESS is possible. buf_page_io_complete(): Use dberr_t for error handling. 
buf_flush_write_block_low(), buf_read_ahead_random(), buf_read_page_async(), buf_read_ahead_linear(), buf_read_ibuf_merge_pages(), buf_read_recv_pages(), fil_aio_wait(): Issue an error if the page read fails. btr_pcur_move_to_next_page(): Do not reference the page if it is NULL. Introduced dict_table_t::is_readable() and dict_index_t::is_readable() that will return true if the tablespace exists and the pages read from it are neither corrupted nor affected by a failed decryption. Removed buf_page_t::key_version. After page decryption the key version is not removed from the page frame. For unencrypted pages, the old key_version is removed at buf_page_encrypt_before_write(). dict_stats_update_transient_for_index(), dict_stats_update_transient(): Do not continue if table decryption failed or the table is corrupted. dict0stats.cc: Introduced a dict_stats_report_error function to avoid code duplication. fil_parse_write_crypt_data(): Check that the key read from the redo log entry is found from the encryption plugin and if it is not, refuse to start. PageConverter::validate(): Removed access to fil_space_t as the tablespace is not available during import. Fixed error code on innodb.innodb test. Merged test cases innodb-bad-key-change5 and innodb-bad-key-shutdown to innodb-bad-key-change2. Removed innodb-bad-key-change5 test. Decreased unnecessary complexity on some long-lasting tests. Removed fil_inc_pending_ops(), fil_decr_pending_ops(), fil_get_first_space(), fil_get_next_space(), fil_get_first_space_safe(), fil_get_next_space_safe() functions. fil_space_verify_crypt_checksum(): Fixed a bug found using ASAN where the FIL_PAGE_END_LSN_OLD_CHECKSUM field was incorrectly accessed from row compressed tables. Fixed an out of page frame bug for row compressed tables in fil_space_verify_crypt_checksum() found using ASAN. An incorrect function was called for compressed tables. Added new tests for discard, rename table and drop (we should allow them even when page decryption fails). Alter table rename is not allowed. 
Added a test for restart with innodb-force-recovery=1 when a page read on redo-recovery cannot be decrypted. Added a test for a corrupted table where both the page data and FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION are corrupted. Adjusted the test case innodb_bug14147491 so that it no longer expects a crash. Instead the table is just mostly not usable. fil0fil.h: fil_space_acquire_low is not a visible function and fil_space_acquire and fil_space_acquire_silent are inline functions. The FilSpace class uses fil_space_acquire_low directly. recv_apply_hashed_log_recs() does not return anything.
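The three outcomes listed for buf_page_check_corrupt() can be summarized in a small decision function. This is a simplified sketch and not the actual function: PageStatus, ToyTable, classify_page() and note_read_result() are hypothetical names, the checksum results are passed in as booleans, and the handling of unencrypted pages is an assumption. Only the mapping to the three error codes and the single file_unreadable flag follow the message.

```cpp
// Hypothetical mirror of the dberr_t values named above.
enum class PageStatus { Success, PageCorrupted, DecryptionFailed };

// Classify a page that was just read, given precomputed check results.
PageStatus classify_page(bool encrypted,
                         bool post_encryption_checksum_ok,
                         bool page_checksum_ok)
{
    if (!encrypted) {
        // Assumed: unencrypted pages only have the normal checksum.
        return page_checksum_ok ? PageStatus::Success
                                : PageStatus::PageCorrupted;
    }
    if (!post_encryption_checksum_ok) {
        // The bytes on disk are damaged.
        return PageStatus::PageCorrupted;
    }
    // The encrypted bytes are intact; a bad normal checksum after
    // decryption then points at a decryption problem (e.g. wrong key).
    return page_checksum_ok ? PageStatus::Success
                            : PageStatus::DecryptionFailed;
}

// Hypothetical table object with the single flag that replaces
// is_encrypted and ibd_file_missing.
struct ToyTable {
    bool file_unreadable = false;

    bool is_readable() const { return !file_unreadable; }
};

// Record a read result on the table; any failure makes it unreadable.
void note_read_result(ToyTable& table, PageStatus status)
{
    if (status != PageStatus::Success) {
        table.file_unreadable = true;
    }
}
```

Returning a status value instead of touching the page descriptor afterwards is what lets the caller stop processing a failed read without referencing a block that may already have been freed.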
9 years ago
MDEV-12266: Change dict_table_t::space to fil_space_t* InnoDB always keeps all tablespaces in the fil_system cache. The fil_system.LRU is only for closing file handles; the fil_space_t and fil_node_t for all data files will remain in main memory. Between startup to shutdown, they can only be created and removed by DDL statements. Therefore, we can let dict_table_t::space point directly to the fil_space_t. dict_table_t::space_id: A numeric tablespace ID for the corner cases where we do not have a tablespace. The most prominent examples are ALTER TABLE...DISCARD TABLESPACE or a missing or corrupted file. There are a few functional differences; most notably: (1) DROP TABLE will delete matching .ibd and .cfg files, even if they were not attached to the data dictionary. (2) Some error messages will report file names instead of numeric IDs. There still are many functions that use numeric tablespace IDs instead of fil_space_t*, and many functions could be converted to fil_space_t member functions. Also, Tablespace and Datafile should be merged with fil_space_t and fil_node_t. page_id_t and buf_page_get_gen() could use fil_space_t& instead of a numeric ID, and after moving to a single buffer pool (MDEV-15058), buf_pool_t::page_hash could be moved to fil_space_t::page_hash. FilSpace: Remove. Only few calls to fil_space_acquire() will remain, and gradually they should be removed. mtr_t::set_named_space_id(ulint): Renamed from set_named_space(), to prevent accidental calls to this slower function. Very few callers remain. fseg_create(), fsp_reserve_free_extents(): Take fil_space_t* as a parameter instead of a space_id. fil_space_t::rename(): Wrapper for fil_rename_tablespace_check(), fil_name_write_rename(), fil_rename_tablespace(). Mariabackup passes the parameter log=false; InnoDB passes log=true. dict_mem_table_create(): Take fil_space_t* instead of space_id as parameter. dict_process_sys_tables_rec_and_mtr_commit(): Replace the parameter 'status' with 'bool cached'. 
dict_get_and_save_data_dir_path(): Avoid copying the fil_node_t::name. fil_ibd_open(): Return the tablespace. fil_space_t::set_imported(): Replaces fil_space_set_imported(). truncate_t: Change many member function parameters to fil_space_t*, and remove page_size parameters. row_truncate_prepare(): Merge to its only caller. row_drop_table_from_cache(): Assert that the table is persistent. dict_create_sys_indexes_tuple(): Write SYS_INDEXES.SPACE=FIL_NULL if the tablespace has been discarded. row_import_update_discarded_flag(): Remove a constant parameter.
8 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB

For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables.

This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation.

ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.

Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE

The format of the clustered index (PRIMARY KEY) is changed as follows:

(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX.

(2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked.

(3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple_t::trim() implements it on insert.

(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.)

We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC.

This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test.

The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.

Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively.

When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index.

UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record.

len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.

dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.

dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value().

dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column.

dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name().

dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields.

dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).

dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page.

dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value().

dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN.

dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back.

dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant().

dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN.

dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN.

prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache.

dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted.

dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point).

dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns().

pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns().

row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns().

create_index_dict(): Replaces row_merge_create_index_graph().

innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs.

btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary.

dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables.

dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE.

dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant().

row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE.

commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.)

PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.

page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.

page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION.

page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.

page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION.

rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields.

rec_offs_make_valid(): Add the parameter 'leaf'.

rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present.

dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple().

rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields.

cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records.

trx_t::in_rollback: Make the field available in non-debug builds.

trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written.

btr_index_rec_validate(): Account for instant ADD COLUMN.

row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index.

row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case.

dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted.

btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed.

row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low().

rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them.

rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields.

btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'.

row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t.

rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'.

REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record.

rec_comp_status_t: An enum of the status bit values.

rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
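The trimming rule and the 'default row' fallback described in this message can be illustrated with a small sketch. It makes simplifying assumptions: field values are modelled as plain strings (SQL NULL and externally stored columns are ignored), and IndexMeta, trim() and field_value() are hypothetical names for illustration, not the InnoDB functions dtuple_t::trim() or btr_cur_trim().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for clustered index metadata.
struct IndexMeta {
  std::size_t n_core_fields;             // fields that existed before any instant ADD COLUMN
  std::vector<std::string> default_row;  // one value per field; size() plays the role of n_fields
};

// Drop trailing fields that equal the 'default row' value,
// but never go below the number of core fields.
inline void trim(std::vector<std::string>& rec, const IndexMeta& index) {
  assert(rec.size() <= index.default_row.size());
  assert(rec.size() >= index.n_core_fields);
  while (rec.size() > index.n_core_fields &&
         rec.back() == index.default_row[rec.size() - 1]) {
    rec.pop_back();
  }
}

// Read field i: use the stored value if present, else the 'default row' value.
inline const std::string& field_value(const std::vector<std::string>& rec,
                                      const IndexMeta& index, std::size_t i) {
  assert(i < index.default_row.size());
  return i < rec.size() ? rec[i] : index.default_row[i];
}
```

In this model a record written before the ALTER TABLE and a trimmed record written afterwards look the same, so readers need only the one fallback path.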
8 years ago
MDEV-15662 Instant DROP COLUMN or changing the order of columns

Allow ADD COLUMN anywhere in a table, not only adding as the last column.

Allow instant DROP COLUMN and instant changing the order of columns.

The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible.

Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of mblob is indicated by setting the delete-mark flag in the metadata record.

The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position.

Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields.

For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob.

If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from the wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags.

Limitations (future work):
MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
MDEV-17468 Avoid table rebuild on operations on generated columns
MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large

btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure.

dict_col_t::DROPPED: Magic value for dict_col_t::ind.

dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN".

dict_col_t::is_added(): Rename from dict_col_t::is_instant().

dtype_t::metadata_blob_init(): Initialize the mblob data type.

dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record.

dict_table_t::instant: Metadata about dropped or reordered columns.

dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called.

dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well.

dict_table_t::find(): Find an old column based on a new column number.

dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob.

dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns().

dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns().

dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position.

ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.)

instant_alter_column_possible(): Add a parameter for InnoDB table, to check for additional conditions, such as the maximum number of index fields.

ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns.

innobase_build_col_map(): Skip added virtual columns.

prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later.

commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try().

innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table.

innodb_update_cols(): Renamed from innodb_update_n_cols().

innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error.

innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col().

innobase_instant_add_col(): Replace the parameter dfield with type.

innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL.

innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols().

innobase_rename_column_try(): Skip dropped columns.

commit_cache_norebuild(): Update table->fts->doc_col.

dict_mem_table_col_rename_low(): Skip dropped columns.

trx_undo_rec_get_partial_row(): Skip dropped columns.

trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly.

trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as delete-mark and an insert.

row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata.

row_undo_mod_clust_low(): On metadata rollback, roll back the root page too.

row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records.

row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback).

row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column.

row_upd_clust_rec_by_insert_inherit_func(): On the update of PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value.

btr_index_rec_validate(): Recognize the metadata record.

btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata.

btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to generic ALTER TABLE format.

btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it.

btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains.

btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways.

btr_cur_trim(): Handle the mblob record.

row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record.

btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records.

btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size.

btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE.

REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only.

REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns.

rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record.

rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold.

rec_is_add_metadata(): Check if a record is the MDEV-11369 ADD COLUMN metadata record (not the more generic instant ALTER TABLE one).

rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet.

rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple.

rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. With mblob=true, process a record with a metadata BLOB.

rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field.

rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record.

row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple.

row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted.

row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column.

Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this.

Thanks to Matthias Leich for testing.
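The choice of a dummy value for an instantly dropped column (NULL, the empty string, or a fixed length of NUL bytes) could be sketched as below. This is an assumption-laden illustration: DroppedColumn and dummy_value() are hypothetical names, and the precedence shown (NULL whenever the column was nullable, otherwise NUL padding for fixed-length columns, otherwise empty) is inferred from "stored as NULL or empty when possible" rather than taken from row_build_index_entry_low().

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical subset of what the metadata BLOB keeps for a dropped column.
struct DroppedColumn {
  bool not_null;          // the NOT NULL flag
  std::size_t fixed_len;  // fixed length in bytes; 0 for variable-length columns
};

// Returns std::nullopt to represent SQL NULL.
inline std::optional<std::string> dummy_value(const DroppedColumn& col) {
  if (!col.not_null) {
    return std::nullopt;  // nullable: store NULL, the cheapest encoding
  }
  // NOT NULL: fixed-length columns need fixed_len NUL bytes;
  // variable-length columns (fixed_len == 0) get the empty string.
  return std::string(col.fixed_len, '\0');
}
```

Keeping the NOT NULL flag and the fixed length for dropped columns is what allows such a placeholder to be built without the original column definition.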
7 years ago
Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID, DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written. btr_index_rec_validate(): Account for instant ADD COLUMN. row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). 
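The trimming rule described for dtuple_t::trim() can be sketched outside InnoDB as follows. This is a minimal illustration, not the server code: the Field alias, the function name trimmed_field_count() and the use of std::optional to model SQL NULL are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a column value; std::nullopt models SQL NULL.
using Field = std::optional<std::string>;

// Return how many leading fields of `tuple` need to be stored.
// Trailing fields that equal the corresponding 'default row' value are
// omitted, but the count never drops below n_core_fields.
std::size_t trimmed_field_count(const std::vector<Field>& tuple,
                                const std::vector<Field>& default_row,
                                std::size_t n_core_fields)
{
  // The 'default row' must describe every field of the tuple.
  assert(default_row.size() >= tuple.size());

  std::size_t n = tuple.size();
  while (n > n_core_fields && tuple[n - 1] == default_row[n - 1]) {
    --n;  // this trailing field can be reconstructed from the 'default row'
  }
  return n;
}
```

Only a suffix is trimmed: a field that matches its default but is followed by a non-matching field must still be stored, because the record header only encodes a field count.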
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them. rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields. btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'. row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t. rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'. REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. rec_comp_status_t: An enum of the status bit values. rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
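The PAGE_INSTANT description above (a 13-bit value sharing a 16-bit field whose 3 least significant bits hold PAGE_DIRECTION) can be illustrated with plain bit arithmetic. This is a sketch under that stated layout only; the function names are hypothetical and operate on an in-memory uint16_t rather than on a page frame.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout of the 16-bit field: [13 bits n_core_fields][3 bits direction].
constexpr unsigned DIRECTION_BITS = 3;
constexpr uint16_t DIRECTION_MASK = (1u << DIRECTION_BITS) - 1;

// Read the 13-bit 'instant' value (upper bits of the field).
inline uint16_t get_instant(uint16_t field)
{
  return static_cast<uint16_t>(field >> DIRECTION_BITS);
}

// Read the 3-bit direction (lower bits of the field).
inline uint16_t get_direction(uint16_t field)
{
  return static_cast<uint16_t>(field & DIRECTION_MASK);
}

// Store n_core_fields without disturbing the direction bits.
inline uint16_t set_instant(uint16_t field, uint16_t n_core_fields)
{
  assert(n_core_fields < (1u << 13));
  return static_cast<uint16_t>((n_core_fields << DIRECTION_BITS)
                               | (field & DIRECTION_MASK));
}

// Store the direction without disturbing the 'instant' bits.
inline uint16_t set_direction(uint16_t field, uint16_t direction)
{
  assert(direction <= DIRECTION_MASK);
  return static_cast<uint16_t>((field & ~DIRECTION_MASK) | direction);
}
```

Keeping the two setters independent mirrors the split between the PAGE_INSTANT accessors and the PAGE_DIRECTION accessors listed in the message.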
8 years ago
9 years ago
9 years ago
9 years ago
9 years ago
MDEV-12873 InnoDB SYS_TABLES.TYPE incompatibility for PAGE_COMPRESSED=YES in MariaDB 10.2.2 to 10.2.6 Remove the SHARED_SPACE flag that was erroneously introduced in MariaDB 10.2.2, and shift the SYS_TABLES.TYPE flags back to where they were before MariaDB 10.2.2. While doing this, ensure that tables created with affected MariaDB versions can be loaded, and also ensure that tables created with MySQL 5.7 using the TABLESPACE attribute cannot be loaded. MariaDB 10.2.2 picked the SHARED_SPACE flag from MySQL 5.7, shifting the MariaDB 10.1 flags PAGE_COMPRESSION, PAGE_COMPRESSION_LEVEL, ATOMIC_WRITES by one bit. The SHARED_SPACE flag would always be written as 0 by MariaDB, because MariaDB does not support CREATE TABLESPACE or CREATE TABLE...TABLESPACE for InnoDB. So, instead of the bits AALLLLCxxxxxxx we would have AALLLLC0xxxxxxx if the table was created with MariaDB 10.2.2 to 10.2.6. (AA=ATOMIC_WRITES, LLLL=PAGE_COMPRESSION_LEVEL, C=PAGE_COMPRESSED, xxxxxxx=7 bits that were not moved.) PAGE_COMPRESSED=NO implies LLLLC=00000. That is not a problem. If someone created a table in MariaDB 10.2.2 or 10.2.3 with the attribute ATOMIC_WRITES=OFF (value 2; AA=10) and without PAGE_COMPRESSED=YES or PAGE_COMPRESSION_LEVEL, the table should be rejected. We ignore this problem, because it should be unlikely for anyone to specify ATOMIC_WRITES=OFF, and because 10.2.2 and 10.2.3 were not mature releases. The value ATOMIC_WRITES=ON (1) would be interpreted as ATOMIC_WRITES=OFF, but starting with MariaDB 10.2.4 the ATOMIC_WRITES attribute is ignored. PAGE_COMPRESSED=YES implies that PAGE_COMPRESSION_LEVEL be between 1 and 9 and that ROW_FORMAT be COMPACT or DYNAMIC. Thus, the affected wrong bit pattern in SYS_TABLES.TYPE is of the form AALLLL10DB00001 where D signals the presence of a DATA DIRECTORY attribute and B is 1 for ROW_FORMAT=DYNAMIC and 0 for ROW_FORMAT=COMPACT. We must interpret this bit pattern as AALLLL1DB00001 (discarding the extraneous 0 bit). 
dict_sys_tables_rec_read(): Adjust the affected bit pattern when reading the SYS_TABLES.TYPE column. In case of invalid flags, report both SYS_TABLES.TYPE (after possible adjustment) and SYS_TABLES.MIX_LEN. dict_load_table_one(): Replace an unreachable condition on !dict_tf2_is_valid() with a debug assertion. The flags will already have been validated by dict_sys_tables_rec_read(); if that validation fails, dict_load_table_low() will have failed. fil_ibd_create(): Shorten an error message about a file pre-existing. Datafile::validate_to_dd(): Clarify an error message about tablespace flags mismatch. ha_innobase::open(): Remove an unnecessary warning message. dict_tf_is_valid(): Simplify the logic and make it stricter. Validate the values of PAGE_COMPRESSION. Remove error log output; let the callers handle that. DICT_TF_BITS: Remove ATOMIC_WRITES, PAGE_ENCRYPTION, PAGE_ENCRYPTION_KEY. The ATOMIC_WRITES is ignored once the SYS_TABLES.TYPE has been validated; there is no need to store it in dict_table_t::flags. The PAGE_ENCRYPTION and PAGE_ENCRYPTION_KEY are unused since MariaDB 10.1.4 (the GA release was 10.1.8). DICT_TF_BIT_MASK: Remove (unused). FSP_FLAGS_MEM_ATOMIC_WRITES: Remove (the flags are never read). row_import_read_v1(): Display an error if dict_tf_is_valid() fails.
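The reinterpretation of AALLLL10DB00001 as AALLLL1DB00001 can be sketched as a pure function on the integer value. The function name adjust_sys_tables_type() and the exact matching conditions are assumptions derived only from the bit patterns quoted in the message; the checks that dict_sys_tables_rec_read() actually performs may differ.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: if `type` matches the affected pattern AALLLL10DB00001,
// drop the extraneous 0 bit (bit 7) and return AALLLL1DB00001.
// Any other value is returned unchanged.
inline uint32_t adjust_sys_tables_type(uint32_t type)
{
  // Bits 0..4 must read 00001, bit 7 (the stray SHARED_SPACE bit) must be 0,
  // and bit 8 (the misplaced PAGE_COMPRESSED bit) must be 1.
  constexpr uint32_t pattern_mask = 0b110011111;
  constexpr uint32_t pattern = 0b100000001;
  if ((type & pattern_mask) != pattern || (type >> 15) != 0) {
    return type;
  }

  // In the shifted layout, PAGE_COMPRESSION_LEVEL occupies bits 9..12;
  // PAGE_COMPRESSED=YES implies a level between 1 and 9.
  const uint32_t level = (type >> 9) & 0xf;
  if (level < 1 || level > 9) {
    return type;
  }

  // Keep the 7 low bits that never moved; shift everything above down by one.
  constexpr uint32_t low7 = 0x7f;
  const uint32_t fixed = (type & low7) | ((type >> 1) & ~low7);
  assert(fixed & (1u << 7));  // PAGE_COMPRESSED now sits at bit 7
  return fixed;
}
```

For example, with AA=01, LLLL=0110, D=0 and B=1, the affected value 0b010110100100001 would be read as 0b01011010100001.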
8 years ago
9 years ago
9 years ago
MDEV-12266: Change dict_table_t::space to fil_space_t* InnoDB always keeps all tablespaces in the fil_system cache. The fil_system.LRU is only for closing file handles; the fil_space_t and fil_node_t for all data files will remain in main memory. Between startup and shutdown, they can only be created and removed by DDL statements. Therefore, we can let dict_table_t::space point directly to the fil_space_t. dict_table_t::space_id: A numeric tablespace ID for the corner cases where we do not have a tablespace. The most prominent examples are ALTER TABLE...DISCARD TABLESPACE or a missing or corrupted file. There are a few functional differences; most notably: (1) DROP TABLE will delete matching .ibd and .cfg files, even if they were not attached to the data dictionary. (2) Some error messages will report file names instead of numeric IDs. There still are many functions that use numeric tablespace IDs instead of fil_space_t*, and many functions could be converted to fil_space_t member functions. Also, Tablespace and Datafile should be merged with fil_space_t and fil_node_t. page_id_t and buf_page_get_gen() could use fil_space_t& instead of a numeric ID, and after moving to a single buffer pool (MDEV-15058), buf_pool_t::page_hash could be moved to fil_space_t::page_hash. FilSpace: Remove. Only a few calls to fil_space_acquire() will remain, and gradually they should be removed. mtr_t::set_named_space_id(ulint): Renamed from set_named_space(), to prevent accidental calls to this slower function. Very few callers remain. fseg_create(), fsp_reserve_free_extents(): Take fil_space_t* as a parameter instead of a space_id. fil_space_t::rename(): Wrapper for fil_rename_tablespace_check(), fil_name_write_rename(), fil_rename_tablespace(). Mariabackup passes the parameter log=false; InnoDB passes log=true. dict_mem_table_create(): Take fil_space_t* instead of space_id as parameter. dict_process_sys_tables_rec_and_mtr_commit(): Replace the parameter 'status' with 'bool cached'. 
dict_get_and_save_data_dir_path(): Avoid copying the fil_node_t::name. fil_ibd_open(): Return the tablespace. fil_space_t::set_imported(): Replaces fil_space_set_imported(). truncate_t: Change many member function parameters to fil_space_t*, and remove page_size parameters. row_truncate_prepare(): Merge to its only caller. row_drop_table_from_cache(): Assert that the table is persistent. dict_create_sys_indexes_tuple(): Write SYS_INDEXES.SPACE=FIL_NULL if the tablespace has been discarded. row_import_update_discarded_flag(): Remove a constant parameter.
8 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller to detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
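The idea of reserving a temporary slot through atomic acquire() and release() accessors, without taking a mutex, can be sketched with std::atomic. The class TmpSlot and the helper reserve_slot() are hypothetical names for this illustration; the server uses its own atomic primitives and data structures.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a temporary buffer slot with a private flag.
class TmpSlot {
  std::atomic<bool> reserved_{false};

public:
  // Try to reserve the slot; returns true only for the caller that
  // flipped the flag from false to true.
  bool acquire()
  {
    return !reserved_.exchange(true, std::memory_order_acquire);
  }

  // Give the slot back; it must have been reserved.
  void release()
  {
    const bool was_reserved =
        reserved_.exchange(false, std::memory_order_release);
    assert(was_reserved);
    (void) was_reserved;
  }
};

// Scan a fixed array for a free slot without any mutex.
// Returns nullptr if every slot is currently reserved.
template <std::size_t N>
TmpSlot* reserve_slot(TmpSlot (&slots)[N])
{
  for (TmpSlot& slot : slots) {
    if (slot.acquire()) {
      return &slot;
    }
  }
  return nullptr;
}
```

Because the exchange is a single atomic read-modify-write, two threads racing for the same slot cannot both see it as free, which is what makes the mutex unnecessary in this sketch.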
7 years ago
MDEV-18644: Support full_crc32 for page_compressed

This is a follow-up task to MDEV-12026, which introduced innodb_checksum_algorithm=full_crc32 and a simpler page format. MDEV-12026 did not enable full_crc32 for page_compressed tables, which we will be doing now.

This is joint work with Thirunarayanan Balathandayuthapani.

For innodb_checksum_algorithm=full_crc32 we change the page_compressed format as follows:

FIL_PAGE_TYPE: The most significant bit will be set to indicate page_compressed format. The least significant bits will contain the compressed page size, rounded up to a multiple of 256 bytes.

The checksum will be stored in the last 4 bytes of the page (whether it is the full page or a page_compressed page whose size is determined by FIL_PAGE_TYPE), covering all preceding bytes of the page. If encryption is used, then the page will be encrypted between compression and computing the checksum. For page_compressed, FIL_PAGE_LSN will not be repeated at the end of the page.

FSP_SPACE_FLAGS (already implemented as part of MDEV-12026): We will store the innodb_compression_algorithm that may be used to compress pages. Previously, the choice of algorithm was written to each compressed data page separately, and one would be unable to know in advance which compression algorithm(s) are used.

fil_space_t::full_crc32_page_compressed_len(): Determine if the page_compressed algorithm of the tablespace needs to know the exact length of the compressed data. If yes, we will reserve and write an extra byte for this right before the checksum.

buf_page_is_compressed(): Determine if a page uses page_compressed (in any innodb_checksum_algorithm).

fil_page_decompress(): Pass also fil_space_t::flags so that the format can be determined.

buf_page_is_zeroes(): Check if a page is full of zero bytes.

buf_page_full_crc32_is_corrupted(): Renamed from buf_encrypted_full_crc32_page_is_corrupted(). For full_crc32, we always simply validate the checksum to the page contents, while the physical page size is explicitly specified by an unencrypted part of the page header.

buf_page_full_crc32_size(): Determine the size of a full_crc32 page.

buf_dblwr_check_page_lsn(): Make this a debug-only function, because it involves potentially costly lookups of fil_space_t.

create_table_info_t::check_table_options(), ha_innobase::check_if_supported_inplace_alter(): Do allow the creation of SPATIAL INDEX with full_crc32 also when page_compressed is used.

commit_cache_norebuild(): Preserve the compression algorithm when updating the page_compression_level.

dict_tf_to_fsp_flags(): Set the flags for page compression algorithm.

FIXME: Maybe there should be a table option page_compression_algorithm and a session variable to back it?
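The FIL_PAGE_TYPE encoding described above can be illustrated with a small sketch. It is not the implementation of buf_page_full_crc32_size(). The following are assumptions made for illustration: the field is a 16-bit big-endian value at byte offset 24, the low 15 bits hold the size in units of 256 bytes, and a marked page whose encoded size is not smaller than the logical page size is treated as uncompressed. All function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed layout, for illustration only.
constexpr std::size_t kFilPageTypeOffset = 24;    // assumed offset of FIL_PAGE_TYPE
constexpr unsigned kCompressedMarker = 1U << 15;  // most significant bit of the field

// Read a 16-bit big-endian value.
inline unsigned read_be16(const unsigned char* p) {
  return (static_cast<unsigned>(p[0]) << 8) | p[1];
}

// Build the FIL_PAGE_TYPE value for a page_compressed page: set the
// marker bit and store the size rounded up to a multiple of 256 bytes.
inline unsigned encode_compressed_page_type(std::size_t compressed_size) {
  const std::size_t units = (compressed_size + 255) >> 8;  // round up
  assert(units > 0 && units < 0x8000);
  return kCompressedMarker | static_cast<unsigned>(units);
}

// Determine the physical size of a page from its header alone.
// Sets *compressed to true only when the marker bit is set and the
// encoded size is smaller than the logical page size.
inline std::size_t physical_page_size(const unsigned char* page,
                                      std::size_t logical_size,
                                      bool* compressed) {
  *compressed = false;
  const unsigned type = read_be16(page + kFilPageTypeOffset);
  if (!(type & kCompressedMarker)) {
    return logical_size;  // ordinary page type
  }
  const std::size_t size =
      static_cast<std::size_t>(type & ~kCompressedMarker) << 8;
  if (size >= logical_size) {
    return logical_size;  // assumption: not treated as compressed here
  }
  *compressed = true;
  return size;
}
```

For example, a 1000-byte compressed payload would be recorded as 4 units of 256 bytes, so a reader knows from the unencrypted header that the checksum sits in the last 4 bytes of the first 1024 bytes.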
7 years ago
MDEV-12026: Implement innodb_checksum_algorithm=full_crc32

MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages.

Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums.

We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, no matter what the parameter innodb_checksum_algorithm is assigned to.

For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32.

ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them.

The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page.

In the full_crc32 format, there no longer are separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that is written to the file.

We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644.

This is joint work with Marko Mäkelä.
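The idea of a single checksum over the entire page, stored in the last 4 bytes, can be sketched as follows. This is an illustration under stated assumptions, not InnoDB code: the CRC-32C here is a plain bitwise software implementation (a production server would normally use a table-driven or hardware-assisted one), storing the value in big-endian byte order is an assumption, and the names crc32c(), stamp_checksum() and checksum_matches() are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
std::uint32_t crc32c(const unsigned char* data, std::size_t len) {
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++) {
      // Shift right; XOR in the polynomial if the dropped bit was 1.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Compute the checksum over all bytes except the last 4 and store it
// in those last 4 bytes (big-endian order is an assumption).
void stamp_checksum(unsigned char* page, std::size_t size) {
  assert(size > 4);
  const std::uint32_t c = crc32c(page, size - 4);
  page[size - 4] = static_cast<unsigned char>(c >> 24);
  page[size - 3] = static_cast<unsigned char>(c >> 16);
  page[size - 2] = static_cast<unsigned char>(c >> 8);
  page[size - 1] = static_cast<unsigned char>(c);
}

// Validate the stored checksum against the page contents. No
// decryption or decompression is needed, because the checksum covers
// the bytes exactly as they are written to the file.
bool checksum_matches(const unsigned char* page, std::size_t size) {
  assert(size > 4);
  const std::uint32_t stored =
      (static_cast<std::uint32_t>(page[size - 4]) << 24) |
      (static_cast<std::uint32_t>(page[size - 3]) << 16) |
      (static_cast<std::uint32_t>(page[size - 2]) << 8) |
      static_cast<std::uint32_t>(page[size - 1]);
  return stored == crc32c(page, size - 4);
}
```

Since no byte before the checksum field is skipped, flipping any single bit of the page makes checksum_matches() return false.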
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-18644: Support full_crc32 for page_compressed This is a follow-up task to MDEV-12026, which introduced innodb_checksum_algorithm=full_crc32 and a simpler page format. MDEV-12026 did not enable full_crc32 for page_compressed tables, which we will be doing now. This is joint work with Thirunarayanan Balathandayuthapani. For innodb_checksum_algorithm=full_crc32 we change the page_compressed format as follows: FIL_PAGE_TYPE: The most significant bit will be set to indicate page_compressed format. The least significant bits will contain the compressed page size, rounded up to a multiple of 256 bytes. The checksum will be stored in the last 4 bytes of the page (whether it is the full page or a page_compressed page whose size is determined by FIL_PAGE_TYPE), covering all preceding bytes of the page. If encryption is used, then the page will be encrypted between compression and computing the checksum. For page_compressed, FIL_PAGE_LSN will not be repeated at the end of the page. FSP_SPACE_FLAGS (already implemented as part of MDEV-12026): We will store the innodb_compression_algorithm that may be used to compress pages. Previously, the choice of algorithm was written to each compressed data page separately, and one would be unable to know in advance which compression algorithm(s) are used. fil_space_t::full_crc32_page_compressed_len(): Determine if the page_compressed algorithm of the tablespace needs to know the exact length of the compressed data. If yes, we will reserve and write an extra byte for this right before the checksum. buf_page_is_compressed(): Determine if a page uses page_compressed (in any innodb_checksum_algorithm). fil_page_decompress(): Pass also fil_space_t::flags so that the format can be determined. buf_page_is_zeroes(): Check if a page is full of zero bytes. buf_page_full_crc32_is_corrupted(): Renamed from buf_encrypted_full_crc32_page_is_corrupted(). 
For full_crc32, we always simply validate the checksum to the page contents, while the physical page size is explicitly specified by an unencrypted part of the page header. buf_page_full_crc32_size(): Determine the size of a full_crc32 page. buf_dblwr_check_page_lsn(): Make this a debug-only function, because it involves potentially costly lookups of fil_space_t. create_table_info_t::check_table_options(), ha_innobase::check_if_supported_inplace_alter(): Do allow the creation of SPATIAL INDEX with full_crc32 also when page_compressed is used. commit_cache_norebuild(): Preserve the compression algorithm when updating the page_compression_level. dict_tf_to_fsp_flags(): Set the flags for page compression algorithm. FIXME: Maybe there should be a table option page_compression_algorithm and a session variable to back it?
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-18644: Support full_crc32 for page_compressed This is a follow-up task to MDEV-12026, which introduced innodb_checksum_algorithm=full_crc32 and a simpler page format. MDEV-12026 did not enable full_crc32 for page_compressed tables, which we will be doing now. This is joint work with Thirunarayanan Balathandayuthapani. For innodb_checksum_algorithm=full_crc32 we change the page_compressed format as follows: FIL_PAGE_TYPE: The most significant bit will be set to indicate page_compressed format. The least significant bits will contain the compressed page size, rounded up to a multiple of 256 bytes. The checksum will be stored in the last 4 bytes of the page (whether it is the full page or a page_compressed page whose size is determined by FIL_PAGE_TYPE), covering all preceding bytes of the page. If encryption is used, then the page will be encrypted between compression and computing the checksum. For page_compressed, FIL_PAGE_LSN will not be repeated at the end of the page. FSP_SPACE_FLAGS (already implemented as part of MDEV-12026): We will store the innodb_compression_algorithm that may be used to compress pages. Previously, the choice of algorithm was written to each compressed data page separately, and one would be unable to know in advance which compression algorithm(s) are used. fil_space_t::full_crc32_page_compressed_len(): Determine if the page_compressed algorithm of the tablespace needs to know the exact length of the compressed data. If yes, we will reserve and write an extra byte for this right before the checksum. buf_page_is_compressed(): Determine if a page uses page_compressed (in any innodb_checksum_algorithm). fil_page_decompress(): Pass also fil_space_t::flags so that the format can be determined. buf_page_is_zeroes(): Check if a page is full of zero bytes. buf_page_full_crc32_is_corrupted(): Renamed from buf_encrypted_full_crc32_page_is_corrupted(). 
For full_crc32, we always simply validate the checksum to the page contents, while the physical page size is explicitly specified by an unencrypted part of the page header. buf_page_full_crc32_size(): Determine the size of a full_crc32 page. buf_dblwr_check_page_lsn(): Make this a debug-only function, because it involves potentially costly lookups of fil_space_t. create_table_info_t::check_table_options(), ha_innobase::check_if_supported_inplace_alter(): Do allow the creation of SPATIAL INDEX with full_crc32 also when page_compressed is used. commit_cache_norebuild(): Preserve the compression algorithm when updating the page_compression_level. dict_tf_to_fsp_flags(): Set the flags for page compression algorithm. FIXME: Maybe there should be a table option page_compression_algorithm and a session variable to back it?
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-12026: Implement innodb_checksum_algorithm=full_crc32 MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages. Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums. We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, no matter what the parameter innodb_checksum_algorithm is assigned to. For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32. ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them. The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page. 
In the full_crc32 format, there no longer are separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that is written to the file. We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644. This is joint work with Marko Mäkelä.
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-12026: Implement innodb_checksum_algorithm=full_crc32 MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages. Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums. We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, no matter what the parameter innodb_checksum_algorithm is assigned to. For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32. ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them. The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page. 
In the full_crc32 format, there no longer are separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that is written to the file. We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644. This is joint work with Marko Mäkelä.
7 years ago
MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.
7 years ago
MDEV-12026: Implement innodb_checksum_algorithm=full_crc32

MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages.

Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums.

We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, regardless of the value of the parameter innodb_checksum_algorithm.

For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32.

ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them.

The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that the checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page.

In the full_crc32 format, there no longer are separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that are written to the file.

We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644.

This is joint work with Marko Mäkelä.
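The checksum layout described in this message (a CRC-32C over the whole page except the trailing checksum bytes) can be illustrated with a small sketch. It assumes a 4-byte checksum stored big-endian at the end of the page; the byte order and the names crc32c(), page_write_checksum() and page_checksum_ok() are assumptions of the sketch, not the InnoDB functions. The bitwise CRC loop is chosen for brevity; a production implementation would normally use lookup tables or hardware instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
static uint32_t crc32c(const unsigned char* data, size_t len)
{
  uint32_t crc = 0xFFFFFFFFU;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
  }
  return ~crc;
}

// Assumed size of the checksum field at the very end of the page.
static const size_t CHECKSUM_SIZE = 4;

// Compute the checksum over all bytes that precede the checksum field
// and store it (big-endian, by assumption) in the last 4 bytes.
static void page_write_checksum(unsigned char* page, size_t size)
{
  const uint32_t c = crc32c(page, size - CHECKSUM_SIZE);
  unsigned char* p = page + size - CHECKSUM_SIZE;
  p[0] = static_cast<unsigned char>(c >> 24);
  p[1] = static_cast<unsigned char>(c >> 16);
  p[2] = static_cast<unsigned char>(c >> 8);
  p[3] = static_cast<unsigned char>(c);
}

// Recompute the checksum and compare it with the stored value.
// No decryption or decompression is needed for this comparison.
static bool page_checksum_ok(const unsigned char* page, size_t size)
{
  const unsigned char* p = page + size - CHECKSUM_SIZE;
  const uint32_t stored = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16)
    | (uint32_t(p[2]) << 8) | uint32_t(p[3]);
  return stored == crc32c(page, size - CHECKSUM_SIZE);
}
```

Since every byte outside the checksum field is covered, a single flipped bit anywhere else on the page makes page_checksum_ok() return false, unlike the older algorithms that skipped some bytes.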
7 years ago
MDEV-18644: Support full_crc32 for page_compressed

This is a follow-up task to MDEV-12026, which introduced innodb_checksum_algorithm=full_crc32 and a simpler page format. MDEV-12026 did not enable full_crc32 for page_compressed tables, which we will be doing now.

This is joint work with Thirunarayanan Balathandayuthapani.

For innodb_checksum_algorithm=full_crc32 we change the page_compressed format as follows:

FIL_PAGE_TYPE: The most significant bit will be set to indicate page_compressed format. The least significant bits will contain the compressed page size, rounded up to a multiple of 256 bytes.

The checksum will be stored in the last 4 bytes of the page (whether it is the full page or a page_compressed page whose size is determined by FIL_PAGE_TYPE), covering all preceding bytes of the page. If encryption is used, then the page will be encrypted between compression and computing the checksum. For page_compressed, FIL_PAGE_LSN will not be repeated at the end of the page.

FSP_SPACE_FLAGS (already implemented as part of MDEV-12026): We will store the innodb_compression_algorithm that may be used to compress pages. Previously, the choice of algorithm was written to each compressed data page separately, and one would be unable to know in advance which compression algorithm(s) are used.

fil_space_t::full_crc32_page_compressed_len(): Determine if the page_compressed algorithm of the tablespace needs to know the exact length of the compressed data. If yes, we will reserve and write an extra byte for this right before the checksum.

buf_page_is_compressed(): Determine if a page uses page_compressed (in any innodb_checksum_algorithm).

fil_page_decompress(): Also pass fil_space_t::flags so that the format can be determined.

buf_page_is_zeroes(): Check if a page is full of zero bytes.

buf_page_full_crc32_is_corrupted(): Renamed from buf_encrypted_full_crc32_page_is_corrupted(). For full_crc32, we always simply validate the checksum against the page contents, while the physical page size is explicitly specified by an unencrypted part of the page header.

buf_page_full_crc32_size(): Determine the size of a full_crc32 page.

buf_dblwr_check_page_lsn(): Make this a debug-only function, because it involves potentially costly lookups of fil_space_t.

create_table_info_t::check_table_options(), ha_innobase::check_if_supported_inplace_alter(): Do allow the creation of SPATIAL INDEX with full_crc32 also when page_compressed is used.

commit_cache_norebuild(): Preserve the compression algorithm when updating the page_compression_level.

dict_tf_to_fsp_flags(): Set the flags for page compression algorithm.

FIXME: Maybe there should be a table option page_compression_algorithm and a session variable to back it?
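The FIL_PAGE_TYPE encoding described in this message can be sketched as follows. The sketch assumes that the field is 16 bits wide and that the low 15 bits hold the size in units of 256 bytes; the constant and function names are hypothetical, chosen for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical name for the most significant bit of a 16-bit page type.
static const uint16_t PAGE_COMPRESSED_MARKER = 0x8000;

// Encode a compressed size in bytes: round up to a multiple of 256,
// store the number of 256-byte units in the low bits, set the marker.
static uint16_t encode_page_type(uint32_t compressed_size)
{
  const uint32_t units = (compressed_size + 255) / 256;
  return static_cast<uint16_t>(PAGE_COMPRESSED_MARKER | units);
}

// A page is page_compressed if the marker bit is set.
static bool is_page_compressed(uint16_t page_type)
{
  return (page_type & PAGE_COMPRESSED_MARKER) != 0;
}

// Physical size of the page image in the file: the encoded size for a
// page_compressed page, or the full page size otherwise.
static uint32_t decode_physical_size(uint16_t page_type,
                                     uint32_t full_page_size)
{
  if (!is_page_compressed(page_type))
    return full_page_size;
  return uint32_t(page_type & 0x7FFF) * 256;
}
```

For example, a 1000-byte compressed payload rounds up to 1024 bytes (4 units), so encode_page_type(1000) yields 0x8004 and decode_physical_size(0x8004, 16384) yields 1024. The checksum would then occupy the last 4 bytes of that 1024-byte image.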
7 years ago
MDEV-12026: Implement innodb_checksum_algorithm=full_crc32

MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages.

Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums.

We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, no matter what the parameter innodb_checksum_algorithm is assigned to.

For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32.

ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them.

The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that the checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page.

In the full_crc32 format, there are no longer separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that are written to the file.

We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644.

This is joint work with Marko Mäkelä.
7 years ago
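The full_crc32 idea (one CRC-32C over everything except the trailing checksum bytes) can be sketched as follows. This is an illustration under stated assumptions, not the InnoDB implementation: the function names are hypothetical, the checksum is assumed to occupy the last 4 bytes in big-endian order, and a plain bitwise CRC-32C is used where a production server would presumably use a faster variant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
uint32_t crc32c(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++) {
      // Conditionally XOR the polynomial when the low bit is set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Hypothetical helper: checksum all bytes except the last 4, then
// store the result in those last 4 bytes (big-endian is an assumption).
void page_store_checksum(uint8_t* page, size_t size) {
  assert(size > 4);
  const uint32_t c = crc32c(page, size - 4);
  page[size - 4] = uint8_t(c >> 24);
  page[size - 3] = uint8_t(c >> 16);
  page[size - 2] = uint8_t(c >> 8);
  page[size - 1] = uint8_t(c);
}

// Hypothetical helper: recompute the checksum and compare it with the
// stored value. No decryption or decompression is involved.
bool page_checksum_ok(const uint8_t* page, size_t size) {
  assert(size > 4);
  const uint32_t stored = uint32_t(page[size - 4]) << 24
                        | uint32_t(page[size - 3]) << 16
                        | uint32_t(page[size - 2]) << 8
                        | uint32_t(page[size - 1]);
  return stored == crc32c(page, size - 4);
}
```

Since the checksum covers the bytes exactly as they are written to the file, the same check applies whether or not the payload is encrypted.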
MDEV-18644: Support full_crc32 for page_compressed

This is a follow-up task to MDEV-12026, which introduced innodb_checksum_algorithm=full_crc32 and a simpler page format. MDEV-12026 did not enable full_crc32 for page_compressed tables, which we will be doing now.

This is joint work with Thirunarayanan Balathandayuthapani.

For innodb_checksum_algorithm=full_crc32 we change the page_compressed format as follows:

FIL_PAGE_TYPE: The most significant bit will be set to indicate page_compressed format. The least significant bits will contain the compressed page size, rounded up to a multiple of 256 bytes.

The checksum will be stored in the last 4 bytes of the page (whether it is the full page or a page_compressed page whose size is determined by FIL_PAGE_TYPE), covering all preceding bytes of the page. If encryption is used, then the page will be encrypted between compression and computing the checksum. For page_compressed, FIL_PAGE_LSN will not be repeated at the end of the page.

FSP_SPACE_FLAGS (already implemented as part of MDEV-12026): We will store the innodb_compression_algorithm that may be used to compress pages. Previously, the choice of algorithm was written to each compressed data page separately, and one would be unable to know in advance which compression algorithm(s) are used.

fil_space_t::full_crc32_page_compressed_len(): Determine if the page_compressed algorithm of the tablespace needs to know the exact length of the compressed data. If yes, we will reserve and write an extra byte for this right before the checksum.

buf_page_is_compressed(): Determine if a page uses page_compressed (in any innodb_checksum_algorithm).

fil_page_decompress(): Pass also fil_space_t::flags so that the format can be determined.

buf_page_is_zeroes(): Check if a page is full of zero bytes.

buf_page_full_crc32_is_corrupted(): Renamed from buf_encrypted_full_crc32_page_is_corrupted(). For full_crc32, we always simply validate the checksum against the page contents, while the physical page size is explicitly specified by an unencrypted part of the page header.

buf_page_full_crc32_size(): Determine the size of a full_crc32 page.

buf_dblwr_check_page_lsn(): Make this a debug-only function, because it involves potentially costly lookups of fil_space_t.

create_table_info_t::check_table_options(), ha_innobase::check_if_supported_inplace_alter(): Do allow the creation of SPATIAL INDEX with full_crc32 also when page_compressed is used.

commit_cache_norebuild(): Preserve the compression algorithm when updating the page_compression_level.

dict_tf_to_fsp_flags(): Set the flags for page compression algorithm.

FIXME: Maybe there should be a table option page_compression_algorithm and a session variable to back it?
7 years ago
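The FIL_PAGE_TYPE encoding described for page_compressed pages in the full_crc32 format could look roughly like the sketch below. The commit message only says that the most significant bit is a marker and that the remaining bits carry the size rounded up to a multiple of 256 bytes. Treating the field as 16 bits wide and storing the size in units of 256 bytes are assumptions of this example, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the page type field is 16 bits and bit 15 is the marker.
constexpr uint16_t COMPRESSED_MARKER = 1U << 15;

// Round a byte count up to the next multiple of 256.
constexpr uint32_t round_up_256(uint32_t n) {
  return (n + 255) & ~uint32_t{255};
}

// Encode the physical size of a compressed page into the type field.
// Assumption: the low 15 bits hold the size in units of 256 bytes.
uint16_t encode_page_type(uint32_t compressed_size) {
  const uint32_t units = round_up_256(compressed_size) >> 8;
  assert(units > 0 && units < (1U << 15));
  return uint16_t(COMPRESSED_MARKER | units);
}

// True if the type field carries the page_compressed marker.
bool is_page_compressed(uint16_t type) {
  return (type & COMPRESSED_MARKER) != 0;
}

// Recover the physical size in bytes from an encoded type field.
uint32_t physical_size(uint16_t type) {
  return uint32_t(type & ~COMPRESSED_MARKER) << 8;
}
```

With the size readable from an unencrypted header field, a reader can locate the trailing checksum and validate the page before any decryption or decompression takes place.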
MDEV-12266: Change dict_table_t::space to fil_space_t*

InnoDB always keeps all tablespaces in the fil_system cache. The fil_system.LRU is only for closing file handles; the fil_space_t and fil_node_t for all data files will remain in main memory. Between startup and shutdown, they can only be created and removed by DDL statements. Therefore, we can let dict_table_t::space point directly to the fil_space_t.

dict_table_t::space_id: A numeric tablespace ID for the corner cases where we do not have a tablespace. The most prominent examples are ALTER TABLE...DISCARD TABLESPACE or a missing or corrupted file.

There are a few functional differences; most notably:
(1) DROP TABLE will delete matching .ibd and .cfg files, even if they were not attached to the data dictionary.
(2) Some error messages will report file names instead of numeric IDs.

There still are many functions that use numeric tablespace IDs instead of fil_space_t*, and many functions could be converted to fil_space_t member functions. Also, Tablespace and Datafile should be merged with fil_space_t and fil_node_t. page_id_t and buf_page_get_gen() could use fil_space_t& instead of a numeric ID, and after moving to a single buffer pool (MDEV-15058), buf_pool_t::page_hash could be moved to fil_space_t::page_hash.

FilSpace: Remove. Only a few calls to fil_space_acquire() will remain, and gradually they should be removed.

mtr_t::set_named_space_id(ulint): Renamed from set_named_space(), to prevent accidental calls to this slower function. Very few callers remain.

fseg_create(), fsp_reserve_free_extents(): Take fil_space_t* as a parameter instead of a space_id.

fil_space_t::rename(): Wrapper for fil_rename_tablespace_check(), fil_name_write_rename(), fil_rename_tablespace(). Mariabackup passes the parameter log=false; InnoDB passes log=true.

dict_mem_table_create(): Take fil_space_t* instead of space_id as parameter.

dict_process_sys_tables_rec_and_mtr_commit(): Replace the parameter 'status' with 'bool cached'.

dict_get_and_save_data_dir_path(): Avoid copying the fil_node_t::name.

fil_ibd_open(): Return the tablespace.

fil_space_t::set_imported(): Replaces fil_space_set_imported().

truncate_t: Change many member function parameters to fil_space_t*, and remove page_size parameters.

row_truncate_prepare(): Merge into its only caller.

row_drop_table_from_cache(): Assert that the table is persistent.

dict_create_sys_indexes_tuple(): Write SYS_INDEXES.SPACE=FIL_NULL if the tablespace has been discarded.

row_import_update_discarded_flag(): Remove a constant parameter.
8 years ago
MDEV-12219 Discard temporary undo logs at transaction commit

Starting with MySQL 5.7, temporary tables in InnoDB are handled differently from persistent tables. Because temporary tables are private to a connection, concurrency control and multi-versioning (MVCC) are not applicable. For performance reasons, purge is disabled as well. Rollback is supported for temporary tables; that is why we have the temporary undo logs in the first place.

Because MVCC and purge are disabled for temporary tables, we should discard all temporary undo logs already at transaction commit, just like we discard the persistent insert_undo logs. Before this change, update_undo logs were being preserved.

trx_temp_undo_t: A wrapper for temporary undo logs, comprising a rollback segment and a single temporary undo log.

trx_rsegs_t::m_noredo: Use trx_temp_undo_t. (Instead of insert_undo, update_undo, there will be a single undo.)

trx_is_noredo_rseg_updated(), trx_is_rseg_assigned(): Remove.

trx_undo_add_page(): Remove the parameter undo_ptr. Acquire and release the rollback segment mutex inside the function.

trx_undo_free_last_page(): Remove the parameter trx.

trx_undo_truncate_end(): Remove the parameter trx, and add the parameter is_temp. Clean up the code a bit.

trx_undo_assign_undo(): Split the parameter undo_ptr into rseg, undo.

trx_undo_commit_cleanup(): Renamed from trx_undo_insert_cleanup(). Replace the parameter undo_ptr with undo. This will discard the temporary undo or insert_undo log at commit/rollback.

trx_purge_add_update_undo_to_history(), trx_undo_update_cleanup(): Remove 3 parameters. Always operate on the persistent update_undo.

trx_serialise(): Renamed from trx_serialisation_number_get().

trx_write_serialisation_history(): Simplify the code flow. If there are no persistent changes, do not update MONITOR_TRX_COMMIT_UNDO.

trx_commit_in_memory(): Simplify the logic, and add assertions.

trx_undo_page_report_modify(): Keep a direct reference to the persistent update_undo log.

trx_undo_report_row_operation(): Simplify some code. Always assign TRX_UNDO_INSERT for temporary undo logs.

trx_prepare_low(): Keep only one parameter. Prepare all 3 undo logs.

trx_roll_try_truncate(): Remove the parameter undo_ptr. Try to truncate all 3 undo logs of the transaction.

trx_roll_pop_top_rec_of_trx_low(): Remove.

trx_roll_pop_top_rec_of_trx(): Remove the redundant parameter trx->roll_limit. Clear roll_limit when exhausting the undo logs. Consider all 3 undo logs at once, prioritizing the persistent undo logs.

row_undo(): Minor cleanup. Let trx_roll_pop_top_rec_of_trx() reset the trx->roll_limit.
9 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC

Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed.

[TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.]

The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert).

When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present in the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0.

The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo logs will be preserved at transaction commit or rollback, to be removed by purge.

The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table.

This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0.

When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, a separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK.

Upgrade has been tested by starting up MariaDB 10.2 with
./mysql-test-run --manual-gdb innodb.read_only_recovery
and then starting up this patched server with and without --innodb-read-only.

trx_undo_ptr_t::undo: Renamed from update_undo.

trx_undo_ptr_t::old_insert: Renamed from insert_undo.

trx_rseg_t::undo_list: Renamed from update_undo_list.

trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached.

trx_rseg_t::old_insert_list: Renamed from insert_undo_list.

row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record.

trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log.

ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.)

row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0.

Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type.

MLOG_UNDO_HDR_REUSE: Remove. This changes the redo log format!

innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions.

The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following:
./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680
grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
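The visibility fast path for DB_TRX_ID=0 can be illustrated with a much simplified read view. The struct below is a hypothetical model written for this example, not the real ReadView class: it assumes a view is described by two ID limits plus a sorted list of transactions that were active when the view was created.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of an MVCC read view.
struct read_view {
  uint64_t up_limit_id;          // every ID below this had committed
  uint64_t low_limit_id;         // IDs at or above this started later
  std::vector<uint64_t> active;  // sorted IDs active at view creation

  // Decide whether changes made by transaction `id` are visible.
  bool changes_visible(uint64_t id) const {
    if (id == 0) {
      // Reset by purge: the history is gone, so every view sees the
      // record. No transaction lookup is needed.
      return true;
    }
    if (id < up_limit_id) {
      return true;   // committed before the view was created
    }
    if (id >= low_limit_id) {
      return false;  // started after the view was created
    }
    // In between: visible unless the transaction was still active.
    return !std::binary_search(active.begin(), active.end(), id);
  }
};
```

In this model the id == 0 branch is subsumed by the id < up_limit_id comparison whenever up_limit_id is positive; it is spelled out separately only to make the fast path explicit.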
MDEV-12266: Change dict_table_t::space to fil_space_t* InnoDB always keeps all tablespaces in the fil_system cache. The fil_system.LRU is only for closing file handles; the fil_space_t and fil_node_t for all data files will remain in main memory. Between startup to shutdown, they can only be created and removed by DDL statements. Therefore, we can let dict_table_t::space point directly to the fil_space_t. dict_table_t::space_id: A numeric tablespace ID for the corner cases where we do not have a tablespace. The most prominent examples are ALTER TABLE...DISCARD TABLESPACE or a missing or corrupted file. There are a few functional differences; most notably: (1) DROP TABLE will delete matching .ibd and .cfg files, even if they were not attached to the data dictionary. (2) Some error messages will report file names instead of numeric IDs. There still are many functions that use numeric tablespace IDs instead of fil_space_t*, and many functions could be converted to fil_space_t member functions. Also, Tablespace and Datafile should be merged with fil_space_t and fil_node_t. page_id_t and buf_page_get_gen() could use fil_space_t& instead of a numeric ID, and after moving to a single buffer pool (MDEV-15058), buf_pool_t::page_hash could be moved to fil_space_t::page_hash. FilSpace: Remove. Only few calls to fil_space_acquire() will remain, and gradually they should be removed. mtr_t::set_named_space_id(ulint): Renamed from set_named_space(), to prevent accidental calls to this slower function. Very few callers remain. fseg_create(), fsp_reserve_free_extents(): Take fil_space_t* as a parameter instead of a space_id. fil_space_t::rename(): Wrapper for fil_rename_tablespace_check(), fil_name_write_rename(), fil_rename_tablespace(). Mariabackup passes the parameter log=false; InnoDB passes log=true. dict_mem_table_create(): Take fil_space_t* instead of space_id as parameter. dict_process_sys_tables_rec_and_mtr_commit(): Replace the parameter 'status' with 'bool cached'. 
dict_get_and_save_data_dir_path(): Avoid copying the fil_node_t::name. fil_ibd_open(): Return the tablespace. fil_space_t::set_imported(): Replaces fil_space_set_imported(). truncate_t: Change many member function parameters to fil_space_t*, and remove page_size parameters. row_truncate_prepare(): Merge to its only caller. row_drop_table_from_cache(): Assert that the table is persistent. dict_create_sys_indexes_tuple(): Write SYS_INDEXES.SPACE=FIL_NULL if the tablespace has been discarded. row_import_update_discarded_flag(): Remove a constant parameter.
8 years ago
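The relationship between dict_table_t::space and dict_table_t::space_id described in the MDEV-12266 message can be pictured with a small sketch. Everything below is hypothetical and only loosely modelled on the description: ToySpace, ToyTable and toy_describe_space() are illustrative names, not InnoDB types or functions. The sketch assumes that a discarded or missing tablespace is represented by a null pointer, and that diagnostics prefer the file name when the tablespace object is available.

```cpp
#include <cassert>   // used only by the usage example below
#include <cstdint>
#include <string>

// Hypothetical stand-in for fil_space_t: an in-memory tablespace object.
struct ToySpace {
	std::uint32_t	id;	// numeric tablespace ID
	std::string	name;	// data file name
};

// Hypothetical stand-in for dict_table_t: a direct pointer to the
// tablespace, plus the numeric ID for the corner cases where no
// tablespace object exists (discarded, missing or corrupted file).
struct ToyTable {
	ToySpace*	space;		// nullptr when there is no tablespace
	std::uint32_t	space_id;	// always available
};

// Prefer the file name in diagnostics; fall back to the numeric ID.
inline std::string toy_describe_space(const ToyTable& table)
{
	return table.space
		? table.space->name
		: "tablespace " + std::to_string(table.space_id);
}
```

With a ToySpace named "./test/t1.ibd" and ID 7, toy_describe_space() returns the file name for a table that points at it, and "tablespace 7" for a table whose space pointer is null.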
MDEV-12253: Buffer pool blocks are accessed after they have been freed Problem was that bpage was referenced after it was already freed from LRU. Fixed by adding a new variable encrypted that is passed down to buf_page_check_corrupt() and used in buf_page_get_gen() to stop processing page read. This patch should also address following test failures and bugs: MDEV-12419: IMPORT should not look up tablespace in PageConverter::validate(). This is now removed. MDEV-10099: encryption.innodb_onlinealter_encryption fails sporadically in buildbot MDEV-11420: encryption.innodb_encryption-page-compression failed in buildbot MDEV-11222: encryption.encrypt_and_grep failed in buildbot on P8 Removed dict_table_t::is_encrypted and dict_table_t::ibd_file_missing and replaced these with dict_table_t::file_unreadable. Table ibd file is missing if fil_get_space(space_id) returns NULL and encrypted if not. Removed dict_table_t::is_corrupted field. Ported FilSpace class from 10.2 and using that on buf_page_check_corrupt(), buf_page_decrypt_after_read(), buf_page_encrypt_before_write(), buf_dblwr_process(), buf_read_page(), dict_stats_save_defrag_stats(). Added test cases when encrypted page could be read while doing redo log crash recovery. Also added test case for row compressed blobs. btr_cur_open_at_index_side_func(), btr_cur_open_at_rnd_pos_func(): Avoid referencing block that is NULL. buf_page_get_zip(): Issue error if page read fails. buf_page_get_gen(): Use dberr_t for error detection and do not reference bpage after we have freed it. buf_mark_space_corrupt(): remove bpage from LRU also when it is encrypted. buf_page_check_corrupt(): @return DB_SUCCESS if page has been read and is not corrupted, DB_PAGE_CORRUPTED if page based on checksum check is corrupted, DB_DECRYPTION_FAILED if page post encryption checksum matches but after decryption normal page checksum does not match. In read case only DB_SUCCESS is possible. buf_page_io_complete(): use dberr_t for error handling. 
buf_flush_write_block_low(), buf_read_ahead_random(), buf_read_page_async(), buf_read_ahead_linear(), buf_read_ibuf_merge_pages(), buf_read_recv_pages(), fil_aio_wait(): Issue error if page read fails. btr_pcur_move_to_next_page(): Do not reference page if it is NULL. Introduced dict_table_t::is_readable() and dict_index_t::is_readable() that will return true if tablespace exists and pages read from tablespace are not corrupted or page decryption failed. Removed buf_page_t::key_version. After page decryption the key version is not removed from page frame. For unencrypted pages, old key_version is removed at buf_page_encrypt_before_write() dict_stats_update_transient_for_index(), dict_stats_update_transient() Do not continue if table decryption failed or table is corrupted. dict0stats.cc: Introduced a dict_stats_report_error function to avoid code duplication. fil_parse_write_crypt_data(): Check that key read from redo log entry is found from encryption plugin and if it is not, refuse to start. PageConverter::validate(): Removed access to fil_space_t as tablespace is not available during import. Fixed error code on innodb.innodb test. Merged test cases innodb-bad-key-change5 and innodb-bad-key-shutdown to innodb-bad-key-change2. Removed innodb-bad-key-change5 test. Decreased unnecessary complexity on some long lasting tests. Removed fil_inc_pending_ops(), fil_decr_pending_ops(), fil_get_first_space(), fil_get_next_space(), fil_get_first_space_safe(), fil_get_next_space_safe() functions. fil_space_verify_crypt_checksum(): Fixed bug found using ASAN where FIL_PAGE_END_LSN_OLD_CHECKSUM field was incorrectly accessed from row compressed tables. Fixed out of page frame bug for row compressed tables in fil_space_verify_crypt_checksum() found using ASAN. Incorrect function was called for compressed table. Added new tests for discard, rename table and drop (we should allow them even when page decryption fails). Alter table rename is not allowed. 
Added test for restart with innodb-force-recovery=1 when page read on redo-recovery cannot be decrypted. Added test for corrupted table where both page data and FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION is corrupted. Adjusted the test case innodb_bug14147491 so that it no longer expects a crash. Instead table is just mostly not usable. fil0fil.h: fil_space_acquire_low is not visible function and fil_space_acquire and fil_space_acquire_silent are inline functions. FilSpace class uses fil_space_acquire_low directly. recv_apply_hashed_log_recs() does not return anything.
9 years ago
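The three return values listed for buf_page_check_corrupt() in the MDEV-12253 message can be read as a small decision table. The sketch below is a toy model of that description, not the InnoDB function: ToyErr, ToyPageCheck and toy_check_corrupt() are hypothetical names, the checksum results are passed in as plain booleans, and the treatment of an encrypted page whose post-encryption checksum fails (reported here as corrupted) is an assumption that the message does not spell out.

```cpp
#include <cassert>   // used only by the usage example below

// Hypothetical error codes mirroring the three outcomes in the message.
enum class ToyErr { Success, PageCorrupted, DecryptionFailed };

// Hypothetical inputs: results of checks performed elsewhere.
struct ToyPageCheck {
	bool	encrypted;			// page was stored encrypted
	bool	post_encryption_checksum_ok;	// checksum over encrypted bytes
	bool	page_checksum_ok;		// normal checksum (after decryption)
};

inline ToyErr toy_check_corrupt(const ToyPageCheck& page)
{
	if (!page.encrypted) {
		// Unencrypted page: only the normal page checksum matters.
		return page.page_checksum_ok
			? ToyErr::Success : ToyErr::PageCorrupted;
	}

	if (!page.post_encryption_checksum_ok) {
		// Assumption: damaged encrypted bytes count as corruption.
		return ToyErr::PageCorrupted;
	}

	// The encrypted bytes are intact, so a bad checksum after
	// decryption points at a decryption failure (for example,
	// a wrong key), not at on-disk corruption.
	return page.page_checksum_ok
		? ToyErr::Success : ToyErr::DecryptionFailed;
}
```

For example, an encrypted page with a matching post-encryption checksum but a failing normal checksum yields ToyErr::DecryptionFailed, while an unencrypted page with a failing checksum yields ToyErr::PageCorrupted.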
MDEV-6076 Persistent AUTO_INCREMENT for InnoDB This should be functionally equivalent to WL#6204 in MySQL 8.0.0, with the notable difference that the file format changes are limited to repurposing a previously unused data field in B-tree pages. For persistent InnoDB tables, write the last used AUTO_INCREMENT value to the root page of the clustered index, in the previously unused (0) PAGE_MAX_TRX_ID field, now aliased as PAGE_ROOT_AUTO_INC. Unlike some other previously unused InnoDB data fields, this one was actually always zero-initialized, at least since MySQL 3.23.49. The writes to PAGE_ROOT_AUTO_INC are protected by SX or X latch on the root page. The SX latch will allow concurrent read access to the root page. (The field PAGE_ROOT_AUTO_INC will only be read on the first-time call to ha_innobase::open() from the SQL layer. The PAGE_ROOT_AUTO_INC can only be updated when executing SQL, so read/write races are not possible.) During INSERT, the PAGE_ROOT_AUTO_INC is updated by the low-level function btr_cur_search_to_nth_level(), adding no extra page access. [Adaptive hash index lookup will be disabled during INSERT.] If some rare UPDATE modifies an AUTO_INCREMENT column, the PAGE_ROOT_AUTO_INC will be adjusted in a separate mini-transaction in ha_innobase::update_row(). When a page is reorganized, we have to preserve the PAGE_ROOT_AUTO_INC field. During ALTER TABLE, the initial AUTO_INCREMENT value will be copied from the table. ALGORITHM=COPY and online log apply in LOCK=NONE will update PAGE_ROOT_AUTO_INC in real time. innodb_col_no(): Determine the dict_table_t::cols[] element index corresponding to a Field of a non-virtual column. (The MySQL 5.7 implementation of virtual columns breaks the 1:1 relationship between Field::field_index and dict_table_t::cols[]. Virtual columns are omitted from dict_table_t::cols[]. Therefore, we must translate the field_index of AUTO_INCREMENT columns into an index of dict_table_t::cols[].) 
Upgrade from old data files: By default, the AUTO_INCREMENT sequence in old data files would appear to be reset, because PAGE_MAX_TRX_ID or PAGE_ROOT_AUTO_INC would contain the value 0 in each clustered index page. In new data files, PAGE_ROOT_AUTO_INC can only be 0 if the table is empty or does not contain any AUTO_INCREMENT column. For backward compatibility, we use the old method of SELECT MAX(auto_increment_column) for initializing the sequence. btr_read_autoinc(): Read the AUTO_INCREMENT sequence from a new-format data file. btr_read_autoinc_with_fallback(): A variant of btr_read_autoinc() that will resort to reading MAX(auto_increment_column) for data files that did not use AUTO_INCREMENT yet. It was manually tested that during the execution of innodb.autoinc_persist the compatibility logic is not activated (for new files, PAGE_ROOT_AUTO_INC is never 0 in nonempty clustered index root pages). initialize_auto_increment(): Replaces ha_innobase::innobase_initialize_autoinc(). This initializes the AUTO_INCREMENT metadata. Only called from ha_innobase::open(). ha_innobase::info_low(): Do not try to lazily initialize dict_table_t::autoinc. It must already have been initialized by ha_innobase::open() or ha_innobase::create(). Note: The adjustments to class ha_innopart were not tested, because the source code (native InnoDB partitioning) is not being compiled.
9 years ago
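The upgrade logic described in the MDEV-6076 message (trust PAGE_ROOT_AUTO_INC when it is nonzero, otherwise fall back to the maximum existing column value) can be sketched as follows. This is a minimal illustration under stated assumptions, not the real btr_read_autoinc() or btr_read_autoinc_with_fallback(): ToyClusteredIndex and both toy_ functions are hypothetical, and the table contents are modelled as a plain vector of AUTO_INCREMENT column values.

```cpp
#include <algorithm>
#include <cassert>   // used only by the usage example below
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a clustered index and its root page.
struct ToyClusteredIndex {
	// Models PAGE_ROOT_AUTO_INC; 0 in files written before the change.
	std::uint64_t			page_root_auto_inc;
	// Models the AUTO_INCREMENT column values of the existing rows.
	std::vector<std::uint64_t>	autoinc_values;
};

// New-format read: return the persisted value as is.
inline std::uint64_t toy_read_autoinc(const ToyClusteredIndex& index)
{
	return index.page_root_auto_inc;
}

// Read with fallback: if the persisted value is 0, the file may predate
// the persistent counter, so compute MAX(auto_increment_column) instead.
inline std::uint64_t
toy_read_autoinc_with_fallback(const ToyClusteredIndex& index)
{
	if (std::uint64_t persisted = toy_read_autoinc(index)) {
		return persisted;
	}

	if (index.autoinc_values.empty()) {
		return 0;	// empty table: nothing to recover
	}

	return *std::max_element(index.autoinc_values.begin(),
				 index.autoinc_values.end());
}
```

With a persisted value of 42 the function returns 42 regardless of the rows; with a persisted value of 0 and rows {5, 9, 7} it returns 9; with a persisted value of 0 and no rows it returns 0.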
  1. /*****************************************************************************
  2. Copyright (c) 2012, 2016, Oracle and/or its affiliates. All Rights Reserved.
  3. Copyright (c) 2015, 2019, MariaDB Corporation.
  4. This program is free software; you can redistribute it and/or modify it under
  5. the terms of the GNU General Public License as published by the Free Software
  6. Foundation; version 2 of the License.
  7. This program is distributed in the hope that it will be useful, but WITHOUT
  8. ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
  9. FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
  10. You should have received a copy of the GNU General Public License along with
  11. this program; if not, write to the Free Software Foundation, Inc.,
  12. 51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA
  13. *****************************************************************************/
  14. /**************************************************//**
  15. @file row/row0import.cc
  16. Import a tablespace to a running instance.
  17. Created 2012-02-08 by Sunny Bains.
  18. *******************************************************/
  19. #include "row0import.h"
  20. #include "btr0pcur.h"
  21. #include "que0que.h"
  22. #include "dict0boot.h"
  23. #include "dict0load.h"
  24. #include "ibuf0ibuf.h"
  25. #include "pars0pars.h"
  26. #include "row0sel.h"
  27. #include "row0mysql.h"
  28. #include "srv0start.h"
  29. #include "row0quiesce.h"
  30. #include "fil0pagecompress.h"
  31. #include "trx0undo.h"
  32. #ifdef HAVE_LZO
  33. #include "lzo/lzo1x.h"
  34. #endif
  35. #ifdef HAVE_SNAPPY
  36. #include "snappy-c.h"
  37. #endif
  38. #include <vector>
  39. #ifdef HAVE_MY_AES_H
  40. #include <my_aes.h>
  41. #endif
  42. /** The size of the buffer to use for IO.
  43. @param n physical page size
  44. @return number of pages */
  45. #define IO_BUFFER_SIZE(n) ((1024 * 1024) / (n))
  46. /** For gathering stats on records during phase I */
  47. struct row_stats_t {
  48. ulint m_n_deleted; /*!< Number of deleted records
  49. found in the index */
  50. ulint m_n_purged; /*!< Number of records purged
  51. optimistically */
  52. ulint m_n_rows; /*!< Number of rows */
  53. ulint m_n_purge_failed; /*!< Number of deleted rows
  54. that could not be purged */
  55. };
  56. /** Index information required by IMPORT. */
  57. struct row_index_t {
  58. index_id_t m_id; /*!< Index id of the table
  59. in the exporting server */
  60. byte* m_name; /*!< Index name */
  61. ulint m_space; /*!< Space where it is placed */
  62. ulint m_page_no; /*!< Root page number */
  63. ulint m_type; /*!< Index type */
  64. ulint m_trx_id_offset; /*!< Relevant only for clustered
  65. indexes, offset of transaction
  66. id system column */
  67. ulint m_n_user_defined_cols; /*!< User defined columns */
  68. ulint m_n_uniq; /*!< Number of columns that can
  69. uniquely identify the row */
  70. ulint m_n_nullable; /*!< Number of nullable
  71. columns */
  72. ulint m_n_fields; /*!< Total number of fields */
  73. dict_field_t* m_fields; /*!< Index fields */
  74. const dict_index_t*
  75. m_srv_index; /*!< Index instance in the
  76. importing server */
  77. row_stats_t m_stats; /*!< Statistics gathered during
  78. the import phase */
  79. };
  80. /** Meta data required by IMPORT. */
  81. struct row_import {
  82. row_import() UNIV_NOTHROW
  83. :
  84. m_table(NULL),
  85. m_version(0),
  86. m_hostname(NULL),
  87. m_table_name(NULL),
  88. m_autoinc(0),
  89. m_zip_size(0),
  90. m_flags(0),
  91. m_n_cols(0),
  92. m_cols(NULL),
  93. m_col_names(NULL),
  94. m_n_indexes(0),
  95. m_indexes(NULL),
  96. m_missing(true) { }
  97. ~row_import() UNIV_NOTHROW;
  98. /** Find the index entry in the indexes array.
  99. @param name index name
  100. @return instance if found else 0. */
  101. row_index_t* get_index(const char* name) const UNIV_NOTHROW;
  102. /** Get the number of rows in the index.
  103. @param name index name
  104. @return number of rows (doesn't include delete marked rows). */
  105. ulint get_n_rows(const char* name) const UNIV_NOTHROW;
  106. /** Find the ordinal value of the column name in the cfg table columns.
  107. @param name of column to look for.
  108. @return ULINT_UNDEFINED if not found. */
  109. ulint find_col(const char* name) const UNIV_NOTHROW;
  110. /** Get the number of rows for which purge failed during the
  111. convert phase.
  112. @param name index name
  113. @return number of rows for which purge failed. */
  114. ulint get_n_purge_failed(const char* name) const UNIV_NOTHROW;
  115. /** Check if the index is clean, i.e. has no delete-marked records.
  116. @param name index name
  117. @return true if index needs to be purged. */
  118. bool requires_purge(const char* name) const UNIV_NOTHROW
  119. {
  120. return(get_n_purge_failed(name) > 0);
  121. }
  122. /** Set the index root <space, pageno> using the index name */
  123. void set_root_by_name() UNIV_NOTHROW;
  124. /** Set the index root <space, pageno> using a heuristic
  125. @return DB_SUCCESS or error code */
  126. dberr_t set_root_by_heuristic() UNIV_NOTHROW;
  127. /** Check if the index schema that was read from the .cfg file
  128. matches the in memory index definition.
  129. Note: It will update row_index_t::m_srv_index to map the meta-data
  130. read from the .cfg file to the server index instance.
  131. @return DB_SUCCESS or error code. */
  132. dberr_t match_index_columns(
  133. THD* thd,
  134. const dict_index_t* index) UNIV_NOTHROW;
  135. /** Check if the table schema that was read from the .cfg file
  136. matches the in memory table definition.
  137. @param thd MySQL session variable
  138. @return DB_SUCCESS or error code. */
  139. dberr_t match_table_columns(
  140. THD* thd) UNIV_NOTHROW;
  141. /** Check if the table (and index) schema that was read from the
  142. .cfg file matches the in memory table definition.
  143. @param thd MySQL session variable
  144. @return DB_SUCCESS or error code. */
  145. dberr_t match_schema(
  146. THD* thd) UNIV_NOTHROW;
  147. dict_table_t* m_table; /*!< Table instance */
  148. ulint m_version; /*!< Version of config file */
  149. byte* m_hostname; /*!< Hostname where the
  150. tablespace was exported */
  151. byte* m_table_name; /*!< Exporting instance table
  152. name */
  153. ib_uint64_t m_autoinc; /*!< Next autoinc value */
  154. ulint m_zip_size; /*!< ROW_FORMAT=COMPRESSED
  155. page size, or 0 */
  156. ulint m_flags; /*!< Table flags */
  157. ulint m_n_cols; /*!< Number of columns in the
  158. meta-data file */
  159. dict_col_t* m_cols; /*!< Column data */
  160. byte** m_col_names; /*!< Column names, we store the
  161. column names separately because
  162. there is no field to store the
  163. value in dict_col_t */
  164. ulint m_n_indexes; /*!< Number of indexes,
  165. including clustered index */
  166. row_index_t* m_indexes; /*!< Index meta data */
  167. bool m_missing; /*!< true if no readable .cfg
  168. file was found */
  169. };
  170. /** Use the page cursor to iterate over records in a block. */
  171. class RecIterator {
  172. public:
  173. /** Default constructor */
  174. RecIterator() UNIV_NOTHROW
  175. {
  176. memset(&m_cur, 0x0, sizeof(m_cur));
  177. }
  178. /** Position the cursor on the first user record. */
  179. void open(buf_block_t* block) UNIV_NOTHROW
  180. {
  181. page_cur_set_before_first(block, &m_cur);
  182. if (!end()) {
  183. next();
  184. }
  185. }
  186. /** Move to the next record. */
  187. void next() UNIV_NOTHROW
  188. {
  189. page_cur_move_to_next(&m_cur);
  190. }
  191. /**
  192. @return the current record */
  193. rec_t* current() UNIV_NOTHROW
  194. {
  195. ut_ad(!end());
  196. return(page_cur_get_rec(&m_cur));
  197. }
  198. /**
  199. @return true if cursor is at the end */
  200. bool end() UNIV_NOTHROW
  201. {
  202. return(page_cur_is_after_last(&m_cur) == TRUE);
  203. }
  204. /** Remove the current record
  205. @return true on success */
  206. bool remove(
  207. const dict_index_t* index,
  208. page_zip_des_t* page_zip,
  209. ulint* offsets) UNIV_NOTHROW
  210. {
  211. /* We can't end up with an empty page unless it is root. */
  212. if (page_get_n_recs(m_cur.block->frame) <= 1) {
  213. return(false);
  214. }
  215. return(page_delete_rec(index, &m_cur, page_zip, offsets));
  216. }
  217. private:
  218. page_cur_t m_cur;
  219. };
  220. /** Class that purges delete-marked records from indexes, both secondary
  221. and clustered. It does a pessimistic delete. This should only be done if we
  222. couldn't purge the delete-marked records during Phase I. */
  223. class IndexPurge {
  224. public:
  225. /** Constructor
  226. @param trx the user transaction covering the import tablespace
  227. @param index index to be imported; its delete-marked
  228. records are purged */
  229. IndexPurge(
  230. trx_t* trx,
  231. dict_index_t* index) UNIV_NOTHROW
  232. :
  233. m_trx(trx),
  234. m_index(index),
  235. m_n_rows(0)
  236. {
  237. ib::info() << "Phase II - Purge records from index "
  238. << index->name;
  239. }
  240. /** Destructor */
  241. ~IndexPurge() UNIV_NOTHROW { }
  242. /** Purge delete marked records.
  243. @return DB_SUCCESS or error code. */
  244. dberr_t garbage_collect() UNIV_NOTHROW;
  245. /** The number of records that are not delete marked.
  246. @return total records in the index after purge */
  247. ulint get_n_rows() const UNIV_NOTHROW
  248. {
  249. return(m_n_rows);
  250. }
  251. private:
  252. /** Begin import, position the cursor on the first record. */
  253. void open() UNIV_NOTHROW;
  254. /** Close the persistent cursor and commit the mini-transaction. */
  255. void close() UNIV_NOTHROW;
  256. /** Position the cursor on the next record.
  257. @return DB_SUCCESS or error code */
  258. dberr_t next() UNIV_NOTHROW;
  259. /** Store the persistent cursor position and reopen the
  260. B-tree cursor in BTR_MODIFY_TREE mode, because the
  261. tree structure may be changed during a pessimistic delete. */
  262. void purge_pessimistic_delete() UNIV_NOTHROW;
  263. /**
  264. Purge delete-marked records. */
  265. void purge() UNIV_NOTHROW;
  266. protected:
  267. // Disable copying
  268. IndexPurge();
  269. IndexPurge(const IndexPurge&);
  270. IndexPurge &operator=(const IndexPurge&);
  271. private:
  272. trx_t* m_trx; /*!< User transaction */
  273. mtr_t m_mtr; /*!< Mini-transaction */
  274. btr_pcur_t m_pcur; /*!< Persistent cursor */
  275. dict_index_t* m_index; /*!< Index to be processed */
  276. ulint m_n_rows; /*!< Records in index */
  277. };
  278. /** Functor that is called for each physical page that is read from the
  279. tablespace file. */
  280. class AbstractCallback
  281. {
  282. public:
  283. /** Constructor
  284. @param trx covering transaction */
  285. AbstractCallback(trx_t* trx, ulint space_id)
  286. :
  287. m_zip_size(0),
  288. m_trx(trx),
  289. m_space(space_id),
  290. m_xdes(),
  291. m_xdes_page_no(ULINT_UNDEFINED),
  292. m_space_flags(ULINT_UNDEFINED) UNIV_NOTHROW { }
  293. /** Free any extent descriptor instance */
  294. virtual ~AbstractCallback()
  295. {
  296. UT_DELETE_ARRAY(m_xdes);
  297. }
  298. /** Determine the page size to use for traversing the tablespace
  299. @param file_size size of the tablespace file in bytes
  300. @param block contents of the first page in the tablespace file.
  301. @retval DB_SUCCESS or error code. */
  302. virtual dberr_t init(
  303. os_offset_t file_size,
  304. const buf_block_t* block) UNIV_NOTHROW;
  305. /** @return true if compressed table. */
  306. bool is_compressed_table() const UNIV_NOTHROW
  307. {
  308. return get_zip_size();
  309. }
  310. /** @return the tablespace flags */
  311. ulint get_space_flags() const
  312. {
  313. return(m_space_flags);
  314. }
  315. /**
  316. Set the name of the physical file and the file handle that is used
  317. to open it for the file that is being iterated over.
  318. @param filename the physical name of the tablespace file
  319. @param file OS file handle */
  320. void set_file(const char* filename, pfs_os_file_t file) UNIV_NOTHROW
  321. {
  322. m_file = file;
  323. m_filepath = filename;
  324. }
  325. ulint get_zip_size() const { return m_zip_size; }
  326. ulint physical_size() const
  327. {
  328. return m_zip_size ? m_zip_size : srv_page_size;
  329. }
  330. const char* filename() const { return m_filepath; }
  331. /**
  332. Called for every page in the tablespace. If the page was not
  333. updated then its state must be set to BUF_PAGE_NOT_USED. For
  334. compressed tables the page descriptor memory will be at offset:
  335. block->frame + srv_page_size;
  336. @param block block read from file, note it is not from the buffer pool
  337. @retval DB_SUCCESS or error code. */
  338. virtual dberr_t operator()(buf_block_t* block) UNIV_NOTHROW = 0;
  339. /** @return the tablespace identifier */
  340. ulint get_space_id() const { return m_space; }
  341. bool is_interrupted() const { return trx_is_interrupted(m_trx); }
  342. /**
  343. Get the data page depending on the table type, compressed or not.
  344. @param block - block read from disk
  345. @retval the buffer frame */
  346. static byte* get_frame(const buf_block_t* block)
  347. {
  348. return block->page.zip.data
  349. ? block->page.zip.data : block->frame;
  350. }
  351. protected:
  352. /** Get the physical offset of the extent descriptor within the page.
  353. @param page_no page number of the extent descriptor
  354. @param page contents of the page containing the extent descriptor.
  355. @return the start of the xdes array in a page */
  356. const xdes_t* xdes(
  357. ulint page_no,
  358. const page_t* page) const UNIV_NOTHROW
  359. {
  360. ulint offset;
  361. offset = xdes_calc_descriptor_index(get_zip_size(), page_no);
  362. return(page + XDES_ARR_OFFSET + XDES_SIZE * offset);
  363. }
  364. /** Set the current page directory (xdes). If the extent descriptor is
  365. marked as free then free the current extent descriptor and set it to
  366. 0. This implies that all pages that are covered by this extent
  367. descriptor are also freed.
  368. @param page_no offset of page within the file
  369. @param page page contents
  370. @return DB_SUCCESS or error code. */
  371. dberr_t set_current_xdes(
  372. ulint page_no,
  373. const page_t* page) UNIV_NOTHROW
  374. {
  375. m_xdes_page_no = page_no;
  376. UT_DELETE_ARRAY(m_xdes);
  377. m_xdes = NULL;
  378. if (mach_read_from_4(XDES_ARR_OFFSET + XDES_STATE + page)
  379. != XDES_FREE) {
  380. const ulint physical_size = m_zip_size
  381. ? m_zip_size : srv_page_size;
  382. m_xdes = UT_NEW_ARRAY_NOKEY(xdes_t, physical_size);
  383. /* Trigger OOM */
  384. DBUG_EXECUTE_IF(
  385. "ib_import_OOM_13",
  386. UT_DELETE_ARRAY(m_xdes);
  387. m_xdes = NULL;
  388. );
  389. if (m_xdes == NULL) {
  390. return(DB_OUT_OF_MEMORY);
  391. }
  392. memcpy(m_xdes, page, physical_size);
  393. }
  394. return(DB_SUCCESS);
  395. }
  396. /** Check if the page is marked as free in the extent descriptor.
  397. @param page_no page number to check in the extent descriptor.
  398. @return true if the page is marked as free */
  399. bool is_free(ulint page_no) const UNIV_NOTHROW
  400. {
  401. ut_a(xdes_calc_descriptor_page(get_zip_size(), page_no)
  402. == m_xdes_page_no);
  403. if (m_xdes != 0) {
  404. const xdes_t* xdesc = xdes(page_no, m_xdes);
  405. ulint pos = page_no % FSP_EXTENT_SIZE;
  406. return xdes_is_free(xdesc, pos);
  407. }
  408. /* If the current xdes was free, the page must be free. */
  409. return(true);
  410. }
  411. protected:
  412. /** The ROW_FORMAT=COMPRESSED page size, or 0. */
  413. ulint m_zip_size;
  414. /** File handle to the tablespace */
  415. pfs_os_file_t m_file;
  416. /** Physical file path. */
  417. const char* m_filepath;
  418. /** Covering transaction. */
  419. trx_t* m_trx;
  420. /** Space id of the file being iterated over. */
  421. ulint m_space;
  422. /** Minimum page number for which the free list has not been
  423. initialized: the pages >= this limit are, by definition, free;
  424. note that in a single-table tablespace where size < 64 pages,
  425. this number is 64, i.e., we have initialized the space about
  426. the first extent, but have not physically allocated those pages
  427. to the file. @see FSP_LIMIT. */
  428. ulint m_free_limit;
  429. /** Current size of the space in pages */
  430. ulint m_size;
  431. /** Current extent descriptor page */
  432. xdes_t* m_xdes;
  433. /** Physical page offset in the file of the extent descriptor */
  434. ulint m_xdes_page_no;
  435. /** Flags value read from the header page */
  436. ulint m_space_flags;
  437. };
  438. /** Determine the page size to use for traversing the tablespace
  439. @param file_size size of the tablespace file in bytes
  440. @param block contents of the first page in the tablespace file.
  441. @retval DB_SUCCESS or error code. */
  442. dberr_t
  443. AbstractCallback::init(
  444. os_offset_t file_size,
  445. const buf_block_t* block) UNIV_NOTHROW
  446. {
  447. const page_t* page = block->frame;
  448. m_space_flags = fsp_header_get_flags(page);
  449. if (!fil_space_t::is_valid_flags(m_space_flags, true)) {
  450. ulint cflags = fsp_flags_convert_from_101(m_space_flags);
  451. if (cflags == ULINT_UNDEFINED) {
  452. ib::error() << "Invalid FSP_SPACE_FLAGS="
  453. << ib::hex(m_space_flags);
  454. return(DB_CORRUPTION);
  455. }
  456. m_space_flags = cflags;
  457. }
  458. /* Clear the DATA_DIR flag, which is basically garbage. */
  459. m_space_flags &= ~(1U << FSP_FLAGS_POS_RESERVED);
  460. m_zip_size = fil_space_t::zip_size(m_space_flags);
  461. const ulint logical_size = fil_space_t::logical_size(m_space_flags);
  462. const ulint physical_size = fil_space_t::physical_size(m_space_flags);
  463. if (logical_size != srv_page_size) {
  464. ib::error() << "Page size " << logical_size
  465. << " of ibd file is not the same as the server page"
  466. " size " << srv_page_size;
  467. return(DB_CORRUPTION);
  468. } else if (file_size & (physical_size - 1)) {
  469. ib::error() << "File size " << file_size << " is not a"
  470. " multiple of the page size "
  471. << physical_size;
  472. return(DB_CORRUPTION);
  473. }
  474. m_size = mach_read_from_4(page + FSP_SIZE);
  475. m_free_limit = mach_read_from_4(page + FSP_FREE_LIMIT);
  476. if (m_space == ULINT_UNDEFINED) {
  477. m_space = mach_read_from_4(FSP_HEADER_OFFSET + FSP_SPACE_ID
  478. + page);
  479. }
  480. return set_current_xdes(0, page);
  481. }
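/* Illustrative note, not part of the original file: init() rejects a file
whose size is not a whole number of physical pages by testing
file_size & (physical_size - 1). That mask test is equivalent to a modulo
only when the page size is a power of two, which is assumed here. The helper
name is_page_aligned() is hypothetical. */

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the alignment test in init().
// Assumption: physical_size is a non-zero power of two, so that
// (physical_size - 1) is a mask of the low-order bits.
constexpr bool is_page_aligned(std::uint64_t file_size,
			       std::uint64_t physical_size)
{
	return (file_size & (physical_size - 1)) == 0;
}

// Six whole 16 KiB pages pass; 512 trailing bytes do not.
static_assert(is_page_aligned(6 * 16384, 16384), "whole pages");
static_assert(!is_page_aligned(6 * 16384 + 512, 16384), "partial page");
```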
  482. /**
  483. Try and determine the index root pages by checking if the next/prev
  484. pointers are both FIL_NULL. We need to ensure that we skip deleted pages. */
  485. struct FetchIndexRootPages : public AbstractCallback {
  486. /** Index information gathered from the .ibd file. */
  487. struct Index {
  488. Index(index_id_t id, ulint page_no)
  489. :
  490. m_id(id),
  491. m_page_no(page_no) { }
  492. index_id_t m_id; /*!< Index id */
  493. ulint m_page_no; /*!< Root page number */
  494. };
  495. typedef std::vector<Index, ut_allocator<Index> > Indexes;
  496. /** Constructor
  497. @param table table definition in server
  498. @param trx covering (user) transaction */
  499. FetchIndexRootPages(const dict_table_t* table, trx_t* trx)
  500. :
  501. AbstractCallback(trx, ULINT_UNDEFINED),
  502. m_table(table) UNIV_NOTHROW { }
  503. /** Destructor */
  504. ~FetchIndexRootPages() UNIV_NOTHROW override { }
  505. /** Called for each block as it is read from the file.
  506. @param block block to convert, it is not from the buffer pool.
  507. @retval DB_SUCCESS or error code. */
  508. dberr_t operator()(buf_block_t* block) UNIV_NOTHROW override;
  509. /** Update the import configuration that will be used to import
  510. the tablespace. */
  511. dberr_t build_row_import(row_import* cfg) const UNIV_NOTHROW;
  512. /** Table definition in server. */
  513. const dict_table_t* m_table;
  514. /** Index information */
  515. Indexes m_indexes;
  516. };
  517. /** Called for each block as it is read from the file. Check index pages to
  518. determine the exact row format. We can't get that from the tablespace
  519. header flags alone.
  520. @param block block to convert, it is not from the buffer pool.
  521. @retval DB_SUCCESS or error code. */
  522. dberr_t FetchIndexRootPages::operator()(buf_block_t* block) UNIV_NOTHROW
  523. {
  524. if (is_interrupted()) return DB_INTERRUPTED;
  525. const page_t* page = get_frame(block);
  526. ulint page_type = fil_page_get_type(page);
  527. if (page_type == FIL_PAGE_TYPE_XDES) {
  528. return set_current_xdes(block->page.id.page_no(), page);
  529. } else if (fil_page_index_page_check(page)
  530. && !is_free(block->page.id.page_no())
  531. && !page_has_siblings(page)) {
  532. index_id_t id = btr_page_get_index_id(page);
  533. m_indexes.push_back(Index(id, block->page.id.page_no()));
  534. if (m_indexes.size() == 1) {
  535. /* Check that the tablespace flags match the table flags. */
  536. ulint expected = dict_tf_to_fsp_flags(m_table->flags);
  537. if (!fsp_flags_match(expected, m_space_flags)) {
  538. ib_errf(m_trx->mysql_thd, IB_LOG_LEVEL_ERROR,
  539. ER_TABLE_SCHEMA_MISMATCH,
  540. "Expected FSP_SPACE_FLAGS=0x%x, .ibd "
  541. "file contains 0x%x.",
  542. unsigned(expected),
  543. unsigned(m_space_flags));
  544. return(DB_CORRUPTION);
  545. }
  546. }
  547. }
  548. return DB_SUCCESS;
  549. }
  550. /**
  551. Update the import configuration that will be used to import the tablespace.
  552. @return error code or DB_SUCCESS */
  553. dberr_t
  554. FetchIndexRootPages::build_row_import(row_import* cfg) const UNIV_NOTHROW
  555. {
  556. Indexes::const_iterator end = m_indexes.end();
  557. ut_a(cfg->m_table == m_table);
  558. cfg->m_zip_size = m_zip_size;
  559. cfg->m_n_indexes = m_indexes.size();
  560. if (cfg->m_n_indexes == 0) {
  561. ib::error() << "No B+Tree found in tablespace";
  562. return(DB_CORRUPTION);
  563. }
  564. cfg->m_indexes = UT_NEW_ARRAY_NOKEY(row_index_t, cfg->m_n_indexes);
  565. /* Trigger OOM */
  566. DBUG_EXECUTE_IF(
  567. "ib_import_OOM_11",
  568. UT_DELETE_ARRAY(cfg->m_indexes);
  569. cfg->m_indexes = NULL;
  570. );
  571. if (cfg->m_indexes == NULL) {
  572. return(DB_OUT_OF_MEMORY);
  573. }
  574. memset(cfg->m_indexes, 0x0, sizeof(*cfg->m_indexes) * cfg->m_n_indexes);
  575. row_index_t* cfg_index = cfg->m_indexes;
  576. for (Indexes::const_iterator it = m_indexes.begin();
  577. it != end;
  578. ++it, ++cfg_index) {
  579. char name[BUFSIZ];
  580. snprintf(name, sizeof(name), "index" IB_ID_FMT, it->m_id);
  581. ulint len = strlen(name) + 1;
  582. cfg_index->m_name = UT_NEW_ARRAY_NOKEY(byte, len);
  583. /* Trigger OOM */
  584. DBUG_EXECUTE_IF(
  585. "ib_import_OOM_12",
  586. UT_DELETE_ARRAY(cfg_index->m_name);
  587. cfg_index->m_name = NULL;
  588. );
  589. if (cfg_index->m_name == NULL) {
  590. return(DB_OUT_OF_MEMORY);
  591. }
  592. memcpy(cfg_index->m_name, name, len);
  593. cfg_index->m_id = it->m_id;
  594. cfg_index->m_space = m_space;
  595. cfg_index->m_page_no = it->m_page_no;
  596. }
  597. return(DB_SUCCESS);
  598. }
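/* Illustrative note, not part of the original file: when no .cfg metadata
is available, build_row_import() above gives each discovered root page a
placeholder name made of the literal "index" followed by the index id. The
sketch below reproduces only that naming step. It assumes that IB_ID_FMT
formats a 64-bit id in decimal; placeholder_index_name() is a hypothetical
helper. */

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: build the placeholder name "index<id>".
// Assumption: the id is printed as an unsigned 64-bit decimal number.
std::string placeholder_index_name(std::uint64_t id)
{
	char name[64];	// ample room for "index" plus 20 digits and NUL
	std::snprintf(name, sizeof name, "index%" PRIu64, id);
	return name;
}
```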
  599. /* Functor that is called for each physical page that is read from the
  600. tablespace file.
  601. 1. Check each page for corruption.
  602. 2. Update the space id and LSN on every page
  603. * For the header page
  604. - Validate the flags
  605. - Update the LSN
  606. 3. On Btree pages
  607. * Set the index id
  608. * Update the max trx id
  609. * In a cluster index, update the system columns
  610. * In a cluster index, update the BLOB ptr, set the space id
  611. * Purge delete marked records, but only if they can be easily
  612. removed from the page
  613. * Keep a counter of number of rows, ie. non-delete-marked rows
  614. * Keep a counter of number of delete marked rows
  615. * Keep a counter of number of purge failures
  616. * If a page is stamped with an index id that isn't in the .cfg file
  617. we assume it is deleted and the page can be ignored.
  618. 4. Set the page state to dirty so that it will be written to disk.
  619. */
  620. class PageConverter : public AbstractCallback {
  621. public:
  622. /** Constructor
  623. @param cfg config of table being imported.
  624. @param space_id tablespace identifier
  625. @param trx transaction covering the import */
  626. PageConverter(row_import* cfg, ulint space_id, trx_t* trx)
  627. :
  628. AbstractCallback(trx, space_id),
  629. m_cfg(cfg),
  630. m_index(cfg->m_indexes),
  631. m_page_zip_ptr(0),
  632. m_rec_iter(),
  633. m_offsets_(), m_offsets(m_offsets_),
  634. m_heap(0),
  635. m_cluster_index(dict_table_get_first_index(cfg->m_table))
  636. {
  637. rec_offs_init(m_offsets_);
  638. }
  639. ~PageConverter() UNIV_NOTHROW override
  640. {
  641. if (m_heap != 0) {
  642. mem_heap_free(m_heap);
  643. }
  644. }
  645. /** Called for each block as it is read from the file.
  646. @param block block to convert, it is not from the buffer pool.
  647. @retval DB_SUCCESS or error code. */
  648. dberr_t operator()(buf_block_t* block) UNIV_NOTHROW override;
  649. private:
  650. /** Update the page, set the space id, max trx id and index id.
  651. @param block block read from file
  652. @param page_type type of the page
  653. @retval DB_SUCCESS or error code */
  654. dberr_t update_page(buf_block_t* block, uint16_t& page_type)
  655. UNIV_NOTHROW;
  656. /** Update the space, index id, trx id.
  657. @param block block to convert
  658. @return DB_SUCCESS or error code */
  659. dberr_t update_index_page(buf_block_t* block) UNIV_NOTHROW;
  660. /** Update the BLOB references and write UNDO log entries for
  661. rows that can't be purged optimistically.
  662. @param block block to update
  663. @retval DB_SUCCESS or error code */
  664. dberr_t update_records(buf_block_t* block) UNIV_NOTHROW;
  665. /** Validate the space flags and update tablespace header page.
  666. @param block block read from file, not from the buffer pool.
  667. @retval DB_SUCCESS or error code */
  668. dberr_t update_header(buf_block_t* block) UNIV_NOTHROW;
  669. /** Adjust the BLOB reference for a single column that is externally stored
  670. @param rec record to update
  671. @param offsets column offsets for the record
  672. @param i column ordinal value
  673. @return DB_SUCCESS or error code */
  674. dberr_t adjust_cluster_index_blob_column(
  675. rec_t* rec,
  676. const ulint* offsets,
  677. ulint i) UNIV_NOTHROW;
  678. /** Adjusts the BLOB reference in the clustered index row for all
  679. externally stored columns.
  680. @param rec record to update
  681. @param offsets column offsets for the record
  682. @return DB_SUCCESS or error code */
  683. dberr_t adjust_cluster_index_blob_columns(
  684. rec_t* rec,
  685. const ulint* offsets) UNIV_NOTHROW;
  686. /** In the clustered index, adjust the BLOB pointers as needed.
  687. Also update the BLOB reference, write the new space id.
  688. @param rec record to update
  689. @param offsets column offsets for the record
  690. @return DB_SUCCESS or error code */
  691. dberr_t adjust_cluster_index_blob_ref(
  692. rec_t* rec,
  693. const ulint* offsets) UNIV_NOTHROW;
  694. /** Purge delete-marked records, only if it is possible to do
  695. so without re-organising the B+tree.
  696. @retval true if purged */
  697. bool purge() UNIV_NOTHROW;
  698. /** Adjust the BLOB references and sys fields for the current record.
  699. @param rec record to update
  700. @param offsets column offsets for the record
  701. @return DB_SUCCESS or error code. */
  702. dberr_t adjust_cluster_record(
  703. rec_t* rec,
  704. const ulint* offsets) UNIV_NOTHROW;
  705. /** Find an index with the matching id.
  706. @return row_index_t* instance or 0 */
  707. row_index_t* find_index(index_id_t id) UNIV_NOTHROW
  708. {
  709. row_index_t* index = &m_cfg->m_indexes[0];
  710. for (ulint i = 0; i < m_cfg->m_n_indexes; ++i, ++index) {
  711. if (id == index->m_id) {
  712. return(index);
  713. }
  714. }
  715. return(0);
  716. }
  717. private:
  718. /** Config for table that is being imported. */
  719. row_import* m_cfg;
  720. /** Current index whose pages are being imported */
  721. row_index_t* m_index;
  722. /** Alias for m_page_zip, only set for compressed pages. */
  723. page_zip_des_t* m_page_zip_ptr;
  724. /** Iterator over records in a block */
  725. RecIterator m_rec_iter;
  726. /** Record offset */
  727. ulint m_offsets_[REC_OFFS_NORMAL_SIZE];
  728. /** Pointer to m_offsets_ */
  729. ulint* m_offsets;
  730. /** Memory heap for the record offsets */
  731. mem_heap_t* m_heap;
  732. /** Cluster index instance */
  733. dict_index_t* m_cluster_index;
  734. };
  735. /**
  736. row_import destructor. */
  737. row_import::~row_import() UNIV_NOTHROW
  738. {
  739. for (ulint i = 0; m_indexes != 0 && i < m_n_indexes; ++i) {
  740. UT_DELETE_ARRAY(m_indexes[i].m_name);
  741. if (m_indexes[i].m_fields == NULL) {
  742. continue;
  743. }
  744. dict_field_t* fields = m_indexes[i].m_fields;
  745. ulint n_fields = m_indexes[i].m_n_fields;
  746. for (ulint j = 0; j < n_fields; ++j) {
  747. UT_DELETE_ARRAY(const_cast<char*>(fields[j].name()));
  748. }
  749. UT_DELETE_ARRAY(fields);
  750. }
  751. for (ulint i = 0; m_col_names != 0 && i < m_n_cols; ++i) {
  752. UT_DELETE_ARRAY(m_col_names[i]);
  753. }
  754. UT_DELETE_ARRAY(m_cols);
  755. UT_DELETE_ARRAY(m_indexes);
  756. UT_DELETE_ARRAY(m_col_names);
  757. UT_DELETE_ARRAY(m_table_name);
  758. UT_DELETE_ARRAY(m_hostname);
  759. }
  760. /** Find the index entry in the indexes array.
  761. @param name index name
  762. @return instance if found else 0. */
  763. row_index_t*
  764. row_import::get_index(
  765. const char* name) const UNIV_NOTHROW
  766. {
  767. for (ulint i = 0; i < m_n_indexes; ++i) {
  768. const char* index_name;
  769. row_index_t* index = &m_indexes[i];
  770. index_name = reinterpret_cast<const char*>(index->m_name);
  771. if (strcmp(index_name, name) == 0) {
  772. return(index);
  773. }
  774. }
  775. return(0);
  776. }
  777. /** Get the number of rows in the index.
  778. @param name index name
  779. @return number of rows (doesn't include delete marked rows). */
  780. ulint
  781. row_import::get_n_rows(
  782. const char* name) const UNIV_NOTHROW
  783. {
  784. const row_index_t* index = get_index(name);
  785. ut_a(index != 0);
  786. return(index->m_stats.m_n_rows);
  787. }
  788. /** Get the number of rows for which purge failed during the convert phase.
  789. @param name index name
  790. @return number of rows for which purge failed. */
  791. ulint
  792. row_import::get_n_purge_failed(
  793. const char* name) const UNIV_NOTHROW
  794. {
  795. const row_index_t* index = get_index(name);
  796. ut_a(index != 0);
  797. return(index->m_stats.m_n_purge_failed);
  798. }
  799. /** Find the ordinal value of the column name in the cfg table columns.
  800. @param name of column to look for.
  801. @return ULINT_UNDEFINED if not found. */
  802. ulint
  803. row_import::find_col(
  804. const char* name) const UNIV_NOTHROW
  805. {
  806. for (ulint i = 0; i < m_n_cols; ++i) {
  807. const char* col_name;
  808. col_name = reinterpret_cast<const char*>(m_col_names[i]);
  809. if (strcmp(col_name, name) == 0) {
  810. return(i);
  811. }
  812. }
  813. return(ULINT_UNDEFINED);
  814. }
  815. /**
  816. Check if the index schema that was read from the .cfg file matches the
  817. in memory index definition.
  818. @return DB_SUCCESS or error code. */
  819. dberr_t
  820. row_import::match_index_columns(
  821. THD* thd,
  822. const dict_index_t* index) UNIV_NOTHROW
  823. {
  824. row_index_t* cfg_index;
  825. dberr_t err = DB_SUCCESS;
  826. cfg_index = get_index(index->name);
  827. if (cfg_index == 0) {
  828. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  829. ER_TABLE_SCHEMA_MISMATCH,
  830. "Index %s not found in tablespace meta-data file.",
  831. index->name());
  832. return(DB_ERROR);
  833. }
  834. if (cfg_index->m_n_fields != index->n_fields) {
  835. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  836. ER_TABLE_SCHEMA_MISMATCH,
  837. "Index field count %u doesn't match"
  838. " tablespace metadata file value " ULINTPF,
  839. index->n_fields, cfg_index->m_n_fields);
  840. return(DB_ERROR);
  841. }
  842. cfg_index->m_srv_index = index;
  843. const dict_field_t* field = index->fields;
  844. const dict_field_t* cfg_field = cfg_index->m_fields;
  845. for (ulint i = 0; i < index->n_fields; ++i, ++field, ++cfg_field) {
  846. if (strcmp(field->name(), cfg_field->name()) != 0) {
  847. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  848. ER_TABLE_SCHEMA_MISMATCH,
  849. "Index field name %s doesn't match"
  850. " tablespace metadata field name %s"
  851. " for field position " ULINTPF,
  852. field->name(), cfg_field->name(), i);
  853. err = DB_ERROR;
  854. }
  855. if (cfg_field->prefix_len != field->prefix_len) {
  856. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  857. ER_TABLE_SCHEMA_MISMATCH,
  858. "Index %s field %s prefix len %u"
  859. " doesn't match metadata file value %u",
  860. index->name(), field->name(),
  861. field->prefix_len, cfg_field->prefix_len);
  862. err = DB_ERROR;
  863. }
  864. if (cfg_field->fixed_len != field->fixed_len) {
  865. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  866. ER_TABLE_SCHEMA_MISMATCH,
  867. "Index %s field %s fixed len %u"
  868. " doesn't match metadata file value %u",
  869. index->name(), field->name(),
  870. field->fixed_len,
  871. cfg_field->fixed_len);
  872. err = DB_ERROR;
  873. }
  874. }
  875. return(err);
  876. }
  877. /** Check if the table schema that was read from the .cfg file matches the
  878. in memory table definition.
  879. @param thd MySQL session variable
  880. @return DB_SUCCESS or error code. */
  881. dberr_t
  882. row_import::match_table_columns(
  883. THD* thd) UNIV_NOTHROW
  884. {
  885. dberr_t err = DB_SUCCESS;
  886. const dict_col_t* col = m_table->cols;
  887. for (ulint i = 0; i < m_table->n_cols; ++i, ++col) {
  888. const char* col_name;
  889. ulint cfg_col_index;
  890. col_name = dict_table_get_col_name(
  891. m_table, dict_col_get_no(col));
  892. cfg_col_index = find_col(col_name);
  893. if (cfg_col_index == ULINT_UNDEFINED) {
  894. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  895. ER_TABLE_SCHEMA_MISMATCH,
  896. "Column %s not found in tablespace.",
  897. col_name);
  898. err = DB_ERROR;
  899. } else if (cfg_col_index != col->ind) {
  900. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  901. ER_TABLE_SCHEMA_MISMATCH,
  902. "Column %s ordinal value mismatch, it's at %u"
  903. " in the table and " ULINTPF
  904. " in the tablespace meta-data file",
  905. col_name, col->ind, cfg_col_index);
  906. err = DB_ERROR;
  907. } else {
  908. const dict_col_t* cfg_col;
  909. cfg_col = &m_cols[cfg_col_index];
  910. ut_a(cfg_col->ind == cfg_col_index);
  911. if (cfg_col->prtype != col->prtype) {
  912. ib_errf(thd,
  913. IB_LOG_LEVEL_ERROR,
  914. ER_TABLE_SCHEMA_MISMATCH,
  915. "Column %s precise type mismatch.",
  916. col_name);
  917. err = DB_ERROR;
  918. }
  919. if (cfg_col->mtype != col->mtype) {
  920. ib_errf(thd,
  921. IB_LOG_LEVEL_ERROR,
  922. ER_TABLE_SCHEMA_MISMATCH,
  923. "Column %s main type mismatch.",
  924. col_name);
  925. err = DB_ERROR;
  926. }
  927. if (cfg_col->len != col->len) {
  928. ib_errf(thd,
  929. IB_LOG_LEVEL_ERROR,
  930. ER_TABLE_SCHEMA_MISMATCH,
  931. "Column %s length mismatch.",
  932. col_name);
  933. err = DB_ERROR;
  934. }
  935. if (cfg_col->mbminlen != col->mbminlen
  936. || cfg_col->mbmaxlen != col->mbmaxlen) {
  937. ib_errf(thd,
  938. IB_LOG_LEVEL_ERROR,
  939. ER_TABLE_SCHEMA_MISMATCH,
  940. "Column %s multi-byte len mismatch.",
  941. col_name);
  942. err = DB_ERROR;
  943. }
  944. if (cfg_col->ind != col->ind) {
  945. err = DB_ERROR;
  946. }
  947. if (cfg_col->ord_part != col->ord_part) {
  948. ib_errf(thd,
  949. IB_LOG_LEVEL_ERROR,
  950. ER_TABLE_SCHEMA_MISMATCH,
  951. "Column %s ordering mismatch.",
  952. col_name);
  953. err = DB_ERROR;
  954. }
  955. if (cfg_col->max_prefix != col->max_prefix) {
  956. ib_errf(thd,
  957. IB_LOG_LEVEL_ERROR,
  958. ER_TABLE_SCHEMA_MISMATCH,
  959. "Column %s max prefix mismatch.",
  960. col_name);
  961. err = DB_ERROR;
  962. }
  963. }
  964. }
  965. return(err);
  966. }
  967. /** Check if the table (and index) schema that was read from the .cfg file
  968. matches the in memory table definition.
  969. @param thd MySQL session variable
  970. @return DB_SUCCESS or error code. */
  971. dberr_t
  972. row_import::match_schema(
  973. THD* thd) UNIV_NOTHROW
  974. {
  975. /* Do some simple checks. */
  976. if (ulint mismatch = (m_table->flags ^ m_flags)
  977. & ~DICT_TF_MASK_DATA_DIR) {
  978. const char* msg;
  979. if (mismatch & DICT_TF_MASK_ZIP_SSIZE) {
  980. if ((m_table->flags & DICT_TF_MASK_ZIP_SSIZE)
  981. && (m_flags & DICT_TF_MASK_ZIP_SSIZE)) {
  982. switch (m_flags & DICT_TF_MASK_ZIP_SSIZE) {
  983. case 0U << DICT_TF_POS_ZIP_SSIZE:
  984. goto uncompressed;
  985. case 1U << DICT_TF_POS_ZIP_SSIZE:
  986. msg = "ROW_FORMAT=COMPRESSED"
  987. " KEY_BLOCK_SIZE=1";
  988. break;
  989. case 2U << DICT_TF_POS_ZIP_SSIZE:
  990. msg = "ROW_FORMAT=COMPRESSED"
  991. " KEY_BLOCK_SIZE=2";
  992. break;
  993. case 3U << DICT_TF_POS_ZIP_SSIZE:
  994. msg = "ROW_FORMAT=COMPRESSED"
  995. " KEY_BLOCK_SIZE=4";
  996. break;
  997. case 4U << DICT_TF_POS_ZIP_SSIZE:
  998. msg = "ROW_FORMAT=COMPRESSED"
  999. " KEY_BLOCK_SIZE=8";
  1000. break;
  1001. case 5U << DICT_TF_POS_ZIP_SSIZE:
  1002. msg = "ROW_FORMAT=COMPRESSED"
  1003. " KEY_BLOCK_SIZE=16";
  1004. break;
  1005. default:
  1006. msg = "strange KEY_BLOCK_SIZE";
  1007. }
  1008. } else if (m_flags & DICT_TF_MASK_ZIP_SSIZE) {
  1009. msg = "ROW_FORMAT=COMPRESSED";
  1010. } else {
  1011. goto uncompressed;
  1012. }
  1013. } else {
  1014. uncompressed:
  1015. msg = (m_flags & DICT_TF_MASK_ATOMIC_BLOBS)
  1016. ? "ROW_FORMAT=DYNAMIC"
  1017. : (m_flags & DICT_TF_MASK_COMPACT)
  1018. ? "ROW_FORMAT=COMPACT"
  1019. : "ROW_FORMAT=REDUNDANT";
  1020. }
  1021. ib_errf(thd, IB_LOG_LEVEL_ERROR, ER_TABLE_SCHEMA_MISMATCH,
  1022. "Table flags don't match, server table has 0x%x"
  1023. " and the meta-data file has 0x" ULINTPFx ";"
  1024. " .cfg file uses %s",
  1025. m_table->flags, m_flags, msg);
  1026. return(DB_ERROR);
  1027. } else if (m_table->n_cols != m_n_cols) {
  1028. ib_errf(thd, IB_LOG_LEVEL_ERROR, ER_TABLE_SCHEMA_MISMATCH,
  1029. "Number of columns don't match, table has %u "
  1030. "columns but the tablespace meta-data file has "
  1031. ULINTPF " columns",
  1032. m_table->n_cols, m_n_cols);
  1033. return(DB_ERROR);
  1034. } else if (UT_LIST_GET_LEN(m_table->indexes) != m_n_indexes) {
  1035. /* If the number of indexes doesn't match then it is better
  1036. to abort the IMPORT. It is easy for the user to create a
  1037. table matching the IMPORT definition. */
  1038. ib_errf(thd, IB_LOG_LEVEL_ERROR, ER_TABLE_SCHEMA_MISMATCH,
  1039. "Number of indexes don't match, table has " ULINTPF
  1040. " indexes but the tablespace meta-data file has "
  1041. ULINTPF " indexes",
  1042. UT_LIST_GET_LEN(m_table->indexes), m_n_indexes);
  1043. return(DB_ERROR);
  1044. }
  1045. dberr_t err = match_table_columns(thd);
  1046. if (err != DB_SUCCESS) {
  1047. return(err);
  1048. }
  1049. /* Check if the index definitions match. */
  1050. const dict_index_t* index;
  1051. for (index = UT_LIST_GET_FIRST(m_table->indexes);
  1052. index != 0;
  1053. index = UT_LIST_GET_NEXT(indexes, index)) {
  1054. dberr_t index_err;
  1055. index_err = match_index_columns(thd, index);
  1056. if (index_err != DB_SUCCESS) {
  1057. err = index_err;
  1058. }
  1059. }
  1060. return(err);
  1061. }
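/* Illustrative sketch, not part of row0import.cc: the switch in
match_schema() above maps the compressed-page size code 1..5 to
KEY_BLOCK_SIZE 1, 2, 4, 8 and 16, i.e. KEY_BLOCK_SIZE = 2^(code - 1).
The constants POS_ZIP_SSIZE and WIDTH_ZIP_SSIZE and the helper names
below are hypothetical stand-ins chosen for this example; the real
definitions live in the InnoDB dictionary headers and may differ. */

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit position and width of the size code inside the table flags.
// These values are assumptions made for this sketch only.
constexpr unsigned POS_ZIP_SSIZE = 1;
constexpr unsigned WIDTH_ZIP_SSIZE = 4;
constexpr uint32_t MASK_ZIP_SSIZE =
    ((1U << WIDTH_ZIP_SSIZE) - 1) << POS_ZIP_SSIZE;

// Place a size code into an otherwise empty flags word.
inline uint32_t encode_zip_ssize(unsigned ssize)
{
    assert(ssize < (1U << WIDTH_ZIP_SSIZE));
    return uint32_t(ssize) << POS_ZIP_SSIZE;
}

// Return KEY_BLOCK_SIZE in KiB, or 0 when the flags describe an
// uncompressed table or carry a size code outside 1..5.
inline unsigned key_block_size_kib(uint32_t flags)
{
    const unsigned ssize = (flags & MASK_ZIP_SSIZE) >> POS_ZIP_SSIZE;
    return (ssize >= 1 && ssize <= 5) ? 1U << (ssize - 1) : 0;
}
```

/* With these assumptions, key_block_size_kib(encode_zip_ssize(3))
yields 4, matching the "KEY_BLOCK_SIZE=4" branch of the switch. */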
  1062. /**
  1063. Set the index root <space, pageno>, using index name. */
  1064. void
  1065. row_import::set_root_by_name() UNIV_NOTHROW
  1066. {
  1067. row_index_t* cfg_index = m_indexes;
  1068. for (ulint i = 0; i < m_n_indexes; ++i, ++cfg_index) {
  1069. dict_index_t* index;
  1070. const char* index_name;
  1071. index_name = reinterpret_cast<const char*>(cfg_index->m_name);
  1072. index = dict_table_get_index_on_name(m_table, index_name);
  1073. /* We've already checked that it exists. */
  1074. ut_a(index != 0);
  1075. index->page = cfg_index->m_page_no;
  1076. }
  1077. }
  1078. /**
  1079. Set the index root <space, pageno>, using a heuristic.
  1080. @return DB_SUCCESS or error code */
  1081. dberr_t
  1082. row_import::set_root_by_heuristic() UNIV_NOTHROW
  1083. {
  1084. row_index_t* cfg_index = m_indexes;
  1085. ut_a(m_n_indexes > 0);
  1086. // TODO: For now use brute force, based on ordinality
  1087. if (UT_LIST_GET_LEN(m_table->indexes) != m_n_indexes) {
  1088. ib::warn() << "Table " << m_table->name << " should have "
  1089. << UT_LIST_GET_LEN(m_table->indexes) << " indexes but"
  1090. " the tablespace has " << m_n_indexes << " indexes";
  1091. }
  1092. dict_mutex_enter_for_mysql();
  1093. ulint i = 0;
  1094. dberr_t err = DB_SUCCESS;
  1095. for (dict_index_t* index = UT_LIST_GET_FIRST(m_table->indexes);
  1096. index != 0;
  1097. index = UT_LIST_GET_NEXT(indexes, index)) {
  1098. if (index->type & DICT_FTS) {
  1099. index->type |= DICT_CORRUPT;
  1100. ib::warn() << "Skipping FTS index: " << index->name;
  1101. } else if (i < m_n_indexes) {
  1102. UT_DELETE_ARRAY(cfg_index[i].m_name);
  1103. ulint len = strlen(index->name) + 1;
  1104. cfg_index[i].m_name = UT_NEW_ARRAY_NOKEY(byte, len);
  1105. /* Trigger OOM */
  1106. DBUG_EXECUTE_IF(
  1107. "ib_import_OOM_14",
  1108. UT_DELETE_ARRAY(cfg_index[i].m_name);
  1109. cfg_index[i].m_name = NULL;
  1110. );
  1111. if (cfg_index[i].m_name == NULL) {
  1112. err = DB_OUT_OF_MEMORY;
  1113. break;
  1114. }
  1115. memcpy(cfg_index[i].m_name, index->name, len);
  1116. cfg_index[i].m_srv_index = index;
  1117. index->page = cfg_index[i].m_page_no;
  1118. ++i;
  1119. }
  1120. }
  1121. dict_mutex_exit_for_mysql();
  1122. return(err);
  1123. }
  1124. /**
  1125. Purge delete-marked records.
  1126. @return DB_SUCCESS or error code. */
  1127. dberr_t
  1128. IndexPurge::garbage_collect() UNIV_NOTHROW
  1129. {
  1130. dberr_t err;
  1131. ibool comp = dict_table_is_comp(m_index->table);
  1132. /* Open the persistent cursor and start the mini-transaction. */
  1133. open();
  1134. while ((err = next()) == DB_SUCCESS) {
  1135. rec_t* rec = btr_pcur_get_rec(&m_pcur);
  1136. ibool deleted = rec_get_deleted_flag(rec, comp);
  1137. if (!deleted) {
  1138. ++m_n_rows;
  1139. } else {
  1140. purge();
  1141. }
  1142. }
  1143. /* Close the persistent cursor and commit the mini-transaction. */
  1144. close();
  1145. return(err == DB_END_OF_INDEX ? DB_SUCCESS : err);
  1146. }
  1147. /**
  1148. Begin import, position the cursor on the first record. */
  1149. void
  1150. IndexPurge::open() UNIV_NOTHROW
  1151. {
  1152. mtr_start(&m_mtr);
  1153. mtr_set_log_mode(&m_mtr, MTR_LOG_NO_REDO);
  1154. btr_pcur_open_at_index_side(
  1155. true, m_index, BTR_MODIFY_LEAF, &m_pcur, true, 0, &m_mtr);
  1156. btr_pcur_move_to_next_user_rec(&m_pcur, &m_mtr);
  1157. if (rec_is_metadata(btr_pcur_get_rec(&m_pcur), *m_index)) {
  1158. ut_ad(btr_pcur_is_on_user_rec(&m_pcur));
  1159. /* Skip the metadata pseudo-record. */
  1160. } else {
  1161. btr_pcur_move_to_prev_on_page(&m_pcur);
  1162. }
  1163. }
  1164. /**
  1165. Close the persistent cursor and commit the mini-transaction. */
  1166. void
  1167. IndexPurge::close() UNIV_NOTHROW
  1168. {
  1169. btr_pcur_close(&m_pcur);
  1170. mtr_commit(&m_mtr);
  1171. }
  1172. /**
  1173. Position the cursor on the next record.
  1174. @return DB_SUCCESS or error code */
  1175. dberr_t
  1176. IndexPurge::next() UNIV_NOTHROW
  1177. {
  1178. btr_pcur_move_to_next_on_page(&m_pcur);
  1179. /* When switching pages, commit the mini-transaction
  1180. in order to release the latch on the old page. */
  1181. if (!btr_pcur_is_after_last_on_page(&m_pcur)) {
  1182. return(DB_SUCCESS);
  1183. } else if (trx_is_interrupted(m_trx)) {
  1184. /* Check after every page because the check
  1185. is expensive. */
  1186. return(DB_INTERRUPTED);
  1187. }
  1188. btr_pcur_store_position(&m_pcur, &m_mtr);
  1189. mtr_commit(&m_mtr);
  1190. mtr_start(&m_mtr);
  1191. mtr_set_log_mode(&m_mtr, MTR_LOG_NO_REDO);
  1192. btr_pcur_restore_position(BTR_MODIFY_LEAF, &m_pcur, &m_mtr);
  1193. if (!btr_pcur_move_to_next_user_rec(&m_pcur, &m_mtr)) {
  1194. return(DB_END_OF_INDEX);
  1195. }
  1196. return(DB_SUCCESS);
  1197. }
  1198. /**
  1199. Store the persistent cursor position and reopen the
  1200. B-tree cursor in BTR_MODIFY_TREE mode, because the
  1201. tree structure may be changed during a pessimistic delete. */
  1202. void
  1203. IndexPurge::purge_pessimistic_delete() UNIV_NOTHROW
  1204. {
  1205. dberr_t err;
  1206. btr_pcur_restore_position(BTR_MODIFY_TREE | BTR_LATCH_FOR_DELETE,
  1207. &m_pcur, &m_mtr);
  1208. ut_ad(rec_get_deleted_flag(
  1209. btr_pcur_get_rec(&m_pcur),
  1210. dict_table_is_comp(m_index->table)));
  1211. btr_cur_pessimistic_delete(
  1212. &err, FALSE, btr_pcur_get_btr_cur(&m_pcur), 0, false, &m_mtr);
  1213. ut_a(err == DB_SUCCESS);
  1214. /* Commit; the caller reopens the B-tree cursor in BTR_MODIFY_LEAF mode */
  1215. mtr_commit(&m_mtr);
  1216. }
  1217. /**
  1218. Purge delete-marked records. */
  1219. void
  1220. IndexPurge::purge() UNIV_NOTHROW
  1221. {
  1222. btr_pcur_store_position(&m_pcur, &m_mtr);
  1223. purge_pessimistic_delete();
  1224. mtr_start(&m_mtr);
  1225. mtr_set_log_mode(&m_mtr, MTR_LOG_NO_REDO);
  1226. btr_pcur_restore_position(BTR_MODIFY_LEAF, &m_pcur, &m_mtr);
  1227. }
  1228. /** Adjust the BLOB reference for a single column that is externally stored
  1229. @param rec record to update
  1230. @param offsets column offsets for the record
  1231. @param i column ordinal value
  1232. @return DB_SUCCESS or error code */
  1233. inline
  1234. dberr_t
  1235. PageConverter::adjust_cluster_index_blob_column(
  1236. rec_t* rec,
  1237. const ulint* offsets,
  1238. ulint i) UNIV_NOTHROW
  1239. {
  1240. ulint len;
  1241. byte* field;
  1242. field = rec_get_nth_field(rec, offsets, i, &len);
  1243. DBUG_EXECUTE_IF("ib_import_trigger_corruption_2",
  1244. len = BTR_EXTERN_FIELD_REF_SIZE - 1;);
  1245. if (len < BTR_EXTERN_FIELD_REF_SIZE) {
  1246. ib_errf(m_trx->mysql_thd, IB_LOG_LEVEL_ERROR,
  1247. ER_INNODB_INDEX_CORRUPT,
  1248. "Externally stored column(" ULINTPF
  1249. ") has a reference length of " ULINTPF
  1250. " in the cluster index %s",
  1251. i, len, m_cluster_index->name());
  1252. return(DB_CORRUPTION);
  1253. }
  1254. field += len - (BTR_EXTERN_FIELD_REF_SIZE - BTR_EXTERN_SPACE_ID);
  1255. mach_write_to_4(field, get_space_id());
  1256. if (m_page_zip_ptr) {
  1257. page_zip_write_blob_ptr(
  1258. m_page_zip_ptr, rec, m_cluster_index, offsets, i, 0);
  1259. }
  1260. return(DB_SUCCESS);
  1261. }
  1262. /** Adjusts the BLOB reference in the clustered index row for all externally
  1263. stored columns.
  1264. @param rec record to update
  1265. @param offsets column offsets for the record
  1266. @return DB_SUCCESS or error code */
  1267. inline
  1268. dberr_t
  1269. PageConverter::adjust_cluster_index_blob_columns(
  1270. rec_t* rec,
  1271. const ulint* offsets) UNIV_NOTHROW
  1272. {
  1273. ut_ad(rec_offs_any_extern(offsets));
  1274. /* Adjust the space_id in the BLOB pointers. */
  1275. for (ulint i = 0; i < rec_offs_n_fields(offsets); ++i) {
  1276. /* Only if the column is stored "externally". */
  1277. if (rec_offs_nth_extern(offsets, i)) {
  1278. dberr_t err;
  1279. err = adjust_cluster_index_blob_column(rec, offsets, i);
  1280. if (err != DB_SUCCESS) {
  1281. return(err);
  1282. }
  1283. }
  1284. }
  1285. return(DB_SUCCESS);
  1286. }
  1287. /** In the clustered index, adjust BLOB pointers as needed. Also update the
  1288. BLOB reference, write the new space id.
  1289. @param rec record to update
  1290. @param offsets column offsets for the record
  1291. @return DB_SUCCESS or error code */
  1292. inline
  1293. dberr_t
  1294. PageConverter::adjust_cluster_index_blob_ref(
  1295. rec_t* rec,
  1296. const ulint* offsets) UNIV_NOTHROW
  1297. {
  1298. if (rec_offs_any_extern(offsets)) {
  1299. dberr_t err;
  1300. err = adjust_cluster_index_blob_columns(rec, offsets);
  1301. if (err != DB_SUCCESS) {
  1302. return(err);
  1303. }
  1304. }
  1305. return(DB_SUCCESS);
  1306. }
  1307. /** Purge delete-marked records, only if it is possible to do so without
  1308. re-organising the B+tree.
  1309. @return true if purge succeeded */
  1310. inline bool PageConverter::purge() UNIV_NOTHROW
  1311. {
  1312. const dict_index_t* index = m_index->m_srv_index;
  1313. /* We can't have a page that is empty and not root. */
  1314. if (m_rec_iter.remove(index, m_page_zip_ptr, m_offsets)) {
  1315. ++m_index->m_stats.m_n_purged;
  1316. return(true);
  1317. } else {
  1318. ++m_index->m_stats.m_n_purge_failed;
  1319. }
  1320. return(false);
  1321. }
  1322. /** Adjust the BLOB references and sys fields for the current record.
  1323. @param rec record to update
  1324. @param offsets column offsets for the record
  1325. @return DB_SUCCESS or error code. */
  1326. inline
  1327. dberr_t
  1328. PageConverter::adjust_cluster_record(
  1329. rec_t* rec,
  1330. const ulint* offsets) UNIV_NOTHROW
  1331. {
  1332. dberr_t err;
  1333. if ((err = adjust_cluster_index_blob_ref(rec, offsets)) == DB_SUCCESS) {
  1334. /* Reset DB_TRX_ID and DB_ROLL_PTR. Normally, these fields
  1335. are only written in conjunction with other changes to the
  1336. record. */
  1337. ulint trx_id_pos = m_cluster_index->n_uniq
  1338. ? m_cluster_index->n_uniq : 1;
  1339. if (m_page_zip_ptr) {
  1340. page_zip_write_trx_id_and_roll_ptr(
  1341. m_page_zip_ptr, rec, m_offsets, trx_id_pos,
  1342. 0, roll_ptr_t(1) << ROLL_PTR_INSERT_FLAG_POS,
  1343. NULL);
  1344. } else {
  1345. ulint len;
  1346. byte* ptr = rec_get_nth_field(
  1347. rec, m_offsets, trx_id_pos, &len);
  1348. ut_ad(len == DATA_TRX_ID_LEN);
  1349. memcpy(ptr, reset_trx_id, sizeof reset_trx_id);
  1350. }
  1351. }
  1352. return(err);
  1353. }
  1354. /** Update the BLOB references and write UNDO log entries for
  1355. rows that can't be purged optimistically.
  1356. @param block block to update
  1357. @retval DB_SUCCESS or error code */
  1358. inline
  1359. dberr_t
  1360. PageConverter::update_records(
  1361. buf_block_t* block) UNIV_NOTHROW
  1362. {
  1363. ibool comp = dict_table_is_comp(m_cfg->m_table);
  1364. bool clust_index = m_index->m_srv_index == m_cluster_index;
  1365. /* This will also position the cursor on the first user record. */
  1366. m_rec_iter.open(block);
  1367. while (!m_rec_iter.end()) {
  1368. rec_t* rec = m_rec_iter.current();
  1369. ibool deleted = rec_get_deleted_flag(rec, comp);
  1370. /* For the clustered index we have to adjust the BLOB
  1371. reference and the system fields irrespective of the
  1372. delete marked flag. The adjustment of delete marked
  1373. cluster records is required for purge to work later. */
  1374. if (deleted || clust_index) {
  1375. m_offsets = rec_get_offsets(
  1376. rec, m_index->m_srv_index, m_offsets, true,
  1377. ULINT_UNDEFINED, &m_heap);
  1378. }
  1379. if (clust_index) {
  1380. dberr_t err = adjust_cluster_record(rec, m_offsets);
  1381. if (err != DB_SUCCESS) {
  1382. return(err);
  1383. }
  1384. }
  1385. /* If it is a delete marked record then try an
  1386. optimistic delete. */
  1387. if (deleted) {
  1388. /* A successful purge will move the cursor to the
  1389. next record. */
  1390. if (!purge()) {
  1391. m_rec_iter.next();
  1392. }
  1393. ++m_index->m_stats.m_n_deleted;
  1394. } else {
  1395. ++m_index->m_stats.m_n_rows;
  1396. m_rec_iter.next();
  1397. }
  1398. }
  1399. return(DB_SUCCESS);
  1400. }
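/* Illustrative sketch, not part of row0import.cc: the loop in
update_records() above advances the cursor explicitly only when a
record is kept, because a successful purge already leaves the cursor
on the next record. The same shape can be shown with a std::vector,
whose erase() returns an iterator to the following element. Rec,
Stats and scan_page() are hypothetical names used only here; the
"removable" flag stands in for whether the optimistic delete can
succeed. */

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Rec {
    int key;
    bool deleted;    // delete-marked
    bool removable;  // assumed: optimistic removal would succeed
};

struct Stats {
    std::size_t n_rows = 0;
    std::size_t n_deleted = 0;
    std::size_t n_purged = 0;
    std::size_t n_purge_failed = 0;
};

inline Stats scan_page(std::vector<Rec>& page)
{
    Stats stats;
    auto it = page.begin();
    while (it != page.end()) {
        if (!it->deleted) {
            ++stats.n_rows;
            ++it;
            continue;
        }
        ++stats.n_deleted;
        if (it->removable) {
            // erase() already positions us on the next record.
            it = page.erase(it);
            ++stats.n_purged;
        } else {
            ++stats.n_purge_failed;
            ++it;
        }
    }
    // Every delete-marked record was either purged or left behind.
    assert(stats.n_deleted == stats.n_purged + stats.n_purge_failed);
    return stats;
}
```

/* Records counted in n_purge_failed correspond to the ones that the
second pass (IndexPurge) has to remove the hard way. */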
  1401. /** Update the space, index id, trx id.
  1402. @return DB_SUCCESS or error code */
  1403. inline
  1404. dberr_t
  1405. PageConverter::update_index_page(
  1406. buf_block_t* block) UNIV_NOTHROW
  1407. {
  1408. index_id_t id;
  1409. buf_frame_t* page = block->frame;
  1410. if (is_free(block->page.id.page_no())) {
  1411. return(DB_SUCCESS);
  1412. } else if ((id = btr_page_get_index_id(page)) != m_index->m_id) {
  1413. row_index_t* index = find_index(id);
  1414. if (index == 0) {
  1415. ib::error() << "Page for tablespace " << m_space
  1416. << " is index page with id " << id
  1417. << " but that index is not found from"
  1418. << " configuration file. Current index name "
  1419. << m_index->m_name << " and id " << m_index->m_id;
  1420. m_index = 0;
  1421. return(DB_CORRUPTION);
  1422. }
  1423. /* Update current index */
  1424. m_index = index;
  1425. }
  1426. /* If the .cfg file is missing and there is an index mismatch
  1427. then ignore the error. */
  1428. if (m_cfg->m_missing && (m_index == 0 || m_index->m_srv_index == 0)) {
  1429. return(DB_SUCCESS);
  1430. }
  1431. #ifdef UNIV_ZIP_DEBUG
  1432. ut_a(!is_compressed_table()
  1433. || page_zip_validate(m_page_zip_ptr, page, m_index->m_srv_index));
  1434. #endif /* UNIV_ZIP_DEBUG */
  1435. /* This has to be written to uncompressed index header. Set it to
  1436. the current index id. */
  1437. mach_write_to_8(page + (PAGE_HEADER + PAGE_INDEX_ID),
  1438. m_index->m_srv_index->id);
  1439. if (m_page_zip_ptr) {
  1440. memcpy(&m_page_zip_ptr->data[PAGE_HEADER + PAGE_INDEX_ID],
  1441. &block->frame[PAGE_HEADER + PAGE_INDEX_ID], 8);
  1442. }
  1443. if (m_index->m_srv_index->is_clust()) {
  1444. if (block->page.id.page_no() == m_index->m_srv_index->page) {
  1445. dict_index_t* index = const_cast<dict_index_t*>(
  1446. m_index->m_srv_index);
  1447. /* Preserve the PAGE_ROOT_AUTO_INC. */
  1448. if (index->table->supports_instant()) {
  1449. if (btr_cur_instant_root_init(index, page)) {
  1450. return(DB_CORRUPTION);
  1451. }
  1452. if (index->n_core_fields > index->n_fields) {
  1453. /* Some columns have been dropped.
  1454. Refuse to IMPORT TABLESPACE for now.
  1455. NOTE: This is not an accurate check.
  1456. Columns could have been both
  1457. added and dropped instantly.
  1458. For an accurate check, we must read
  1459. the metadata BLOB page pointed to
  1460. by the leftmost leaf page.
  1461. But we would have to read
  1462. those pages in a special way,
  1463. bypassing the buffer pool! */
  1464. return DB_UNSUPPORTED;
  1465. }
  1466. /* Provisionally set all instantly
  1467. added columns to be DEFAULT NULL. */
  1468. for (unsigned i = index->n_core_fields;
  1469. i < index->n_fields; i++) {
  1470. dict_col_t* col = index->fields[i].col;
  1471. col->def_val.len = UNIV_SQL_NULL;
  1472. col->def_val.data = NULL;
  1473. }
  1474. }
  1475. } else {
  1476. goto clear_page_max_trx_id;
  1477. }
  1478. } else if (page_is_leaf(page)) {
  1479. /* Set PAGE_MAX_TRX_ID on secondary index leaf pages. */
  1480. mach_write_to_8(&block->frame[PAGE_HEADER + PAGE_MAX_TRX_ID],
  1481. m_trx->id);
  1482. if (m_page_zip_ptr) {
  1483. memcpy(&m_page_zip_ptr
  1484. ->data[PAGE_HEADER + PAGE_MAX_TRX_ID],
  1485. &block->frame[PAGE_HEADER + PAGE_MAX_TRX_ID],
  1486. 8);
  1487. }
  1488. } else {
  1489. clear_page_max_trx_id:
  1490. /* Clear PAGE_MAX_TRX_ID so that it can be
  1491. used for other purposes in the future. IMPORT
  1492. in MySQL 5.6, 5.7 and MariaDB 10.0 and 10.1
  1493. would set the field to the transaction ID even
  1494. on clustered index pages. */
  1495. memset(&block->frame[PAGE_HEADER + PAGE_MAX_TRX_ID], 0, 8);
  1496. if (m_page_zip_ptr) {
  1497. memset(&m_page_zip_ptr
  1498. ->data[PAGE_HEADER + PAGE_MAX_TRX_ID], 0, 8);
  1499. }
  1500. }
  1501. if (page_is_empty(page)) {
  1502. /* Only a root page can be empty. */
  1503. if (page_has_siblings(page)) {
  1504. // TODO: We should relax this and skip secondary
  1505. // indexes. Mark them as corrupt because they can
  1506. // always be rebuilt.
  1507. return(DB_CORRUPTION);
  1508. }
  1509. return(DB_SUCCESS);
  1510. }
  1511. return page_is_leaf(block->frame) ? update_records(block) : DB_SUCCESS;
  1512. }
  1513. /** Validate the space flags and update tablespace header page.
  1514. @param block block read from file, not from the buffer pool.
  1515. @retval DB_SUCCESS or error code */
  1516. inline
  1517. dberr_t
  1518. PageConverter::update_header(
  1519. buf_block_t* block) UNIV_NOTHROW
  1520. {
  1521. /* Check for valid header */
  1522. switch (fsp_header_get_space_id(get_frame(block))) {
  1523. case 0:
  1524. return(DB_CORRUPTION);
  1525. case ULINT_UNDEFINED:
  1526. ib::warn() << "Space id check in the header failed: ignored";
  1527. }
  1528. memset(get_frame(block) + FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION, 0, 8);
  1529. /* Write back the adjusted flags. */
  1530. mach_write_to_4(FSP_HEADER_OFFSET + FSP_SPACE_FLAGS
  1531. + get_frame(block), m_space_flags);
  1532. /* Write space_id to the tablespace header, page 0. */
  1533. mach_write_to_4(
  1534. get_frame(block) + FSP_HEADER_OFFSET + FSP_SPACE_ID,
  1535. get_space_id());
  1536. /* This is on every page in the tablespace. */
  1537. mach_write_to_4(
  1538. get_frame(block) + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID,
  1539. get_space_id());
  1540. return(DB_SUCCESS);
  1541. }
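/* Illustrative sketch, not part of row0import.cc: update_header()
above stamps the new space id into the page with mach_write_to_4(),
which stores a 32-bit value with the most significant byte at the
lowest address. write_be32(), read_be32() and stamp_space_id() are
hypothetical stand-ins written for this example, and SPACE_ID_OFFSET
is an assumed offset used only for illustration. */

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Store v in big-endian order (most significant byte first).
inline void write_be32(uint8_t* dst, uint32_t v)
{
    assert(dst != nullptr);
    dst[0] = uint8_t(v >> 24);
    dst[1] = uint8_t(v >> 16);
    dst[2] = uint8_t(v >> 8);
    dst[3] = uint8_t(v);
}

// Read back a value stored by write_be32().
inline uint32_t read_be32(const uint8_t* src)
{
    return (uint32_t(src[0]) << 24) | (uint32_t(src[1]) << 16)
        | (uint32_t(src[2]) << 8) | uint32_t(src[3]);
}

// Assumed byte offset of the per-page space id field.
constexpr std::size_t SPACE_ID_OFFSET = 34;

// Overwrite the space id field of one page frame.
inline void stamp_space_id(uint8_t* frame, uint32_t space_id)
{
    write_be32(frame + SPACE_ID_OFFSET, space_id);
}
```

/* A fixed byte order keeps the field readable regardless of the
endianness of the host that wrote the page. */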
  1542. /** Update the page, set the space id, max trx id and index id.
  1543. @param block block read from file
  1544. @retval DB_SUCCESS or error code */
  1545. inline
  1546. dberr_t
  1547. PageConverter::update_page(buf_block_t* block, uint16_t& page_type)
  1548. UNIV_NOTHROW
  1549. {
  1550. dberr_t err = DB_SUCCESS;
  1551. ut_ad(!block->page.zip.data == !is_compressed_table());
  1552. if (block->page.zip.data) {
  1553. m_page_zip_ptr = &block->page.zip;
  1554. } else {
  1555. ut_ad(!m_page_zip_ptr);
  1556. }
  1557. switch (page_type = fil_page_get_type(get_frame(block))) {
  1558. case FIL_PAGE_TYPE_FSP_HDR:
  1559. ut_a(block->page.id.page_no() == 0);
  1560. /* Work directly on the uncompressed page headers. */
  1561. return(update_header(block));
  1562. case FIL_PAGE_INDEX:
  1563. case FIL_PAGE_RTREE:
  1564. /* We need to decompress the contents into block->frame
  1565. before we can do anything with Btree pages. */
  1566. if (is_compressed_table() && !buf_zip_decompress(block, TRUE)) {
  1567. return(DB_CORRUPTION);
  1568. }
  1569. /* fall through */
  1570. case FIL_PAGE_TYPE_INSTANT:
  1571. /* This is on every page in the tablespace. */
  1572. mach_write_to_4(
  1573. get_frame(block)
  1574. + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID, get_space_id());
  1575. /* Only update the Btree nodes. */
  1576. return(update_index_page(block));
  1577. case FIL_PAGE_TYPE_SYS:
  1578. /* This is page 0 in the system tablespace. */
  1579. return(DB_CORRUPTION);
  1580. case FIL_PAGE_TYPE_XDES:
  1581. err = set_current_xdes(
  1582. block->page.id.page_no(), get_frame(block));
  1583. /* fall through */
  1584. case FIL_PAGE_INODE:
  1585. case FIL_PAGE_TYPE_TRX_SYS:
  1586. case FIL_PAGE_IBUF_FREE_LIST:
  1587. case FIL_PAGE_TYPE_ALLOCATED:
  1588. case FIL_PAGE_IBUF_BITMAP:
  1589. case FIL_PAGE_TYPE_BLOB:
  1590. case FIL_PAGE_TYPE_ZBLOB:
  1591. case FIL_PAGE_TYPE_ZBLOB2:
  1592. /* Work directly on the uncompressed page headers. */
  1593. /* This is on every page in the tablespace. */
  1594. mach_write_to_4(
  1595. get_frame(block)
  1596. + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID, get_space_id());
  1597. return(err);
  1598. }
  1599. ib::warn() << "Unknown page type (" << page_type << ")";
  1600. return(DB_CORRUPTION);
  1601. }
  1602. /** Called for every page in the tablespace. If the page was not
  1603. updated then its state must be set to BUF_PAGE_NOT_USED.
  1604. @param block block read from file, note it is not from the buffer pool
  1605. @retval DB_SUCCESS or error code. */
  1606. dberr_t PageConverter::operator()(buf_block_t* block) UNIV_NOTHROW
  1607. {
  1608. /* If we already had an old page with matching number
  1609. in the buffer pool, evict it now, because
  1610. we no longer evict the pages on DISCARD TABLESPACE. */
  1611. buf_page_get_gen(block->page.id, get_zip_size(),
  1612. RW_NO_LATCH, NULL, BUF_EVICT_IF_IN_POOL,
  1613. __FILE__, __LINE__, NULL, NULL);
  1614. uint16_t page_type;
  1615. if (dberr_t err = update_page(block, page_type)) {
  1616. return err;
  1617. }
  1618. const bool full_crc32 = fil_space_t::full_crc32(get_space_flags());
  1619. byte* frame = get_frame(block);
  1620. compile_time_assert(FIL_PAGE_LSN % 8 == 0);
  1621. *reinterpret_cast<uint64_t*>(frame + FIL_PAGE_LSN)= 0;
  1622. if (!block->page.zip.data) {
  1623. buf_flush_init_for_writing(
  1624. NULL, block->frame, NULL, full_crc32);
  1625. } else if (fil_page_type_is_index(page_type)) {
  1626. buf_flush_init_for_writing(
  1627. NULL, block->page.zip.data, &block->page.zip,
  1628. full_crc32);
  1629. } else {
  1630. /* Calculate and update the checksum of non-index
  1631. pages for ROW_FORMAT=COMPRESSED tables. */
  1632. buf_flush_update_zip_checksum(
  1633. block->page.zip.data, block->zip_size());
  1634. }
  1635. return DB_SUCCESS;
  1636. }
  1637. /*****************************************************************//**
  1638. Clean up after import tablespace failure. This function will acquire
  1639. the dictionary latches on behalf of the transaction if the transaction
  1640. hasn't already acquired them. */
  1641. static MY_ATTRIBUTE((nonnull))
  1642. void
  1643. row_import_discard_changes(
  1644. /*=======================*/
  1645. row_prebuilt_t* prebuilt, /*!< in/out: prebuilt from handler */
  1646. trx_t* trx, /*!< in/out: transaction for import */
  1647. dberr_t err) /*!< in: error code */
  1648. {
  1649. dict_table_t* table = prebuilt->table;
  1650. ut_a(err != DB_SUCCESS);
  1651. prebuilt->trx->error_info = NULL;
  1652. ib::info() << "Discarding tablespace of table "
  1653. << prebuilt->table->name
  1654. << ": " << ut_strerr(err);
  1655. if (trx->dict_operation_lock_mode != RW_X_LATCH) {
  1656. ut_a(trx->dict_operation_lock_mode == 0);
  1657. row_mysql_lock_data_dictionary(trx);
  1658. }
  1659. ut_a(trx->dict_operation_lock_mode == RW_X_LATCH);
  1660. /* Since we update the index root page numbers on disk only after
  1661. a successful import, the table will not be loadable.
  1662. However, we need to ensure that the in memory root page numbers
  1663. are reset to "NULL". */
  1664. for (dict_index_t* index = UT_LIST_GET_FIRST(table->indexes);
  1665. index != 0;
  1666. index = UT_LIST_GET_NEXT(indexes, index)) {
  1667. index->page = FIL_NULL;
  1668. }
  1669. table->file_unreadable = true;
  1670. if (table->space) {
  1671. fil_close_tablespace(trx, table->space_id);
  1672. table->space = NULL;
  1673. }
  1674. }
  1675. /*****************************************************************//**
  1676. Clean up after import tablespace. */
  1677. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1678. dberr_t
  1679. row_import_cleanup(
  1680. /*===============*/
  1681. row_prebuilt_t* prebuilt, /*!< in/out: prebuilt from handler */
  1682. trx_t* trx, /*!< in/out: transaction for import */
  1683. dberr_t err) /*!< in: error code */
  1684. {
  1685. ut_a(prebuilt->trx != trx);
  1686. if (err != DB_SUCCESS) {
  1687. row_import_discard_changes(prebuilt, trx, err);
  1688. }
  1689. ut_a(trx->dict_operation_lock_mode == RW_X_LATCH);
  1690. DBUG_EXECUTE_IF("ib_import_before_commit_crash", DBUG_SUICIDE(););
  1691. trx_commit_for_mysql(trx);
  1692. row_mysql_unlock_data_dictionary(trx);
  1693. trx_free(trx);
  1694. prebuilt->trx->op_info = "";
  1695. DBUG_EXECUTE_IF("ib_import_before_checkpoint_crash", DBUG_SUICIDE(););
  1696. log_make_checkpoint();
  1697. return(err);
  1698. }
  1699. /*****************************************************************//**
  1700. Report error during tablespace import. */
  1701. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1702. dberr_t
  1703. row_import_error(
  1704. /*=============*/
  1705. row_prebuilt_t* prebuilt, /*!< in/out: prebuilt from handler */
  1706. trx_t* trx, /*!< in/out: transaction for import */
  1707. dberr_t err) /*!< in: error code */
  1708. {
  1709. if (!trx_is_interrupted(trx)) {
  1710. char table_name[MAX_FULL_NAME_LEN + 1];
  1711. innobase_format_name(
  1712. table_name, sizeof(table_name),
  1713. prebuilt->table->name.m_name);
  1714. ib_senderrf(
  1715. trx->mysql_thd, IB_LOG_LEVEL_WARN,
  1716. ER_INNODB_IMPORT_ERROR,
  1717. table_name, (ulong) err, ut_strerr(err));
  1718. }
  1719. return(row_import_cleanup(prebuilt, trx, err));
  1720. }
  1721. /*****************************************************************//**
  1722. Adjust the root page index node and leaf node segment headers of all
  1723. the table's secondary indexes, updating them with the new space id.
  1724. @return error code */
  1725. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1726. dberr_t
  1727. row_import_adjust_root_pages_of_secondary_indexes(
  1728. /*==============================================*/
  1729. trx_t* trx, /*!< in: transaction used for
  1730. the import */
  1731. dict_table_t* table, /*!< in: table the indexes
  1732. belong to */
  1733. const row_import& cfg) /*!< Import context */
  1734. {
  1735. dict_index_t* index;
  1736. ulint n_rows_in_table;
  1737. dberr_t err = DB_SUCCESS;
  1738. /* Skip the clustered index. */
  1739. index = dict_table_get_first_index(table);
  1740. n_rows_in_table = cfg.get_n_rows(index->name);
  1741. DBUG_EXECUTE_IF("ib_import_sec_rec_count_mismatch_failure",
  1742. n_rows_in_table++;);
  1743. /* Adjust the root pages of the secondary indexes only. */
  1744. while ((index = dict_table_get_next_index(index)) != NULL) {
  1745. ut_a(!dict_index_is_clust(index));
  1746. if (!(index->type & DICT_CORRUPT)
  1747. && index->page != FIL_NULL) {
  1748. /* Update the Btree segment headers for index node and
  1749. leaf nodes in the root page. Set the new space id. */
  1750. err = btr_root_adjust_on_import(index);
  1751. } else {
  1752. ib::warn() << "Skip adjustment of root pages for"
  1753. " index " << index->name << ".";
  1754. err = DB_CORRUPTION;
  1755. }
  1756. if (err != DB_SUCCESS) {
  1757. if (index->type & DICT_CLUSTERED) {
  1758. break;
  1759. }
  1760. ib_errf(trx->mysql_thd,
  1761. IB_LOG_LEVEL_WARN,
  1762. ER_INNODB_INDEX_CORRUPT,
  1763. "Index %s not found or corrupt,"
  1764. " you should recreate this index.",
  1765. index->name());
  1766. /* Do not bail out, so that the data
  1767. can be recovered. */
  1768. err = DB_SUCCESS;
  1769. index->type |= DICT_CORRUPT;
  1770. continue;
  1771. }
  1772. /* If we failed to purge any records in the index then
  1773. do it the hard way.
  1774. TODO: We can do this in the first pass by generating UNDO log
  1775. records for the failed rows. */
  1776. if (!cfg.requires_purge(index->name)) {
  1777. continue;
  1778. }
  1779. IndexPurge purge(trx, index);
  1780. trx->op_info = "secondary: purge delete marked records";
  1781. err = purge.garbage_collect();
  1782. trx->op_info = "";
  1783. if (err != DB_SUCCESS) {
  1784. break;
  1785. } else if (purge.get_n_rows() != n_rows_in_table) {
  1786. ib_errf(trx->mysql_thd,
  1787. IB_LOG_LEVEL_WARN,
  1788. ER_INNODB_INDEX_CORRUPT,
  1789. "Index '%s' contains " ULINTPF " entries, "
  1790. "should be " ULINTPF ", you should recreate "
  1791. "this index.", index->name(),
  1792. purge.get_n_rows(), n_rows_in_table);
  1793. index->type |= DICT_CORRUPT;
  1794. /* Do not bail out, so that the data
  1795. can be recovered. */
  1796. err = DB_SUCCESS;
  1797. }
  1798. }
  1799. return(err);
  1800. }
  1801. /*****************************************************************//**
  1802. Ensure that dict_sys.row_id exceeds SELECT MAX(DB_ROW_ID). */
  1803. MY_ATTRIBUTE((nonnull)) static
  1804. void
  1805. row_import_set_sys_max_row_id(
  1806. /*==========================*/
  1807. row_prebuilt_t* prebuilt, /*!< in/out: prebuilt from
  1808. handler */
  1809. const dict_table_t* table) /*!< in: table to import */
  1810. {
  1811. const rec_t* rec;
  1812. mtr_t mtr;
  1813. btr_pcur_t pcur;
  1814. row_id_t row_id = 0;
  1815. dict_index_t* index;
  1816. index = dict_table_get_first_index(table);
  1817. ut_ad(index->is_primary());
  1818. ut_ad(dict_index_is_auto_gen_clust(index));
  1819. mtr_start(&mtr);
  1820. mtr_set_log_mode(&mtr, MTR_LOG_NO_REDO);
  1821. btr_pcur_open_at_index_side(
  1822. false, // High end
  1823. index,
  1824. BTR_SEARCH_LEAF,
  1825. &pcur,
  1826. true, // Init cursor
  1827. 0, // Leaf level
  1828. &mtr);
  1829. btr_pcur_move_to_prev_on_page(&pcur);
  1830. rec = btr_pcur_get_rec(&pcur);
  1831. /* Check for empty table. */
  1832. if (page_rec_is_infimum(rec)) {
  1833. /* The table is empty. */
  1834. } else if (rec_is_metadata(rec, *index)) {
  1835. /* The clustered index contains the metadata record only,
  1836. that is, the table is empty. */
  1837. } else {
  1838. row_id = mach_read_from_6(rec);
  1839. }
  1840. btr_pcur_close(&pcur);
  1841. mtr_commit(&mtr);
  1842. if (row_id) {
  1843. /* Update the system row id if the imported index row id is
  1844. greater than the max system row id. */
  1845. mutex_enter(&dict_sys.mutex);
  1846. if (row_id >= dict_sys.row_id) {
  1847. dict_sys.row_id = row_id + 1;
  1848. dict_hdr_flush_row_id();
  1849. }
  1850. mutex_exit(&dict_sys.mutex);
  1851. }
  1852. }
  1853. /*****************************************************************//**
  1854. Read a string from the meta data file.
  1855. @return DB_SUCCESS or error code. */
  1856. static
  1857. dberr_t
  1858. row_import_cfg_read_string(
  1859. /*=======================*/
  1860. FILE* file, /*!< in/out: File to read from */
  1861. byte* ptr, /*!< out: string to read */
  1862. ulint max_len) /*!< in: maximum length of the output
  1863. buffer in bytes */
  1864. {
  1865. DBUG_EXECUTE_IF("ib_import_string_read_error",
  1866. errno = EINVAL; return(DB_IO_ERROR););
  1867. ulint len = 0;
  1868. while (!feof(file)) {
  1869. int ch = fgetc(file);
  1870. if (ch == EOF) {
  1871. break;
  1872. } else if (ch != 0) {
  1873. if (len < max_len) {
  1874. ptr[len++] = ch;
  1875. } else {
  1876. break;
  1877. }
  1878. /* max_len includes the NUL byte */
  1879. } else if (len != max_len - 1) {
  1880. break;
  1881. } else {
  1882. ptr[len] = 0;
  1883. return(DB_SUCCESS);
  1884. }
  1885. }
  1886. errno = EINVAL;
  1887. return(DB_IO_ERROR);
  1888. }
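/* Illustrative sketch, not part of this file: the length-prefixed string
convention used by the reader above, where the stored length includes the
terminating NUL byte, could be exercised in isolation roughly as follows.
The names read_exact_cstring() and make_temp_file() are hypothetical helpers
introduced only for this example. Unlike the function above, the sketch
rejects an over-long string before storing the extra byte; the outcome
(failure) is the same. */

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-alone helper (an assumption, not the InnoDB function).
// max_len is the declared length and includes the terminating NUL byte.
// Returns true only if exactly max_len - 1 non-NUL bytes are followed by NUL.
static bool read_exact_cstring(std::FILE* file, unsigned char* out,
                               std::size_t max_len)
{
    assert(max_len > 0);
    std::size_t len = 0;

    for (;;) {
        int ch = std::fgetc(file);

        if (ch == EOF) {
            return false;               // truncated input
        } else if (ch != 0) {
            if (len >= max_len - 1) {
                return false;           // longer than declared
            }
            out[len++] = static_cast<unsigned char>(ch);
        } else if (len != max_len - 1) {
            return false;               // shorter than declared
        } else {
            out[len] = 0;
            return true;
        }
    }
}

// Hypothetical helper for trying the reader: stores n bytes in a temporary
// file and rewinds it. Returns nullptr on failure.
static std::FILE* make_temp_file(const void* data, std::size_t n)
{
    std::FILE* f = std::tmpfile();

    if (f != nullptr) {
        if (std::fwrite(data, 1, n, f) != n) {
            std::fclose(f);
            return nullptr;
        }
        std::rewind(f);
    }
    return f;
}
```

/* In this sketch a declared length of 4 accepts the bytes 'a' 'b' 'c' NUL
and rejects both a shorter string and one that is missing its NUL. */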
  1889. /*********************************************************************//**
  1890. Read the meta data (index user fields) from the config file.
  1891. @return DB_SUCCESS or error code. */
  1892. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1893. dberr_t
  1894. row_import_cfg_read_index_fields(
  1895. /*=============================*/
  1896. FILE* file, /*!< in: file to read from */
  1897. THD* thd, /*!< in/out: session */
  1898. row_index_t* index) /*!< Index being read in */
  1899. {
  1900. byte row[sizeof(ib_uint32_t) * 3];
  1901. ulint n_fields = index->m_n_fields;
  1902. index->m_fields = UT_NEW_ARRAY_NOKEY(dict_field_t, n_fields);
  1903. /* Trigger OOM */
  1904. DBUG_EXECUTE_IF(
  1905. "ib_import_OOM_4",
  1906. UT_DELETE_ARRAY(index->m_fields);
  1907. index->m_fields = NULL;
  1908. );
  1909. if (index->m_fields == NULL) {
  1910. return(DB_OUT_OF_MEMORY);
  1911. }
  1912. dict_field_t* field = index->m_fields;
  1913. for (ulint i = 0; i < n_fields; ++i, ++field) {
  1914. byte* ptr = row;
  1915. /* Trigger EOF */
  1916. DBUG_EXECUTE_IF("ib_import_io_read_error_1",
  1917. (void) fseek(file, 0L, SEEK_END););
  1918. if (fread(row, 1, sizeof(row), file) != sizeof(row)) {
  1919. ib_senderrf(
  1920. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  1921. (ulong) errno, strerror(errno),
  1922. "while reading index fields.");
  1923. return(DB_IO_ERROR);
  1924. }
  1925. new (field) dict_field_t();
  1926. field->prefix_len = mach_read_from_4(ptr);
  1927. ptr += sizeof(ib_uint32_t);
  1928. field->fixed_len = mach_read_from_4(ptr);
  1929. ptr += sizeof(ib_uint32_t);
  1930. /* Include the NUL byte in the length. */
  1931. ulint len = mach_read_from_4(ptr);
  1932. byte* name = UT_NEW_ARRAY_NOKEY(byte, len);
  1933. /* Trigger OOM */
  1934. DBUG_EXECUTE_IF(
  1935. "ib_import_OOM_5",
  1936. UT_DELETE_ARRAY(name);
  1937. name = NULL;
  1938. );
  1939. if (name == NULL) {
  1940. return(DB_OUT_OF_MEMORY);
  1941. }
  1942. field->name = reinterpret_cast<const char*>(name);
  1943. dberr_t err = row_import_cfg_read_string(file, name, len);
  1944. if (err != DB_SUCCESS) {
  1945. ib_senderrf(
  1946. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  1947. (ulong) errno, strerror(errno),
  1948. "while parsing index field name.");
  1949. return(err);
  1950. }
  1951. }
  1952. return(DB_SUCCESS);
  1953. }
  1954. /*****************************************************************//**
  1955. Read the index names and root page numbers of the indexes and set the values.
  1956. Row format [root_page_no, len of str, str ... ]
  1957. @return DB_SUCCESS or error code. */
  1958. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1959. dberr_t
  1960. row_import_read_index_data(
  1961. /*=======================*/
  1962. FILE* file, /*!< in: File to read from */
  1963. THD* thd, /*!< in: session */
  1964. row_import* cfg) /*!< in/out: meta-data read */
  1965. {
  1966. byte* ptr;
  1967. row_index_t* cfg_index;
  1968. byte row[sizeof(index_id_t) + sizeof(ib_uint32_t) * 9];
  1969. /* FIXME: What is the max value? */
  1970. ut_a(cfg->m_n_indexes > 0);
  1971. ut_a(cfg->m_n_indexes < 1024);
  1972. cfg->m_indexes = UT_NEW_ARRAY_NOKEY(row_index_t, cfg->m_n_indexes);
  1973. /* Trigger OOM */
  1974. DBUG_EXECUTE_IF(
  1975. "ib_import_OOM_6",
  1976. UT_DELETE_ARRAY(cfg->m_indexes);
  1977. cfg->m_indexes = NULL;
  1978. );
  1979. if (cfg->m_indexes == NULL) {
  1980. return(DB_OUT_OF_MEMORY);
  1981. }
  1982. memset(cfg->m_indexes, 0x0, sizeof(*cfg->m_indexes) * cfg->m_n_indexes);
  1983. cfg_index = cfg->m_indexes;
  1984. for (ulint i = 0; i < cfg->m_n_indexes; ++i, ++cfg_index) {
  1985. /* Trigger EOF */
  1986. DBUG_EXECUTE_IF("ib_import_io_read_error_2",
  1987. (void) fseek(file, 0L, SEEK_END););
  1988. /* Read the index data. */
  1989. size_t n_bytes = fread(row, 1, sizeof(row), file);
  1990. /* Trigger EOF */
  1991. DBUG_EXECUTE_IF("ib_import_io_read_error",
  1992. (void) fseek(file, 0L, SEEK_END););
  1993. if (n_bytes != sizeof(row)) {
  1994. char msg[BUFSIZ];
  1995. snprintf(msg, sizeof(msg),
  1996. "while reading index meta-data, expected "
  1997. "to read " ULINTPF
  1998. " bytes but read only " ULINTPF " bytes",
  1999. sizeof(row), n_bytes);
  2000. ib_senderrf(
  2001. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2002. (ulong) errno, strerror(errno), msg);
  2003. ib::error() << "IO Error: " << msg;
  2004. return(DB_IO_ERROR);
  2005. }
  2006. ptr = row;
  2007. cfg_index->m_id = mach_read_from_8(ptr);
  2008. ptr += sizeof(index_id_t);
  2009. cfg_index->m_space = mach_read_from_4(ptr);
  2010. ptr += sizeof(ib_uint32_t);
  2011. cfg_index->m_page_no = mach_read_from_4(ptr);
  2012. ptr += sizeof(ib_uint32_t);
  2013. cfg_index->m_type = mach_read_from_4(ptr);
  2014. ptr += sizeof(ib_uint32_t);
  2015. cfg_index->m_trx_id_offset = mach_read_from_4(ptr);
  2016. if (cfg_index->m_trx_id_offset != mach_read_from_4(ptr)) {
  2017. ut_ad(0);
  2018. /* Overflow. Pretend that the clustered index
  2019. has a variable-length PRIMARY KEY. */
  2020. cfg_index->m_trx_id_offset = 0;
  2021. }
  2022. ptr += sizeof(ib_uint32_t);
  2023. cfg_index->m_n_user_defined_cols = mach_read_from_4(ptr);
  2024. ptr += sizeof(ib_uint32_t);
  2025. cfg_index->m_n_uniq = mach_read_from_4(ptr);
  2026. ptr += sizeof(ib_uint32_t);
  2027. cfg_index->m_n_nullable = mach_read_from_4(ptr);
  2028. ptr += sizeof(ib_uint32_t);
  2029. cfg_index->m_n_fields = mach_read_from_4(ptr);
  2030. ptr += sizeof(ib_uint32_t);
  2031. /* The NUL byte is included in the name length. */
  2032. ulint len = mach_read_from_4(ptr);
  2033. if (len > OS_FILE_MAX_PATH) {
  2034. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  2035. ER_INNODB_INDEX_CORRUPT,
  2036. "Index name length (" ULINTPF ") is too long, "
  2037. "the meta-data is corrupt", len);
  2038. return(DB_CORRUPTION);
  2039. }
  2040. cfg_index->m_name = UT_NEW_ARRAY_NOKEY(byte, len);
  2041. /* Trigger OOM */
  2042. DBUG_EXECUTE_IF(
  2043. "ib_import_OOM_7",
  2044. UT_DELETE_ARRAY(cfg_index->m_name);
  2045. cfg_index->m_name = NULL;
  2046. );
  2047. if (cfg_index->m_name == NULL) {
  2048. return(DB_OUT_OF_MEMORY);
  2049. }
  2050. dberr_t err;
  2051. err = row_import_cfg_read_string(file, cfg_index->m_name, len);
  2052. if (err != DB_SUCCESS) {
  2053. ib_senderrf(
  2054. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2055. (ulong) errno, strerror(errno),
  2056. "while parsing index name.");
  2057. return(err);
  2058. }
  2059. err = row_import_cfg_read_index_fields(file, thd, cfg_index);
  2060. if (err != DB_SUCCESS) {
  2061. return(err);
  2062. }
  2063. }
  2064. return(DB_SUCCESS);
  2065. }
  2066. /*****************************************************************//**
  2067. Read the number of indexes, then the meta-data of each index (v1 format).
  2068. @return DB_SUCCESS or error code. */
  2069. static
  2070. dberr_t
  2071. row_import_read_indexes(
  2072. /*====================*/
  2073. FILE* file, /*!< in: File to read from */
  2074. THD* thd, /*!< in: session */
  2075. row_import* cfg) /*!< in/out: meta-data read */
  2076. {
  2077. byte row[sizeof(ib_uint32_t)];
  2078. /* Trigger EOF */
  2079. DBUG_EXECUTE_IF("ib_import_io_read_error_3",
  2080. (void) fseek(file, 0L, SEEK_END););
  2081. /* Read the number of indexes. */
  2082. if (fread(row, 1, sizeof(row), file) != sizeof(row)) {
  2083. ib_senderrf(
  2084. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2085. (ulong) errno, strerror(errno),
  2086. "while reading number of indexes.");
  2087. return(DB_IO_ERROR);
  2088. }
  2089. cfg->m_n_indexes = mach_read_from_4(row);
  2090. if (cfg->m_n_indexes == 0) {
  2091. ib_errf(thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2092. "Number of indexes in meta-data file is 0");
  2093. return(DB_CORRUPTION);
  2094. } else if (cfg->m_n_indexes >= 1024) {
  2095. /* FIXME: What is the upper limit? */
  2096. ib_errf(thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2097. "Number of indexes in meta-data file is too high: "
  2098. ULINTPF, cfg->m_n_indexes);
  2099. cfg->m_n_indexes = 0;
  2100. return(DB_CORRUPTION);
  2101. }
  2102. return(row_import_read_index_data(file, thd, cfg));
  2103. }
  2104. /*********************************************************************//**
  2105. Read the meta data (table columns) config file. Deserialise the contents of
  2106. dict_col_t structure, along with the column name. */
  2107. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  2108. dberr_t
  2109. row_import_read_columns(
  2110. /*====================*/
  2111. FILE* file, /*!< in: file to read from */
  2112. THD* thd, /*!< in/out: session */
  2113. row_import* cfg) /*!< in/out: meta-data read */
  2114. {
  2115. dict_col_t* col;
  2116. byte row[sizeof(ib_uint32_t) * 8];
  2117. /* FIXME: What should the upper limit be? */
  2118. ut_a(cfg->m_n_cols > 0);
  2119. ut_a(cfg->m_n_cols < 1024);
  2120. cfg->m_cols = UT_NEW_ARRAY_NOKEY(dict_col_t, cfg->m_n_cols);
  2121. /* Trigger OOM */
  2122. DBUG_EXECUTE_IF(
  2123. "ib_import_OOM_8",
  2124. UT_DELETE_ARRAY(cfg->m_cols);
  2125. cfg->m_cols = NULL;
  2126. );
  2127. if (cfg->m_cols == NULL) {
  2128. return(DB_OUT_OF_MEMORY);
  2129. }
  2130. cfg->m_col_names = UT_NEW_ARRAY_NOKEY(byte*, cfg->m_n_cols);
  2131. /* Trigger OOM */
  2132. DBUG_EXECUTE_IF(
  2133. "ib_import_OOM_9",
  2134. UT_DELETE_ARRAY(cfg->m_col_names);
  2135. cfg->m_col_names = NULL;
  2136. );
  2137. if (cfg->m_col_names == NULL) {
  2138. return(DB_OUT_OF_MEMORY);
  2139. }
  2140. memset(cfg->m_cols, 0x0, sizeof(*cfg->m_cols) * cfg->m_n_cols);
  2141. memset(cfg->m_col_names, 0x0, sizeof(*cfg->m_col_names) * cfg->m_n_cols);
  2142. col = cfg->m_cols;
  2143. for (ulint i = 0; i < cfg->m_n_cols; ++i, ++col) {
  2144. byte* ptr = row;
  2145. /* Trigger EOF */
  2146. DBUG_EXECUTE_IF("ib_import_io_read_error_4",
  2147. (void) fseek(file, 0L, SEEK_END););
  2148. if (fread(row, 1, sizeof(row), file) != sizeof(row)) {
  2149. ib_senderrf(
  2150. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2151. (ulong) errno, strerror(errno),
  2152. "while reading table column meta-data.");
  2153. return(DB_IO_ERROR);
  2154. }
  2155. col->prtype = mach_read_from_4(ptr);
  2156. ptr += sizeof(ib_uint32_t);
  2157. col->mtype = mach_read_from_4(ptr);
  2158. ptr += sizeof(ib_uint32_t);
  2159. col->len = mach_read_from_4(ptr);
  2160. ptr += sizeof(ib_uint32_t);
  2161. ulint mbminmaxlen = mach_read_from_4(ptr);
  2162. col->mbmaxlen = mbminmaxlen / 5;
  2163. col->mbminlen = mbminmaxlen % 5;
  2164. ptr += sizeof(ib_uint32_t);
  2165. col->ind = mach_read_from_4(ptr);
  2166. ptr += sizeof(ib_uint32_t);
  2167. col->ord_part = mach_read_from_4(ptr);
  2168. ptr += sizeof(ib_uint32_t);
  2169. col->max_prefix = mach_read_from_4(ptr);
  2170. ptr += sizeof(ib_uint32_t);
  2171. /* Read in the column name as [len, byte array]. The len
  2172. includes the NUL byte. */
  2173. ulint len = mach_read_from_4(ptr);
  2174. /* FIXME: What is the maximum column name length? */
  2175. if (len == 0 || len > 128) {
  2176. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  2177. ER_IO_READ_ERROR,
  2178. "Column name length " ULINTPF ", is invalid",
  2179. len);
  2180. return(DB_CORRUPTION);
  2181. }
  2182. cfg->m_col_names[i] = UT_NEW_ARRAY_NOKEY(byte, len);
  2183. /* Trigger OOM */
  2184. DBUG_EXECUTE_IF(
  2185. "ib_import_OOM_10",
  2186. UT_DELETE_ARRAY(cfg->m_col_names[i]);
  2187. cfg->m_col_names[i] = NULL;
  2188. );
  2189. if (cfg->m_col_names[i] == NULL) {
  2190. return(DB_OUT_OF_MEMORY);
  2191. }
  2192. dberr_t err;
  2193. err = row_import_cfg_read_string(
  2194. file, cfg->m_col_names[i], len);
  2195. if (err != DB_SUCCESS) {
  2196. ib_senderrf(
  2197. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2198. (ulong) errno, strerror(errno),
  2199. "while parsing table column name.");
  2200. return(err);
  2201. }
  2202. }
  2203. return(DB_SUCCESS);
  2204. }
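/* Illustrative sketch, not part of this file: each column row above is a
sequence of 4-byte fields read most significant byte first, and the
mbminmaxlen field packs two small values that the reader splits with / 5 and
% 5. The names read_be32(), MbLen and split_mbminmaxlen() are hypothetical
and exist only in this example. */

```cpp
#include <cassert>
#include <cstdint>

// Decode a 4-byte big-endian unsigned integer (hypothetical helper).
static std::uint32_t read_be32(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24)
         | (static_cast<std::uint32_t>(p[1]) << 16)
         | (static_cast<std::uint32_t>(p[2]) << 8)
         |  static_cast<std::uint32_t>(p[3]);
}

// Minimum and maximum bytes per character, as split by the reader above.
struct MbLen {
    std::uint32_t mbminlen;
    std::uint32_t mbmaxlen;
};

// Split the packed value the same way as the loop above:
// quotient by 5 is the maximum, remainder is the minimum.
static MbLen split_mbminmaxlen(std::uint32_t packed)
{
    MbLen out;
    out.mbmaxlen = packed / 5;
    out.mbminlen = packed % 5;
    return out;
}
```

/* For instance, the bytes 00 00 00 10 decode to 16, which splits into a
maximum of 3 and a minimum of 1 byte per character. */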
  2205. /*****************************************************************//**
  2206. Read the contents of the <tablespace>.cfg file.
  2207. @return DB_SUCCESS or error code. */
  2208. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  2209. dberr_t
  2210. row_import_read_v1(
  2211. /*===============*/
  2212. FILE* file, /*!< in: File to read from */
  2213. THD* thd, /*!< in: session */
  2214. row_import* cfg) /*!< out: meta data */
  2215. {
  2216. byte value[sizeof(ib_uint32_t)];
  2217. /* Trigger EOF */
  2218. DBUG_EXECUTE_IF("ib_import_io_read_error_5",
  2219. (void) fseek(file, 0L, SEEK_END););
  2220. /* Read the hostname where the tablespace was exported. */
  2221. if (fread(value, 1, sizeof(value), file) != sizeof(value)) {
  2222. ib_senderrf(
  2223. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2224. (ulong) errno, strerror(errno),
  2225. "while reading meta-data export hostname length.");
  2226. return(DB_IO_ERROR);
  2227. }
  2228. ulint len = mach_read_from_4(value);
  2229. /* NUL byte is part of name length. */
  2230. cfg->m_hostname = UT_NEW_ARRAY_NOKEY(byte, len);
  2231. /* Trigger OOM */
  2232. DBUG_EXECUTE_IF(
  2233. "ib_import_OOM_1",
  2234. UT_DELETE_ARRAY(cfg->m_hostname);
  2235. cfg->m_hostname = NULL;
  2236. );
  2237. if (cfg->m_hostname == NULL) {
  2238. return(DB_OUT_OF_MEMORY);
  2239. }
  2240. dberr_t err = row_import_cfg_read_string(file, cfg->m_hostname, len);
  2241. if (err != DB_SUCCESS) {
  2242. ib_senderrf(
  2243. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2244. (ulong) errno, strerror(errno),
  2245. "while parsing export hostname.");
  2246. return(err);
  2247. }
  2248. /* Trigger EOF */
  2249. DBUG_EXECUTE_IF("ib_import_io_read_error_6",
  2250. (void) fseek(file, 0L, SEEK_END););
  2251. /* Read the table name of tablespace that was exported. */
  2252. if (fread(value, 1, sizeof(value), file) != sizeof(value)) {
  2253. ib_senderrf(
  2254. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2255. (ulong) errno, strerror(errno),
  2256. "while reading meta-data table name length.");
  2257. return(DB_IO_ERROR);
  2258. }
  2259. len = mach_read_from_4(value);
  2260. /* NUL byte is part of name length. */
  2261. cfg->m_table_name = UT_NEW_ARRAY_NOKEY(byte, len);
  2262. /* Trigger OOM */
  2263. DBUG_EXECUTE_IF(
  2264. "ib_import_OOM_2",
  2265. UT_DELETE_ARRAY(cfg->m_table_name);
  2266. cfg->m_table_name = NULL;
  2267. );
  2268. if (cfg->m_table_name == NULL) {
  2269. return(DB_OUT_OF_MEMORY);
  2270. }
  2271. err = row_import_cfg_read_string(file, cfg->m_table_name, len);
  2272. if (err != DB_SUCCESS) {
  2273. ib_senderrf(
  2274. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2275. (ulong) errno, strerror(errno),
  2276. "while parsing table name.");
  2277. return(err);
  2278. }
  2279. ib::info() << "Importing tablespace for table '" << cfg->m_table_name
  2280. << "' that was exported from host '" << cfg->m_hostname << "'";
  2281. byte row[sizeof(ib_uint32_t) * 3];
  2282. /* Trigger EOF */
  2283. DBUG_EXECUTE_IF("ib_import_io_read_error_7",
  2284. (void) fseek(file, 0L, SEEK_END););
  2285. /* Read the autoinc value. */
  2286. if (fread(row, 1, sizeof(ib_uint64_t), file) != sizeof(ib_uint64_t)) {
  2287. ib_senderrf(
  2288. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2289. (ulong) errno, strerror(errno),
  2290. "while reading autoinc value.");
  2291. return(DB_IO_ERROR);
  2292. }
  2293. cfg->m_autoinc = mach_read_from_8(row);
  2294. /* Trigger EOF */
  2295. DBUG_EXECUTE_IF("ib_import_io_read_error_8",
  2296. (void) fseek(file, 0L, SEEK_END););
  2297. /* Read the tablespace page size. */
  2298. if (fread(row, 1, sizeof(row), file) != sizeof(row)) {
  2299. ib_senderrf(
  2300. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2301. (ulong) errno, strerror(errno),
  2302. "while reading meta-data header.");
  2303. return(DB_IO_ERROR);
  2304. }
  2305. byte* ptr = row;
  2306. const ulint logical_page_size = mach_read_from_4(ptr);
  2307. ptr += sizeof(ib_uint32_t);
  2308. if (logical_page_size != srv_page_size) {
  2309. ib_errf(thd, IB_LOG_LEVEL_ERROR, ER_TABLE_SCHEMA_MISMATCH,
  2310. "Tablespace to be imported has a different"
  2311. " page size than this server. Server page size"
  2312. " is %lu, whereas tablespace page size"
  2313. " is " ULINTPF,
  2314. srv_page_size,
  2315. logical_page_size);
  2316. return(DB_ERROR);
  2317. }
  2318. cfg->m_flags = mach_read_from_4(ptr);
  2319. ptr += sizeof(ib_uint32_t);
  2320. cfg->m_zip_size = dict_tf_get_zip_size(cfg->m_flags);
  2321. cfg->m_n_cols = mach_read_from_4(ptr);
  2322. if (!dict_tf_is_valid(cfg->m_flags)) {
  2323. ib_errf(thd, IB_LOG_LEVEL_ERROR,
  2324. ER_TABLE_SCHEMA_MISMATCH,
  2325. "Invalid table flags: " ULINTPF, cfg->m_flags);
  2326. return(DB_CORRUPTION);
  2327. }
  2328. err = row_import_read_columns(file, thd, cfg);
  2329. if (err == DB_SUCCESS) {
  2330. err = row_import_read_indexes(file, thd, cfg);
  2331. }
  2332. return(err);
  2333. }
  2334. /**
  2335. Read the contents of the <tablespace>.cfg file.
  2336. @return DB_SUCCESS or error code. */
  2337. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  2338. dberr_t
  2339. row_import_read_meta_data(
  2340. /*======================*/
  2341. FILE* file, /*!< in: File to read from */
  2342. THD* thd, /*!< in: session */
  2343. row_import& cfg) /*!< out: contents of the .cfg file */
  2344. {
  2345. byte row[sizeof(ib_uint32_t)];
  2346. /* Trigger EOF */
  2347. DBUG_EXECUTE_IF("ib_import_io_read_error_9",
  2348. (void) fseek(file, 0L, SEEK_END););
  2349. if (fread(&row, 1, sizeof(row), file) != sizeof(row)) {
  2350. ib_senderrf(
  2351. thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2352. (ulong) errno, strerror(errno),
  2353. "while reading meta-data version.");
  2354. return(DB_IO_ERROR);
  2355. }
  2356. cfg.m_version = mach_read_from_4(row);
  2357. /* Check the version number. */
  2358. switch (cfg.m_version) {
  2359. case IB_EXPORT_CFG_VERSION_V1:
  2360. return(row_import_read_v1(file, thd, &cfg));
  2361. default:
  2362. ib_errf(thd, IB_LOG_LEVEL_ERROR, ER_IO_READ_ERROR,
  2363. "Unsupported meta-data version number (" ULINTPF "), "
  2364. "file ignored", cfg.m_version);
  2365. }
  2366. return(DB_ERROR);
  2367. }
  2368. /**
  2369. Read the contents of the <tablename>.cfg file.
  2370. @return DB_SUCCESS or error code. */
  2371. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  2372. dberr_t
  2373. row_import_read_cfg(
  2374. /*================*/
  2375. dict_table_t* table, /*!< in: table */
  2376. THD* thd, /*!< in: session */
  2377. row_import& cfg) /*!< out: contents of the .cfg file */
  2378. {
  2379. dberr_t err;
  2380. char name[OS_FILE_MAX_PATH];
  2381. cfg.m_table = table;
  2382. srv_get_meta_data_filename(table, name, sizeof(name));
  2383. FILE* file = fopen(name, "rb");
  2384. if (file == NULL) {
  2385. char msg[BUFSIZ];
  2386. snprintf(msg, sizeof(msg),
  2387. "Error opening '%s', will attempt to import"
  2388. " without schema verification", name);
  2389. ib_senderrf(
  2390. thd, IB_LOG_LEVEL_WARN, ER_IO_READ_ERROR,
  2391. (ulong) errno, strerror(errno), msg);
  2392. cfg.m_missing = true;
  2393. err = DB_FAIL;
  2394. } else {
  2395. cfg.m_missing = false;
  2396. err = row_import_read_meta_data(file, thd, cfg);
  2397. fclose(file);
  2398. }
  2399. return(err);
  2400. }
  2401. /** Update the root page numbers and tablespace ID of a table.
  2402. @param[in,out] trx dictionary transaction
  2403. @param[in,out] table persistent table
  2404. @param[in] reset whether to reset the fields to FIL_NULL
  2405. @return DB_SUCCESS or error code */
  2406. dberr_t
  2407. row_import_update_index_root(trx_t* trx, dict_table_t* table, bool reset)
  2408. {
  2409. const dict_index_t* index;
  2410. que_t* graph = 0;
  2411. dberr_t err = DB_SUCCESS;
  2412. ut_ad(reset || table->space->id == table->space_id);
  2413. static const char sql[] = {
  2414. "PROCEDURE UPDATE_INDEX_ROOT() IS\n"
  2415. "BEGIN\n"
  2416. "UPDATE SYS_INDEXES\n"
  2417. "SET SPACE = :space,\n"
  2418. " PAGE_NO = :page,\n"
  2419. " TYPE = :type\n"
  2420. "WHERE TABLE_ID = :table_id AND ID = :index_id;\n"
  2421. "END;\n"};
  2422. table->def_trx_id = trx->id;
  2423. for (index = dict_table_get_first_index(table);
  2424. index != 0;
  2425. index = dict_table_get_next_index(index)) {
  2426. pars_info_t* info;
  2427. ib_uint32_t page;
  2428. ib_uint32_t space;
  2429. ib_uint32_t type;
  2430. index_id_t index_id;
  2431. table_id_t table_id;
  2432. info = (graph != 0) ? graph->info : pars_info_create();
  2433. mach_write_to_4(
  2434. reinterpret_cast<byte*>(&type),
  2435. index->type);
  2436. mach_write_to_4(
  2437. reinterpret_cast<byte*>(&page),
  2438. reset ? FIL_NULL : index->page);
  2439. mach_write_to_4(
  2440. reinterpret_cast<byte*>(&space),
  2441. reset ? FIL_NULL : index->table->space_id);
  2442. mach_write_to_8(
  2443. reinterpret_cast<byte*>(&index_id),
  2444. index->id);
  2445. mach_write_to_8(
  2446. reinterpret_cast<byte*>(&table_id),
  2447. table->id);
  2448. /* If we set the corrupt bit during the IMPORT phase then
  2449. we need to update the system tables. */
  2450. pars_info_bind_int4_literal(info, "type", &type);
  2451. pars_info_bind_int4_literal(info, "space", &space);
  2452. pars_info_bind_int4_literal(info, "page", &page);
  2453. pars_info_bind_ull_literal(info, "index_id", &index_id);
  2454. pars_info_bind_ull_literal(info, "table_id", &table_id);
  2455. if (graph == 0) {
  2456. graph = pars_sql(info, sql);
  2457. ut_a(graph);
  2458. graph->trx = trx;
  2459. }
  2460. que_thr_t* thr;
  2461. graph->fork_type = QUE_FORK_MYSQL_INTERFACE;
  2462. ut_a(thr = que_fork_start_command(graph));
  2463. que_run_threads(thr);
  2464. DBUG_EXECUTE_IF("ib_import_internal_error",
  2465. trx->error_state = DB_ERROR;);
  2466. err = trx->error_state;
  2467. if (err != DB_SUCCESS) {
  2468. ib_errf(trx->mysql_thd, IB_LOG_LEVEL_ERROR,
  2469. ER_INTERNAL_ERROR,
  2470. "While updating the <space, root page"
  2471. " number> of index %s - %s",
  2472. index->name(), ut_strerr(err));
  2473. break;
  2474. }
  2475. }
  2476. que_graph_free(graph);
  2477. return(err);
  2478. }
  2479. /** Callback arg for row_import_set_discarded. */
  2480. struct discard_t {
  2481. ib_uint32_t flags2; /*!< Value read from column */
  2482. bool state; /*!< New state of the flag */
  2483. ulint n_recs; /*!< Number of recs processed */
  2484. };
  2485. /******************************************************************//**
  2486. Fetch callback that sets or unsets the DISCARDED tablespace flag in
  2487. SYS_TABLES. The flags are stored in the MIX_LEN column.
  2488. @return FALSE if all OK */
  2489. static
  2490. ibool
  2491. row_import_set_discarded(
  2492. /*=====================*/
  2493. void* row, /*!< in: sel_node_t* */
  2494. void* user_arg) /*!< in/out: discard_t* with the new flag state */
  2495. {
  2496. sel_node_t* node = static_cast<sel_node_t*>(row);
  2497. discard_t* discard = static_cast<discard_t*>(user_arg);
  2498. dfield_t* dfield = que_node_get_val(node->select_list);
  2499. dtype_t* type = dfield_get_type(dfield);
  2500. ulint len = dfield_get_len(dfield);
  2501. ut_a(dtype_get_mtype(type) == DATA_INT);
  2502. ut_a(len == sizeof(ib_uint32_t));
  2503. ulint flags2 = mach_read_from_4(
  2504. static_cast<byte*>(dfield_get_data(dfield)));
  2505. if (discard->state) {
  2506. flags2 |= DICT_TF2_DISCARDED;
  2507. } else {
  2508. flags2 &= ~DICT_TF2_DISCARDED;
  2509. }
  2510. mach_write_to_4(reinterpret_cast<byte*>(&discard->flags2), flags2);
  2511. ++discard->n_recs;
  2512. /* There should be at most one matching record. */
  2513. ut_a(discard->n_recs == 1);
  2514. return(FALSE);
  2515. }
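/* Illustrative sketch, not part of this file: the callback above either sets
or clears a single bit in the value read from MIX_LEN and leaves all other
bits alone. EXAMPLE_TF2_DISCARDED and apply_discarded() are hypothetical
names, and the bit position is an assumption made for this example; the real
DICT_TF2_DISCARDED constant comes from the InnoDB dictionary headers. */

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit value, for illustration only.
static const std::uint32_t EXAMPLE_TF2_DISCARDED = 1U << 5;

// Return flags2 with the example DISCARDED bit set or cleared.
static std::uint32_t apply_discarded(std::uint32_t flags2, bool discarded)
{
    return discarded
        ? (flags2 | EXAMPLE_TF2_DISCARDED)
        : (flags2 & ~EXAMPLE_TF2_DISCARDED);
}
```

/* Applying the same state twice gives the same result, so the update is
idempotent. */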
  2516. /** Update the DICT_TF2_DISCARDED flag in SYS_TABLES.MIX_LEN.
  2517. @param[in,out] trx dictionary transaction
  2518. @param[in] table_id table identifier
  2519. @param[in] discarded whether to set or clear the flag
  2520. @return DB_SUCCESS or error code */
  2521. dberr_t row_import_update_discarded_flag(trx_t* trx, table_id_t table_id,
  2522. bool discarded)
  2523. {
  2524. pars_info_t* info;
  2525. discard_t discard;
  2526. static const char sql[] =
  2527. "PROCEDURE UPDATE_DISCARDED_FLAG() IS\n"
  2528. "DECLARE FUNCTION my_func;\n"
  2529. "DECLARE CURSOR c IS\n"
  2530. " SELECT MIX_LEN"
  2531. " FROM SYS_TABLES"
  2532. " WHERE ID = :table_id FOR UPDATE;"
  2533. "\n"
  2534. "BEGIN\n"
  2535. "OPEN c;\n"
  2536. "WHILE 1 = 1 LOOP\n"
  2537. " FETCH c INTO my_func();\n"
  2538. " IF c % NOTFOUND THEN\n"
  2539. " EXIT;\n"
  2540. " END IF;\n"
  2541. "END LOOP;\n"
  2542. "UPDATE SYS_TABLES"
  2543. " SET MIX_LEN = :flags2"
  2544. " WHERE ID = :table_id;\n"
  2545. "CLOSE c;\n"
  2546. "END;\n";
  2547. discard.n_recs = 0;
  2548. discard.state = discarded;
  2549. discard.flags2 = ULINT32_UNDEFINED;
  2550. info = pars_info_create();
  2551. pars_info_add_ull_literal(info, "table_id", table_id);
  2552. pars_info_bind_int4_literal(info, "flags2", &discard.flags2);
  2553. pars_info_bind_function(
  2554. info, "my_func", row_import_set_discarded, &discard);
  2555. dberr_t err = que_eval_sql(info, sql, false, trx);
  2556. ut_a(discard.n_recs == 1);
  2557. ut_a(discard.flags2 != ULINT32_UNDEFINED);
  2558. return(err);
  2559. }
  2560. struct fil_iterator_t {
  2561. pfs_os_file_t file; /*!< File handle */
  2562. const char* filepath; /*!< File path name */
  2563. os_offset_t start; /*!< From where to start */
  2564. os_offset_t end; /*!< Where to stop */
  2565. os_offset_t file_size; /*!< File size in bytes */
  2566. ulint n_io_buffers; /*!< Number of pages to use
  2567. for IO */
  2568. byte* io_buffer; /*!< Buffer to use for IO */
  2569. fil_space_crypt_t *crypt_data; /*!< Crypt data (if encrypted) */
  2570. byte* crypt_io_buffer; /*!< IO buffer when encrypted */
  2571. };
  2572. /********************************************************************//**
  2573. TODO: This can be made parallel trivially by chunking up the file and creating
  2574. a callback per thread. The main benefit will be to use multiple CPUs for
  2575. checksums and compressed tables. We have to do compressed tables block by
  2576. block right now. Secondly, we need to decompress/compress and copy too much
  2577. data. These are CPU intensive.
  2578. Iterate over all the pages in the tablespace.
  2579. @param iter - Tablespace iterator
  2580. @param block - block to use for IO
  2581. @param callback - Callback to inspect and update page contents
  2582. @return DB_SUCCESS or error code */
  2583. static
  2584. dberr_t
  2585. fil_iterate(
  2586. /*========*/
  2587. const fil_iterator_t& iter,
  2588. buf_block_t* block,
  2589. AbstractCallback& callback)
  2590. {
  2591. os_offset_t offset;
  2592. const ulint size = callback.physical_size();
  2593. ulint n_bytes = iter.n_io_buffers * size;
  2594. const ulint buf_size = srv_page_size
  2595. #ifdef HAVE_LZO
  2596. + LZO1X_1_15_MEM_COMPRESS
  2597. #elif defined HAVE_SNAPPY
  2598. + snappy_max_compressed_length(srv_page_size)
  2599. #endif
  2600. ;
  2601. byte* page_compress_buf = static_cast<byte*>(malloc(buf_size));
  2602. ut_ad(!srv_read_only_mode);
  2603. if (!page_compress_buf) {
  2604. return DB_OUT_OF_MEMORY;
  2605. }
  2606. ulint actual_space_id = 0;
  2607. const bool full_crc32 = fil_space_t::full_crc32(
  2608. callback.get_space_flags());
  2609. /* TODO: For ROW_FORMAT=COMPRESSED tables we do a lot of useless
  2610. copying for non-index pages. Unfortunately, it is
  2611. required by buf_zip_decompress() */
  2612. dberr_t err = DB_SUCCESS;
  2613. for (offset = iter.start; offset < iter.end; offset += n_bytes) {
  2614. if (callback.is_interrupted()) {
  2615. err = DB_INTERRUPTED;
  2616. goto func_exit;
  2617. }
  2618. byte* io_buffer = iter.io_buffer;
  2619. block->frame = io_buffer;
  2620. if (block->page.zip.data) {
  2621. /* Zip IO is done in the compressed page buffer. */
  2622. io_buffer = block->page.zip.data;
  2623. }
  2624. /* We have to read the exact number of bytes. Otherwise the
  2625. InnoDB IO functions croak on failed reads. */
  2626. n_bytes = ulint(ut_min(os_offset_t(n_bytes),
  2627. iter.end - offset));
  2628. ut_ad(n_bytes > 0);
  2629. ut_ad(!(n_bytes % size));
  2630. const bool encrypted = iter.crypt_data != NULL
  2631. && iter.crypt_data->should_encrypt();
  2632. /* Use additional crypt io buffer if tablespace is encrypted */
  2633. byte* const readptr = encrypted
  2634. ? iter.crypt_io_buffer : io_buffer;
  2635. byte* const writeptr = readptr;
  2636. IORequest read_request(IORequest::READ);
  2637. read_request.disable_partial_io_warnings();
  2638. err = os_file_read_no_error_handling(
  2639. read_request, iter.file, readptr, offset, n_bytes, 0);
  2640. if (err != DB_SUCCESS) {
  2641. ib::error() << iter.filepath
  2642. << ": os_file_read() failed";
  2643. goto func_exit;
  2644. }
  2645. bool updated = false;
  2646. os_offset_t page_off = offset;
  2647. ulint n_pages_read = n_bytes / size;
  2648. block->page.id.set_page_no(ulint(page_off / size));
  2649. for (ulint i = 0; i < n_pages_read;
  2650. block->page.id.set_page_no(block->page.id.page_no() + 1),
  2651. ++i, page_off += size, block->frame += size) {
  2652. byte* src = readptr + i * size;
  2653. const ulint page_no = page_get_page_no(src);
  2654. if (!page_no && block->page.id.page_no()) {
  2655. if (!buf_page_is_zeroes(src, size)) {
  2656. goto page_corrupted;
  2657. }
  2658. /* Proceed to the next page,
  2659. because this one is all zero. */
  2660. continue;
  2661. }
  2662. if (page_no != block->page.id.page_no()) {
  2663. page_corrupted:
  2664. ib::warn() << callback.filename()
  2665. << ": Page " << (offset / size)
  2666. << " at offset " << offset
  2667. << " looks corrupted.";
  2668. err = DB_CORRUPTION;
  2669. goto func_exit;
  2670. }
  2671. if (block->page.id.page_no() == 0) {
  2672. actual_space_id = mach_read_from_4(
  2673. src + FIL_PAGE_SPACE_ID);
  2674. }
  2675. const bool page_compressed =
  2676. (full_crc32
  2677. && fil_space_t::is_compressed(
  2678. callback.get_space_flags())
  2679. && buf_page_is_compressed(
  2680. src, callback.get_space_flags()))
  2681. || (fil_page_is_compressed_encrypted(src)
  2682. || fil_page_is_compressed(src));
  2683. if (page_compressed && block->page.zip.data) {
  2684. goto page_corrupted;
  2685. }
  2686. bool decrypted = false;
  2687. byte* dst = io_buffer + i * size;
  2688. bool frame_changed = false;
  2689. uint key_version = buf_page_get_key_version(
  2690. src, callback.get_space_flags());
  2691. if (!encrypted) {
  2692. } else if (!key_version) {
  2693. not_encrypted:
  2694. if (block->page.id.page_no() == 0
  2695. && block->page.zip.data) {
  2696. block->page.zip.data = src;
  2697. frame_changed = true;
  2698. } else if (!page_compressed
  2699. && !block->page.zip.data) {
  2700. block->frame = src;
  2701. frame_changed = true;
  2702. } else {
  2703. ut_ad(dst != src);
  2704. memcpy(dst, src, size);
  2705. }
  2706. } else {
  2707. if (!buf_page_verify_crypt_checksum(
  2708. src, callback.get_space_flags())) {
  2709. goto page_corrupted;
  2710. }
  2711. decrypted = fil_space_decrypt(
  2712. actual_space_id,
  2713. iter.crypt_data, dst,
  2714. callback.physical_size(),
  2715. callback.get_space_flags(),
  2716. src, &err);
  2717. if (err != DB_SUCCESS) {
  2718. goto func_exit;
  2719. }
  2720. if (!decrypted) {
  2721. goto not_encrypted;
  2722. }
  2723. updated = true;
  2724. }
  2725. /* For full_crc32 format, skip checksum check
  2726. after decryption. */
  2727. bool skip_checksum_check = full_crc32 && encrypted;
  2728. /* If the original page is page_compressed, we need
  2729. to decompress it before adjusting further. */
  2730. if (page_compressed) {
  2731. ulint compress_length = fil_page_decompress(
  2732. page_compress_buf, dst,
  2733. callback.get_space_flags());
  2734. ut_ad(compress_length != srv_page_size);
  2735. if (compress_length == 0) {
  2736. goto page_corrupted;
  2737. }
  2738. updated = true;
  2739. } else if (!skip_checksum_check
  2740. && buf_page_is_corrupted(
  2741. false,
  2742. encrypted && !frame_changed
  2743. ? dst : src,
  2744. callback.get_space_flags())) {
  2745. goto page_corrupted;
  2746. }
  2747. if ((err = callback(block)) != DB_SUCCESS) {
  2748. goto func_exit;
  2749. } else if (!updated) {
  2750. updated = buf_block_get_state(block)
  2751. == BUF_BLOCK_FILE_PAGE;
  2752. }
  2753. /* If the tablespace is encrypted, we use an additional
  2754. temporary scratch area into which pages are read
  2755. for decryption: readptr == crypt_io_buffer != io_buffer.
  2756. The destination for decryption is a buffer pool block,
  2757. block->frame == dst == io_buffer, which is updated.
  2758. Pages that did not require decryption, even though the
  2759. tablespace is marked as encrypted, are not copied;
  2760. instead, block->frame is set to src == readptr.
  2761. For encryption we again use a temporary scratch area,
  2762. writeptr != io_buffer == dst,
  2763. which is then written to the tablespace.
  2764. (1) For normal tables io_buffer == dst == writeptr
  2765. (2) For only page compressed tables
  2766. io_buffer == dst == writeptr
  2767. (3) For encrypted (and page compressed)
  2768. readptr != io_buffer == dst != writeptr
  2769. */
  2770. ut_ad(!encrypted && !page_compressed ?
  2771. src == dst && dst == writeptr + (i * size):1);
  2772. ut_ad(page_compressed && !encrypted ?
  2773. src == dst && dst == writeptr + (i * size):1);
  2774. ut_ad(encrypted ?
  2775. src != dst && dst != writeptr + (i * size):1);
  2776. /* When a tablespace is encrypted or compressed, its
  2777. first page (i.e. page 0) is not encrypted or
  2778. compressed, so there is no need to copy the frame. */
  2779. if (encrypted && block->page.id.page_no() != 0) {
  2780. byte *local_frame = callback.get_frame(block);
  2781. ut_ad((writeptr + (i * size)) != local_frame);
  2782. memcpy((writeptr + (i * size)), local_frame, size);
  2783. }
  2784. if (frame_changed) {
  2785. if (block->page.zip.data) {
  2786. block->page.zip.data = dst;
  2787. } else {
  2788. block->frame = dst;
  2789. }
  2790. }
  2791. src = io_buffer + (i * size);
  2792. if (page_compressed) {
  2793. updated = true;
  2794. if (ulint len = fil_page_compress(
  2795. src,
  2796. page_compress_buf,
  2797. callback.get_space_flags(),
  2798. 512,/* FIXME: proper block size */
  2799. encrypted)) {
  2800. /* FIXME: remove memcpy() */
  2801. memcpy(src, page_compress_buf, len);
  2802. memset(src + len, 0,
  2803. srv_page_size - len);
  2804. }
  2805. }
  2806. /* Encrypt the page if encryption was used. */
  2807. if (encrypted && decrypted) {
  2808. byte *dest = writeptr + i * size;
  2809. byte* tmp = fil_encrypt_buf(
  2810. iter.crypt_data,
  2811. block->page.id.space(),
  2812. block->page.id.page_no(),
  2813. src, block->zip_size(), dest,
  2814. full_crc32);
  2815. if (tmp == src) {
  2816. /* TODO: remove unnecessary memcpy's */
  2817. ut_ad(dest != src);
  2818. memcpy(dest, src, size);
  2819. }
  2820. updated = true;
  2821. }
  2822. /* Write checksum for the compressed full crc32 page.*/
  2823. if (full_crc32 && page_compressed) {
  2824. ut_ad(updated);
  2825. byte* dest = writeptr + i * size;
  2826. ut_d(bool comp = false);
  2827. ut_d(bool corrupt = false);
  2828. ulint size = buf_page_full_crc32_size(
  2829. dest,
  2830. #ifdef UNIV_DEBUG
  2831. &comp, &corrupt
  2832. #else
  2833. NULL, NULL
  2834. #endif
  2835. );
  2836. ut_ad(!comp == (size == srv_page_size));
  2837. ut_ad(!corrupt);
  2838. mach_write_to_4(dest + (size - 4),
  2839. ut_crc32(dest, size - 4));
  2840. }
  2841. }
  2842. /* A page was updated in the set, write back to disk. */
  2843. if (updated) {
  2844. IORequest write_request(IORequest::WRITE);
  2845. err = os_file_write(write_request,
  2846. iter.filepath, iter.file,
  2847. writeptr, offset, n_bytes);
  2848. if (err != DB_SUCCESS) {
  2849. goto func_exit;
  2850. }
  2851. }
  2852. }
  2853. func_exit:
  2854. free(page_compress_buf);
  2855. return err;
  2856. }
  2857. /********************************************************************//**
  2858. Iterate over all the pages in the tablespace.
  2859. @param table - the table definition in the server
  2860. @param n_io_buffers - number of blocks to read and write together
  2861. @param callback - functor that will do the page updates
  2862. @return DB_SUCCESS or error code */
  2863. static
  2864. dberr_t
  2865. fil_tablespace_iterate(
  2866. /*===================*/
  2867. dict_table_t* table,
  2868. ulint n_io_buffers,
  2869. AbstractCallback& callback)
  2870. {
  2871. dberr_t err;
  2872. pfs_os_file_t file;
  2873. char* filepath;
  2874. ut_a(n_io_buffers > 0);
  2875. ut_ad(!srv_read_only_mode);
  2876. DBUG_EXECUTE_IF("ib_import_trigger_corruption_1",
  2877. return(DB_CORRUPTION););
  2878. /* Make sure the data_dir_path is set. */
  2879. dict_get_and_save_data_dir_path(table, false);
  2880. if (DICT_TF_HAS_DATA_DIR(table->flags)) {
  2881. ut_a(table->data_dir_path);
  2882. filepath = fil_make_filepath(
  2883. table->data_dir_path, table->name.m_name, IBD, true);
  2884. } else {
  2885. filepath = fil_make_filepath(
  2886. NULL, table->name.m_name, IBD, false);
  2887. }
  2888. if (!filepath) {
  2889. return(DB_OUT_OF_MEMORY);
  2890. } else {
  2891. bool success;
  2892. file = os_file_create_simple_no_error_handling(
  2893. innodb_data_file_key, filepath,
  2894. OS_FILE_OPEN, OS_FILE_READ_WRITE, false, &success);
  2895. if (!success) {
  2896. /* The following call prints an error message */
  2897. os_file_get_last_error(true);
  2898. ib::error() << "Trying to import a tablespace,"
  2899. " but could not open the tablespace file "
  2900. << filepath;
  2901. ut_free(filepath);
  2902. return DB_TABLESPACE_NOT_FOUND;
  2903. } else {
  2904. err = DB_SUCCESS;
  2905. }
  2906. }
  2907. callback.set_file(filepath, file);
  2908. os_offset_t file_size = os_file_get_size(file);
  2909. ut_a(file_size != (os_offset_t) -1);
  2910. /* Allocate a page to read in the tablespace header, so that we
  2911. can determine the page size and zip_size (if it is compressed).
  2912. We allocate an extra page in case it is a compressed table. One
  2913. page is to ensure alignment. */
  2914. void* page_ptr = ut_malloc_nokey(3U << srv_page_size_shift);
  2915. byte* page = static_cast<byte*>(ut_align(page_ptr, srv_page_size));
  2916. buf_block_t* block = reinterpret_cast<buf_block_t*>
  2917. (ut_zalloc_nokey(sizeof *block));
  2918. block->frame = page;
  2919. block->page.id = page_id_t(0, 0);
  2920. block->page.io_fix = BUF_IO_NONE;
  2921. block->page.buf_fix_count = 1;
  2922. block->page.state = BUF_BLOCK_FILE_PAGE;
  2923. /* Read the first page and determine the page and zip size. */
  2924. IORequest request(IORequest::READ);
  2925. request.disable_partial_io_warnings();
  2926. err = os_file_read_no_error_handling(request, file, page, 0,
  2927. srv_page_size, 0);
  2928. if (err == DB_SUCCESS) {
  2929. err = callback.init(file_size, block);
  2930. }
  2931. if (err == DB_SUCCESS) {
  2932. block->page.id = page_id_t(callback.get_space_id(), 0);
  2933. if (ulint zip_size = callback.get_zip_size()) {
  2934. page_zip_set_size(&block->page.zip, zip_size);
  2935. /* ROW_FORMAT=COMPRESSED is not optimised for block IO
  2936. for now. We do the IMPORT page by page. */
  2937. n_io_buffers = 1;
  2938. }
  2939. fil_iterator_t iter;
  2940. /* read (optional) crypt data */
  2941. iter.crypt_data = fil_space_read_crypt_data(
  2942. callback.get_zip_size(), page);
  2943. /* If tablespace is encrypted, it needs extra buffers */
  2944. if (iter.crypt_data && n_io_buffers > 1) {
  2945. /* decrease io buffers so that memory
  2946. consumption will not double */
  2947. n_io_buffers /= 2;
  2948. }
  2949. iter.file = file;
  2950. iter.start = 0;
  2951. iter.end = file_size;
  2952. iter.filepath = filepath;
  2953. iter.file_size = file_size;
  2954. iter.n_io_buffers = n_io_buffers;
  2955. /* Add an extra page for compressed page scratch area. */
  2956. void* io_buffer = ut_malloc_nokey(
  2957. (2 + iter.n_io_buffers) << srv_page_size_shift);
  2958. iter.io_buffer = static_cast<byte*>(
  2959. ut_align(io_buffer, srv_page_size));
  2960. void* crypt_io_buffer = NULL;
  2961. if (iter.crypt_data) {
  2962. crypt_io_buffer = ut_malloc_nokey(
  2963. (2 + iter.n_io_buffers)
  2964. << srv_page_size_shift);
  2965. iter.crypt_io_buffer = static_cast<byte*>(
  2966. ut_align(crypt_io_buffer, srv_page_size));
  2967. }
  2968. if (block->page.zip.ssize) {
  2969. ut_ad(iter.n_io_buffers == 1);
  2970. block->frame = iter.io_buffer;
  2971. block->page.zip.data = block->frame + srv_page_size;
  2972. }
  2973. err = fil_iterate(iter, block, callback);
  2974. if (iter.crypt_data) {
  2975. fil_space_destroy_crypt_data(&iter.crypt_data);
  2976. }
  2977. ut_free(crypt_io_buffer);
  2978. ut_free(io_buffer);
  2979. }
  2980. if (err == DB_SUCCESS) {
  2981. ib::info() << "Sync to disk";
  2982. if (!os_file_flush(file)) {
  2983. ib::info() << "os_file_flush() failed!";
  2984. err = DB_IO_ERROR;
  2985. } else {
  2986. ib::info() << "Sync to disk - done!";
  2987. }
  2988. }
  2989. os_file_close(file);
  2990. ut_free(page_ptr);
  2991. ut_free(filepath);
  2992. ut_free(block);
  2993. return(err);
  2994. }
  2995. /*****************************************************************//**
  2996. Imports a tablespace. The space id in the .ibd file must match the space id
  2997. of the table in the data dictionary.
  2998. @return error code or DB_SUCCESS */
  2999. dberr_t
  3000. row_import_for_mysql(
  3001. /*=================*/
  3002. dict_table_t* table, /*!< in/out: table */
  3003. row_prebuilt_t* prebuilt) /*!< in: prebuilt struct in MySQL */
  3004. {
  3005. dberr_t err;
  3006. trx_t* trx;
  3007. ib_uint64_t autoinc = 0;
  3008. char* filepath = NULL;
  3009. ulint space_flags MY_ATTRIBUTE((unused));
  3010. /* The caller assured that this is not read_only_mode and that no
  3011. temporary tablespace is being imported. */
  3012. ut_ad(!srv_read_only_mode);
  3013. ut_ad(!table->is_temporary());
  3014. ut_ad(table->space_id);
  3015. ut_ad(table->space_id < SRV_LOG_SPACE_FIRST_ID);
  3016. ut_ad(prebuilt->trx);
  3017. ut_ad(!table->is_readable());
  3018. ibuf_delete_for_discarded_space(table->space_id);
  3019. trx_start_if_not_started(prebuilt->trx, true);
  3020. trx = trx_create();
  3021. /* So that the table is not DROPped during recovery. */
  3022. trx_set_dict_operation(trx, TRX_DICT_OP_INDEX);
  3023. trx_start_if_not_started(trx, true);
  3024. /* So that we can send error messages to the user. */
  3025. trx->mysql_thd = prebuilt->trx->mysql_thd;
  3026. /* Ensure that the table will be dropped by trx_rollback_active()
  3027. in case of a crash. */
  3028. trx->table_id = table->id;
  3029. /* Assign an undo segment for the transaction, so that the
  3030. transaction will be recovered after a crash. */
  3031. /* TODO: Do not write any undo log for the IMPORT cleanup. */
  3032. {
  3033. mtr_t mtr;
  3034. mtr.start();
  3035. trx_undo_assign(trx, &err, &mtr);
  3036. mtr.commit();
  3037. }
  3038. DBUG_EXECUTE_IF("ib_import_undo_assign_failure",
  3039. err = DB_TOO_MANY_CONCURRENT_TRXS;);
  3040. if (err != DB_SUCCESS) {
  3041. return(row_import_cleanup(prebuilt, trx, err));
  3042. } else if (trx->rsegs.m_redo.undo == 0) {
  3043. err = DB_TOO_MANY_CONCURRENT_TRXS;
  3044. return(row_import_cleanup(prebuilt, trx, err));
  3045. }
  3046. prebuilt->trx->op_info = "read meta-data file";
  3047. /* Prevent DDL operations while we are checking. */
  3048. rw_lock_s_lock(&dict_sys.latch);
  3049. row_import cfg;
  3050. err = row_import_read_cfg(table, trx->mysql_thd, cfg);
  3051. /* Check if the table column definitions match the contents
  3052. of the config file. */
  3053. if (err == DB_SUCCESS) {
  3054. /* We have a schema file, try and match it with our
  3055. data dictionary. */
  3056. err = cfg.match_schema(trx->mysql_thd);
  3057. /* Update index->page and SYS_INDEXES.PAGE_NO to match the
  3058. B-tree root page numbers in the tablespace. Use the index
  3059. name from the .cfg file to find a match. */
  3060. if (err == DB_SUCCESS) {
  3061. cfg.set_root_by_name();
  3062. autoinc = cfg.m_autoinc;
  3063. }
  3064. rw_lock_s_unlock(&dict_sys.latch);
  3065. DBUG_EXECUTE_IF("ib_import_set_index_root_failure",
  3066. err = DB_TOO_MANY_CONCURRENT_TRXS;);
  3067. } else if (cfg.m_missing) {
  3068. rw_lock_s_unlock(&dict_sys.latch);
  3069. /* We don't have a schema file, we will have to discover
  3070. the index root pages from the .ibd file and skip the schema
  3071. matching step. */
  3072. ut_a(err == DB_FAIL);
  3073. cfg.m_zip_size = 0;
  3074. FetchIndexRootPages fetchIndexRootPages(table, trx);
  3075. err = fil_tablespace_iterate(
  3076. table, IO_BUFFER_SIZE(srv_page_size),
  3077. fetchIndexRootPages);
  3078. if (err == DB_SUCCESS) {
  3079. err = fetchIndexRootPages.build_row_import(&cfg);
  3080. /* Update index->page and SYS_INDEXES.PAGE_NO
  3081. to match the B-tree root page numbers in the
  3082. tablespace. */
  3083. if (err == DB_SUCCESS) {
  3084. err = cfg.set_root_by_heuristic();
  3085. }
  3086. }
  3087. space_flags = fetchIndexRootPages.get_space_flags();
  3088. } else {
  3089. rw_lock_s_unlock(&dict_sys.latch);
  3090. }
  3091. if (err != DB_SUCCESS) {
  3092. return(row_import_error(prebuilt, trx, err));
  3093. }
  3094. prebuilt->trx->op_info = "importing tablespace";
  3095. ib::info() << "Phase I - Update all pages";
  3096. /* Iterate over all the pages and do the sanity checking and
  3097. the conversion required to import the tablespace. */
  3098. PageConverter converter(&cfg, table->space_id, trx);
  3099. /* Set the IO buffer size in pages. */
  3100. err = fil_tablespace_iterate(
  3101. table, IO_BUFFER_SIZE(cfg.m_zip_size ? cfg.m_zip_size
  3102. : srv_page_size), converter);
  3103. DBUG_EXECUTE_IF("ib_import_reset_space_and_lsn_failure",
  3104. err = DB_TOO_MANY_CONCURRENT_TRXS;);
  3105. #ifdef BTR_CUR_HASH_ADAPT
  3106. /* On DISCARD TABLESPACE, we did not drop any adaptive hash
  3107. index entries. If we replaced the discarded tablespace with a
  3108. smaller one here, there could still be some adaptive hash
  3109. index entries that point to cached garbage pages in the buffer
  3110. pool, because PageConverter::operator() only evicted those
  3111. pages that were replaced by the imported pages. We must
  3112. discard all remaining adaptive hash index entries, because the
  3113. adaptive hash index must be a subset of the table contents;
  3114. false positives are not tolerated. */
  3115. while (buf_LRU_drop_page_hash_for_tablespace(table)) {
  3116. if (trx_is_interrupted(trx)
  3117. || srv_shutdown_state != SRV_SHUTDOWN_NONE) {
  3118. err = DB_INTERRUPTED;
  3119. break;
  3120. }
  3121. }
  3122. #endif /* BTR_CUR_HASH_ADAPT */
  3123. if (err != DB_SUCCESS) {
  3124. char table_name[MAX_FULL_NAME_LEN + 1];
  3125. innobase_format_name(
  3126. table_name, sizeof(table_name),
  3127. table->name.m_name);
  3128. if (err != DB_DECRYPTION_FAILED) {
  3129. ib_errf(trx->mysql_thd, IB_LOG_LEVEL_ERROR,
  3130. ER_INTERNAL_ERROR,
  3131. "Cannot reset LSNs in table %s : %s",
  3132. table_name, ut_strerr(err));
  3133. }
  3134. return(row_import_cleanup(prebuilt, trx, err));
  3135. }
  3136. row_mysql_lock_data_dictionary(trx);
  3137. /* If the table is stored in a remote tablespace, we need to
  3138. determine that filepath from the link file and system tables.
  3139. Find the space ID in SYS_TABLES since this is an ALTER TABLE. */
  3140. dict_get_and_save_data_dir_path(table, true);
  3141. if (DICT_TF_HAS_DATA_DIR(table->flags)) {
  3142. ut_a(table->data_dir_path);
  3143. filepath = fil_make_filepath(
  3144. table->data_dir_path, table->name.m_name, IBD, true);
  3145. } else {
  3146. filepath = fil_make_filepath(
  3147. NULL, table->name.m_name, IBD, false);
  3148. }
  3149. DBUG_EXECUTE_IF(
  3150. "ib_import_OOM_15",
  3151. ut_free(filepath);
  3152. filepath = NULL;
  3153. );
  3154. if (filepath == NULL) {
  3155. row_mysql_unlock_data_dictionary(trx);
  3156. return(row_import_cleanup(prebuilt, trx, DB_OUT_OF_MEMORY));
  3157. }
  3158. /* Open the tablespace so that we can access via the buffer pool.
  3159. We set the 2nd param (fix_dict = true) here because we already
  3160. have an x-lock on dict_sys.latch and dict_sys.mutex.
  3161. The tablespace is initially opened as a temporary one, because
  3162. we will not be writing any redo log for it before we have invoked
  3163. fil_space_t::set_imported() to declare it a persistent tablespace. */
  3164. ulint fsp_flags = dict_tf_to_fsp_flags(table->flags);
  3165. table->space = fil_ibd_open(
  3166. true, true, FIL_TYPE_IMPORT, table->space_id,
  3167. fsp_flags, table->name, filepath, &err);
  3168. ut_ad((table->space == NULL) == (err != DB_SUCCESS));
  3169. DBUG_EXECUTE_IF("ib_import_open_tablespace_failure",
  3170. err = DB_TABLESPACE_NOT_FOUND; table->space = NULL;);
  3171. if (!table->space) {
  3172. row_mysql_unlock_data_dictionary(trx);
  3173. ib_senderrf(trx->mysql_thd, IB_LOG_LEVEL_ERROR,
  3174. ER_GET_ERRMSG,
  3175. err, ut_strerr(err), filepath);
  3176. ut_free(filepath);
  3177. return(row_import_cleanup(prebuilt, trx, err));
  3178. }
  3179. row_mysql_unlock_data_dictionary(trx);
  3180. ut_free(filepath);
  3181. err = ibuf_check_bitmap_on_import(trx, table->space);
  3182. DBUG_EXECUTE_IF("ib_import_check_bitmap_failure", err = DB_CORRUPTION;);
  3183. if (err != DB_SUCCESS) {
  3184. return(row_import_cleanup(prebuilt, trx, err));
  3185. }
  3186. /* The first index must always be the clustered index. */
  3187. dict_index_t* index = dict_table_get_first_index(table);
  3188. if (!dict_index_is_clust(index)) {
  3189. return(row_import_error(prebuilt, trx, DB_CORRUPTION));
  3190. }
  3191. /* Update the Btree segment headers for index node and
  3192. leaf nodes in the root page. Set the new space id. */
  3193. err = btr_root_adjust_on_import(index);
  3194. DBUG_EXECUTE_IF("ib_import_cluster_root_adjust_failure",
  3195. err = DB_CORRUPTION;);
  3196. if (err != DB_SUCCESS) {
  3197. return(row_import_error(prebuilt, trx, err));
  3198. } else if (cfg.requires_purge(index->name)) {
  3199. /* Purge any delete-marked records that couldn't be
  3200. purged during the page conversion phase from the
  3201. clustered index. */
  3202. IndexPurge purge(trx, index);
  3203. trx->op_info = "cluster: purging delete marked records";
  3204. err = purge.garbage_collect();
  3205. trx->op_info = "";
  3206. }
  3207. DBUG_EXECUTE_IF("ib_import_cluster_failure", err = DB_CORRUPTION;);
  3208. if (err != DB_SUCCESS) {
  3209. return(row_import_error(prebuilt, trx, err));
  3210. }
  3211. /* For secondary indexes, purge any records that couldn't be purged
  3212. during the page conversion phase. */
  3213. err = row_import_adjust_root_pages_of_secondary_indexes(
  3214. trx, table, cfg);
  3215. DBUG_EXECUTE_IF("ib_import_sec_root_adjust_failure",
  3216. err = DB_CORRUPTION;);
  3217. if (err != DB_SUCCESS) {
  3218. return(row_import_error(prebuilt, trx, err));
  3219. }
  3220. /* Ensure that the next available DB_ROW_ID is not smaller than
  3221. any DB_ROW_ID stored in the table. */
  3222. if (prebuilt->clust_index_was_generated) {
  3223. row_import_set_sys_max_row_id(prebuilt, table);
  3224. }
  3225. ib::info() << "Phase III - Flush changes to disk";
  3226. /* Ensure that all pages dirtied during the IMPORT make it to disk.
  3227. The only dirty pages generated should be from the pessimistic purge
  3228. of delete marked records that couldn't be purged in Phase I. */
  3229. {
  3230. FlushObserver observer(prebuilt->table->space, trx, NULL);
  3231. buf_LRU_flush_or_remove_pages(prebuilt->table->space_id,
  3232. &observer);
  3233. if (observer.is_interrupted()) {
  3234. ib::info() << "Phase III - Flush interrupted";
  3235. return(row_import_error(prebuilt, trx,
  3236. DB_INTERRUPTED));
  3237. }
  3238. }
  3239. ib::info() << "Phase IV - Flush complete";
  3240. prebuilt->table->space->set_imported();
  3241. /* The dictionary latches will be released in row_import_cleanup()
  3242. after the transaction commit, for both success and error. */
  3243. row_mysql_lock_data_dictionary(trx);
  3244. /* Update the root pages of the table's indexes. */
  3245. err = row_import_update_index_root(trx, table, false);
  3246. if (err != DB_SUCCESS) {
  3247. return(row_import_error(prebuilt, trx, err));
  3248. }
  3249. err = row_import_update_discarded_flag(trx, table->id, false);
  3250. if (err != DB_SUCCESS) {
  3251. return(row_import_error(prebuilt, trx, err));
  3252. }
  3253. table->file_unreadable = false;
  3254. table->flags2 &= ~DICT_TF2_DISCARDED;
  3255. /* Set autoinc value read from .cfg file, if one was specified.
  3256. Otherwise, keep the PAGE_ROOT_AUTO_INC as is. */
  3257. if (autoinc) {
  3258. ib::info() << table->name << " autoinc value set to "
  3259. << autoinc;
  3260. table->autoinc = autoinc--;
  3261. btr_write_autoinc(dict_table_get_first_index(table), autoinc);
  3262. }
  3263. return(row_import_cleanup(prebuilt, trx, err));
  3264. }