MDEV-18644: Support full_crc32 for page_compressed

This is a follow-up task to MDEV-12026, which introduced innodb_checksum_algorithm=full_crc32 and a simpler page format. MDEV-12026 did not enable full_crc32 for page_compressed tables, which we will be doing now. This is joint work with Thirunarayanan Balathandayuthapani.

For innodb_checksum_algorithm=full_crc32 we change the page_compressed format as follows:

FIL_PAGE_TYPE: The most significant bit will be set to indicate page_compressed format. The least significant bits will contain the compressed page size, rounded up to a multiple of 256 bytes.

The checksum will be stored in the last 4 bytes of the page (whether it is the full page or a page_compressed page whose size is determined by FIL_PAGE_TYPE), covering all preceding bytes of the page. If encryption is used, then the page will be encrypted between compression and computing the checksum.

For page_compressed, FIL_PAGE_LSN will not be repeated at the end of the page.

FSP_SPACE_FLAGS (already implemented as part of MDEV-12026): We will store the innodb_compression_algorithm that may be used to compress pages. Previously, the choice of algorithm was written to each compressed data page separately, and one would be unable to know in advance which compression algorithm(s) are used.

fil_space_t::full_crc32_page_compressed_len(): Determine if the page_compressed algorithm of the tablespace needs to know the exact length of the compressed data. If yes, we will reserve and write an extra byte for this right before the checksum.
buf_page_is_compressed(): Determine if a page uses page_compressed (in any innodb_checksum_algorithm).
fil_page_decompress(): Pass also fil_space_t::flags so that the format can be determined.
buf_page_is_zeroes(): Check if a page is full of zero bytes.
buf_page_full_crc32_is_corrupted(): Renamed from buf_encrypted_full_crc32_page_is_corrupted(). For full_crc32, we always simply validate the checksum against the page contents, while the physical page size is explicitly specified by an unencrypted part of the page header.
buf_page_full_crc32_size(): Determine the size of a full_crc32 page.
buf_dblwr_check_page_lsn(): Make this a debug-only function, because it involves potentially costly lookups of fil_space_t.
create_table_info_t::check_table_options(), ha_innobase::check_if_supported_inplace_alter(): Do allow the creation of SPATIAL INDEX with full_crc32 also when page_compressed is used.
commit_cache_norebuild(): Preserve the compression algorithm when updating the page_compression_level.
dict_tf_to_fsp_flags(): Set the flags for the page compression algorithm.

FIXME: Maybe there should be a table option page_compression_algorithm and a session variable to back it?
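A minimal sketch of the FIL_PAGE_TYPE encoding described above. The helper names are hypothetical; the actual size computation lives in buf_page_full_crc32_size() and related functions.

    #include <cstdint>

    // Most significant bit of the 16-bit FIL_PAGE_TYPE marks a full_crc32
    // page_compressed page; the remaining bits carry the compressed size
    // rounded up to a multiple of 256 bytes.
    static uint16_t encode_compressed_page_type(uint32_t compressed_len)
    {
        const uint32_t rounded = (compressed_len + 255) & ~255U; // round up to 256
        return uint16_t(1U << 15          // page_compressed marker
                        | rounded / 256); // physical size in 256-byte units
    }

    static uint32_t decode_compressed_page_size(uint16_t fil_page_type)
    {
        // Low 15 bits give the size in 256-byte units; per the commit message,
        // the checksum occupies the last 4 bytes of that physical size.
        return uint32_t(fil_page_type & 0x7fffU) * 256;
    }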
7 years ago
MDEV-15662 Instant DROP COLUMN or changing the order of columns

Allow ADD COLUMN anywhere in a table, not only adding as the last column. Allow instant DROP COLUMN and instant changing of the order of columns.

The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible.

Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of the mblob is indicated by setting the delete-mark flag in the metadata record.

The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position.

Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields.

For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob.

If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from the wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags.

Limitations (future work):
MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
MDEV-17468 Avoid table rebuild on operations on generated columns
MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large

btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure.
dict_col_t::DROPPED: Magic value for dict_col_t::ind.
dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN".
dict_col_t::is_added(): Renamed from dict_col_t::is_instant().
dtype_t::metadata_blob_init(): Initialize the mblob data type.
dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record.
dict_table_t::instant: Metadata about dropped or reordered columns.
dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called.
dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well.
dict_table_t::find(): Find an old column based on a new column number.
dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob.
dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns().
dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns().
dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position.
ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.)
instant_alter_column_possible(): Add a parameter for the InnoDB table, to check for additional conditions, such as the maximum number of index fields.
ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns.
innobase_build_col_map(): Skip added virtual columns.
prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later.
commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try().
innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table.
innodb_update_cols(): Renamed from innodb_update_n_cols().
innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error.
innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col().
innobase_instant_add_col(): Replace the parameter dfield with type.
innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL.
innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols().
innobase_rename_column_try(): Skip dropped columns.
commit_cache_norebuild(): Update table->fts->doc_col.
dict_mem_table_col_rename_low(): Skip dropped columns.
trx_undo_rec_get_partial_row(): Skip dropped columns.
trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly.
trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as a delete-mark and an insert.
row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata.
row_undo_mod_clust_low(): On metadata rollback, roll back the root page too.
row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records.
row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback).
row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column.
row_upd_clust_rec_by_insert_inherit_func(): On the update of the PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value.
btr_index_rec_validate(): Recognize the metadata record.
btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata.
btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to the generic ALTER TABLE format.
btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it.
btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains.
btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways.
btr_cur_trim(): Handle the mblob record.
row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record.
btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records.
btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size.
btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE.
REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only.
REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns.
rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record.
rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold.
rec_is_add_metadata(): Check if a record is the MDEV-11369 ADD COLUMN metadata record (not the more generic instant ALTER TABLE).
rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet.
rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple.
rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. With mblob=true, process a record with a metadata BLOB.
rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field.
rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record.
row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple.
row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted.
row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column.

Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this. Thanks to Matthias Leich for testing.
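A simplified, hypothetical serializer for the mblob payload described above (field count followed by per-field column information). The real dict_table_t::serialise_columns() may use different field widths and bit positions; this only illustrates what is recorded.

    #include <cstdint>
    #include <vector>

    struct mblob_col {
        bool     dropped;       // was the column instantly dropped?
        bool     not_null;      // NOT NULL flag (dropped columns only)
        uint16_t fixed_len;     // fixed length in bytes, 0 for variable-length
        bool     len_gt_255;    // variable-length column may exceed 255 bytes
        uint16_t position;      // column position (surviving columns only)
    };

    static std::vector<uint8_t> serialise_mblob(const std::vector<mblob_col>& cols)
    {
        std::vector<uint8_t> out;
        // number of clustered index fields, big-endian 32-bit
        const uint32_t n = uint32_t(cols.size());
        for (int s = 24; s >= 0; s -= 8) out.push_back(uint8_t(n >> s));

        for (const mblob_col& c : cols) {
            uint16_t v;
            if (c.dropped) {
                // illustrative bit layout: dropped marker | NOT NULL | >255 | fixed length
                v = uint16_t(1U << 15 | unsigned(c.not_null) << 14
                             | unsigned(c.len_gt_255) << 13 | (c.fixed_len & 0x1fffU));
            } else {
                v = c.position;   // surviving column: store its position
            }
            out.push_back(uint8_t(v >> 8));
            out.push_back(uint8_t(v));
        }
        return out;
    }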
7 years ago
MDEV-20950 Reduce size of record offsets

offset_t: a type that represents one record offset. It is an unsigned short int.

Many functions: replace ulint with offset_t.

btr_pcur_restore_position_func(), page_validate(), row_ins_scan_sec_index_for_duplicate(), row_upd_clust_rec_by_insert_inherit_func(), row_vers_impl_x_locked_low(), trx_undo_prev_version_build(): allocate record offsets on the stack instead of waiting for rec_get_offsets() to allocate them from mem_heap_t, thus reducing memory allocations.

RECORD_OFFSET, INDEX_OFFSET: it is now less convenient to store pointers in an offset_t* array, because one pointer now occupies several offset_t slots. These constants are the start indexes into the array where the pointer values are stored.

REC_OFFS_HEADER_SIZE: adjusted for the new reality.

REC_OFFS_NORMAL_SIZE: increased from 100 to 300, which means fewer heap allocations. sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) is now 600 bytes, smaller than the previous 800 bytes.

REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality.

rem0rec.h, rem0rec.ic, rem0rec.cc: various argument, return value and local variable types were changed to fix numerous integer conversion issues.

enum field_type_t: introduces the concept of offset types, which replaces the old offset flags. As in the earlier version, the 2 upper bits are used to store the offset type, and this enum represents those types.

REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed.

get_type(), set_type(), get_value(), combine(): convenience functions for working with offsets and their types.

rec_offs_base()[0]: still uses the old scheme with the flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL.

rec_offs_base()[i]: these now have type offset_t; the two upper bits contain the type.
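A compilable sketch of the offset_t representation and the helpers named above. The enumerator names and the exact values beyond "the two upper bits hold the type" are assumptions for illustration.

    #include <cstdint>

    typedef uint16_t offset_t;   // one record offset now fits in 16 bits

    enum field_type_t : uint16_t {
        STORED_IN_RECORD = 0U << 14,  // ordinary field, value stored in the record
        SQL_NULL         = 1U << 14,  // field is SQL NULL
        STORED_OFFPAGE   = 2U << 14,  // field stored externally (BLOB pointer)
        DEFAULT_VALUE    = 3U << 14   // instantly added column, value not in record
    };

    static const uint16_t TYPE_MASK  = uint16_t(3U << 14);
    static const uint16_t VALUE_MASK = uint16_t(~(3U << 14));

    inline field_type_t get_type(offset_t o)  { return field_type_t(o & TYPE_MASK); }
    inline offset_t     get_value(offset_t o) { return offset_t(o & VALUE_MASK); }

    inline offset_t combine(offset_t value, field_type_t type)
    {
        return offset_t(get_value(value) | type);  // keep the value, set the type bits
    }

    inline void set_type(offset_t& o, field_type_t type)
    {
        o = combine(o, type);
    }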
6 years ago
MDEV-12266: Change dict_table_t::space to fil_space_t*

InnoDB always keeps all tablespaces in the fil_system cache. The fil_system.LRU is only for closing file handles; the fil_space_t and fil_node_t for all data files will remain in main memory. Between startup and shutdown, they can only be created and removed by DDL statements. Therefore, we can let dict_table_t::space point directly to the fil_space_t.

dict_table_t::space_id: A numeric tablespace ID for the corner cases where we do not have a tablespace. The most prominent examples are ALTER TABLE...DISCARD TABLESPACE or a missing or corrupted file.

There are a few functional differences; most notably:
(1) DROP TABLE will delete matching .ibd and .cfg files, even if they were not attached to the data dictionary.
(2) Some error messages will report file names instead of numeric IDs.

There still are many functions that use numeric tablespace IDs instead of fil_space_t*, and many functions could be converted to fil_space_t member functions. Also, Tablespace and Datafile should be merged with fil_space_t and fil_node_t. page_id_t and buf_page_get_gen() could use fil_space_t& instead of a numeric ID, and after moving to a single buffer pool (MDEV-15058), buf_pool_t::page_hash could be moved to fil_space_t::page_hash.

FilSpace: Remove. Only a few calls to fil_space_acquire() will remain, and gradually they should be removed.
mtr_t::set_named_space_id(ulint): Renamed from set_named_space(), to prevent accidental calls to this slower function. Very few callers remain.
fseg_create(), fsp_reserve_free_extents(): Take fil_space_t* as a parameter instead of a space_id.
fil_space_t::rename(): Wrapper for fil_rename_tablespace_check(), fil_name_write_rename(), fil_rename_tablespace(). Mariabackup passes the parameter log=false; InnoDB passes log=true.
dict_mem_table_create(): Take fil_space_t* instead of space_id as a parameter.
dict_process_sys_tables_rec_and_mtr_commit(): Replace the parameter 'status' with 'bool cached'.
dict_get_and_save_data_dir_path(): Avoid copying the fil_node_t::name.
fil_ibd_open(): Return the tablespace.
fil_space_t::set_imported(): Replaces fil_space_set_imported().
truncate_t: Change many member function parameters to fil_space_t*, and remove page_size parameters.
row_truncate_prepare(): Merge to its only caller.
row_drop_table_from_cache(): Assert that the table is persistent.
dict_create_sys_indexes_tuple(): Write SYS_INDEXES.SPACE=FIL_NULL if the tablespace has been discarded.
row_import_update_discarded_flag(): Remove a constant parameter.
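A rough picture of the resulting relationship; the member names space and space_id come from the commit message, while the surrounding declarations are illustrative only.

    #include <cstdint>

    struct fil_space_t;            // tablespace object, cached in fil_system

    struct dict_table_t {
        fil_space_t* space;        // direct pointer; may be nullptr after
                                   // DISCARD TABLESPACE or for a missing/corrupted file
        uint32_t     space_id;     // numeric ID, still meaningful when space == nullptr
        // ... many other members ...
    };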
8 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB

For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables.

This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.

Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE

The format of the clustered index (PRIMARY KEY) is changed as follows:

(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used, or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX.

(2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked.

(3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert.

(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes (sketched below). (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.)

We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN operations can use TRX_UNDO_UPD_EXIST_REC.
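Point (4) above describes a compact length prefix. A minimal sketch of one 1-or-2-byte encoding of n_fields - n_core_fields - 1, under the assumption that the high bit of the first byte flags the two-byte form; the actual on-disk byte layout may differ.

    #include <cstdint>
    #include <vector>

    // Append the added-field count to a record header buffer.
    static void encode_n_add_fields(std::vector<uint8_t>& out, unsigned n_add)
    {
        if (n_add < 0x80) {
            out.push_back(uint8_t(n_add));                 // 1 byte: values 0..127
        } else {
            out.push_back(uint8_t(0x80 | (n_add >> 8)));   // 2 bytes: high bit marks
            out.push_back(uint8_t(n_add));                 // the longer form
        }
    }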
This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively.

When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index.

UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column.
dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields. For secondary indexes, and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page.
dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. Dictionary objects will only be added to the cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to the cache.
dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and, if needed, set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point).
dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for PAGE_INSTANT (see the sketch after this commit message).
page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION.
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID, DB_ROLL_PTR. All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum, supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and the adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
8 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. 
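As a rough illustration of point (4) above, here is a minimal C++ sketch of one plausible way to encode n_fields-n_core_fields-1 in 1 or 2 bytes, using the high bit of the first byte to signal the 2-byte form. The helper names encode_n_add()/decode_n_add() are hypothetical; the actual on-disk layout is defined in rem0rec.h, not here.

// Illustrative sketch (not the exact on-disk layout): encode
// n_add = n_fields - n_core_fields - 1 in 1 or 2 bytes, using the
// high bit of the first byte to signal the 2-byte form.
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the number of bytes written (1 or 2).
static size_t encode_n_add(uint32_t n_add, uint8_t* out)
{
    assert(n_add < 0x8000);          // 15 usable bits in the 2-byte form
    if (n_add < 0x80) {              // small counts fit in a single byte
        out[0] = uint8_t(n_add);
        return 1;
    }
    out[0] = uint8_t(0x80 | (n_add >> 8));   // high bit marks the 2-byte form
    out[1] = uint8_t(n_add);
    return 2;
}

static uint32_t decode_n_add(const uint8_t* in, size_t* len)
{
    if (!(in[0] & 0x80)) { *len = 1; return in[0]; }
    *len = 2;
    return (uint32_t(in[0] & 0x7f) << 8) | in[1];
}

int main()
{
    uint8_t buf[2];
    size_t len = encode_n_add(3, buf);   // e.g. 4 columns added instantly
    size_t read_len;
    assert(decode_n_add(buf, &read_len) == 3 && len == 1 && read_len == 1);
    return 0;
}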
This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. 
Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. 
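The following is a minimal sketch of the bit packing implied by the PAGE_INSTANT description above, assuming the 13-bit value occupies the upper bits of the 16-bit field while the 3 direction bits stay in the least significant positions. The helper names are hypothetical stand-ins for page_set_instant(), page_get_instant() and page_get_direction().

// Sketch of packing PAGE_INSTANT (13 bits) and the direction (3 bits)
// into the 16-bit field that used to hold only PAGE_DIRECTION.
#include <cassert>
#include <cstdint>

static uint16_t set_instant(uint16_t field, uint16_t n_core_fields)
{
    assert(n_core_fields < (1u << 13));
    return uint16_t((n_core_fields << 3) | (field & 0x7));  // keep direction bits
}

static uint16_t get_instant(uint16_t field)   { return field >> 3; }
static uint8_t  get_direction(uint16_t field) { return field & 0x7; }

int main()
{
    uint16_t field = 0x0005;          // some pre-existing direction value
    field = set_instant(field, 11);   // clustered index had 11 core fields
    assert(get_instant(field) == 11 && get_direction(field) == 5);
    return 0;
}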
This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written. btr_index_rec_validate(): Account for instant ADD COLUMN. row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum, supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them. rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields. btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'. row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t. rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'. REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. rec_comp_status_t: An enum of the status bit values. rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
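A small sketch of how the REC_INFO_DEFAULT_ROW info_bits constant can be combined and tested. The numeric values of REC_INFO_MIN_REC_FLAG and REC_STATUS_COLUMNS_ADDED below are illustrative assumptions, and is_default_row() is a hypothetical predicate rather than an InnoDB function.

// Sketch of the 'default row' info bits: the record is flagged with
// REC_INFO_MIN_REC_FLAG and carries the REC_STATUS_COLUMNS_ADDED status.
// The numeric values below are illustrative, not the exact header bits.
#include <cassert>
#include <cstdint>

static const uint8_t REC_STATUS_ORDINARY      = 0;
static const uint8_t REC_STATUS_COLUMNS_ADDED = 4;
static const uint8_t REC_INFO_MIN_REC_FLAG    = 0x10;
static const uint8_t REC_INFO_DEFAULT_ROW =
    REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED;

// Hypothetical predicate: do these info bits mark the 'default row'?
static bool is_default_row(uint8_t info_bits)
{
    return info_bits == REC_INFO_DEFAULT_ROW;
}

int main()
{
    assert(is_default_row(REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED));
    assert(!is_default_row(REC_STATUS_ORDINARY));
    return 0;
}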
8 years ago
MDEV-20950 Reduce size of record offsets offset_t: a type that represents one record offset; it is an unsigned short int. many functions: replace ulint with offset_t. btr_pcur_restore_position_func(), page_validate(), row_ins_scan_sec_index_for_duplicate(), row_upd_clust_rec_by_insert_inherit_func(), row_vers_impl_x_locked_low(), trx_undo_prev_version_build(): allocate record offsets on the stack instead of waiting for rec_get_offsets() to allocate them from mem_heap_t, thus reducing memory allocations. RECORD_OFFSET, INDEX_OFFSET: it is now less convenient to store pointers in an offset_t* array, because one pointer now occupies several offset_t elements; these constants are the starting indexes in the array where the pointer values are stored. REC_OFFS_HEADER_SIZE: adjusted for the new reality. REC_OFFS_NORMAL_SIZE: increased from 100 to 300, which means fewer heap allocations; sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) is now 600 bytes, smaller than the previous 800 bytes. REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality. rem0rec.h, rem0rec.ic, rem0rec.cc: the types of various arguments, return values and local variables were changed to fix numerous integer conversion issues. enum field_type_t: an offset type concept was introduced, replacing the old offset flags; as in the earlier version, the 2 upper bits are used to store the offset type, and this enum represents those types. REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed. get_type(), set_type(), get_value(), combine(): convenience functions for working with offsets and their types. rec_offs_base()[0]: still uses the old scheme with the flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL. rec_offs_base()[i]: these now have type offset_t; the two upper bits contain the type.
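The description above maps fairly directly to code. The following sketch shows offset_t with its 2 upper type bits and get_type()/get_value()/combine()-style helpers, under the assumption of a 14-bit value part and illustrative field_type_t enumerator values.

// Sketch of offset_t with the 2 upper bits reserved for the offset type,
// modelled on the get_type()/get_value()/combine() helpers described above.
// The enumerator values are illustrative.
#include <cassert>
#include <cstdint>

typedef uint16_t offset_t;

enum field_type_t : uint16_t {
    STORED_OFFPAGE = 0 << 14,   // externally stored field
    SQL_NULL       = 1 << 14,   // SQL NULL value
    DEFAULT        = 2 << 14,   // instantly added column, value in 'default row'
    STORED_INPAGE  = 3 << 14    // ordinary field stored in the record
};

static const offset_t VALUE_MASK = 0x3fff;            // lower 14 bits
static const offset_t TYPE_MASK  = offset_t(~VALUE_MASK);

static field_type_t get_type(offset_t o)  { return field_type_t(o & TYPE_MASK); }
static offset_t     get_value(offset_t o) { return o & VALUE_MASK; }
static offset_t     combine(offset_t value, field_type_t type)
{
    assert(value <= VALUE_MASK);
    return offset_t(value | type);
}

int main()
{
    offset_t o = combine(120, SQL_NULL);
    assert(get_type(o) == SQL_NULL && get_value(o) == 120);
    return 0;
}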
6 years ago
MDEV-15662 Instant DROP COLUMN or changing the order of columns Allow ADD COLUMN anywhere in a table, not only adding as the last column. Allow instant DROP COLUMN and instant changing the order of columns. The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible. Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of mblob is indicated by setting the delete-mark flag in the metadata record. The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position. Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields. For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob. If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags. Limitations (future work): MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists MDEV-17468 Avoid table rebuild on operations on generated columns MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure. dict_col_t::DROPPED: Magic value for dict_col_t::ind. dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN". dict_col_t::is_added(): Rename from dict_col_t::is_instant(). 
dtype_t::metadata_blob_init(): Initialize the mblob data type. dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record. dict_table_t::instant: Metadata about dropped or reordered columns. dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called. dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well. dict_table_t::find(): Find an old column based on a new column number. dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob. dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns(). dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns(). dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position. ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.) instant_alter_column_possible(): Add a parameter for InnoDB table, to check for additional conditions, such as the maximum number of index fields. ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns. innobase_build_col_map(): Skip added virtual columns. prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later. commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try(). innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table. innodb_update_cols(): Renamed from innodb_update_n_cols(). innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error. innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col(). innobase_instant_add_col(): Replace the parameter dfield with type. innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL. innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols(). innobase_rename_column_try(): Skip dropped columns. commit_cache_norebuild(): Update table->fts->doc_col. dict_mem_table_col_rename_low(): Skip dropped columns. trx_undo_rec_get_partial_row(): Skip dropped columns. trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly. 
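Purely as an illustration of what serialise_columns() and deserialise_columns() convert, here is a minimal sketch that serializes the per-field information the metadata BLOB is described to hold: the field count, then for a dropped column its NOT NULL flag, fixed length and a "long variable-length" flag, and for a surviving column its position. The byte layout and the serialise_fields() helper are assumptions for illustration, not the actual mblob format.

// Minimal sketch (assumed layout) of serializing per-field metadata.
#include <cstdint>
#include <vector>

struct field_meta {
    bool     dropped;
    bool     not_null;    // only meaningful for dropped columns
    uint16_t fixed_len;   // 0 for variable-length dropped columns
    bool     len_gt_255;  // variable-length dropped column exceeded 255 bytes
    uint16_t col_pos;     // position of a surviving column
};

// Hypothetical encoder; dict_table_t::serialise_columns() defines the real format.
static std::vector<uint8_t> serialise_fields(const std::vector<field_meta>& fields)
{
    std::vector<uint8_t> blob;
    blob.push_back(uint8_t(fields.size() >> 8));    // number of index fields
    blob.push_back(uint8_t(fields.size()));
    for (const field_meta& f : fields) {
        if (f.dropped) {
            uint8_t flags = 0x80;                   // marker: dropped column
            if (f.not_null)   flags |= 0x01;
            if (f.len_gt_255) flags |= 0x02;
            blob.push_back(flags);
            blob.push_back(uint8_t(f.fixed_len >> 8));
            blob.push_back(uint8_t(f.fixed_len));
        } else {
            blob.push_back(0x00);                   // marker: surviving column
            blob.push_back(uint8_t(f.col_pos >> 8));
            blob.push_back(uint8_t(f.col_pos));
        }
    }
    return blob;
}

int main()
{
    std::vector<field_meta> fields = {
        {false, false, 0, false, 2},    // surviving column at position 2
        {true,  true,  4, false, 0}     // dropped NOT NULL column, 4 bytes fixed
    };
    return serialise_fields(fields).size() == 8 ? 0 : 1;
}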
trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as delete-mark and an insert. row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata. row_undo_mod_clust_low(): On metadata rollback, roll back the root page too. row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records. row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback). row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column. row_upd_clust_rec_by_insert_inherit_func(): On the update of PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value. btr_index_rec_validate(): Recognize the metadata record. btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata. btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to generic ALTER TABLE format. btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it. btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains. btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways. btr_cur_trim(): Handle the mblob record. row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record. btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records. btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size. btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE. REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only. REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns. 
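A minimal sketch of the dummy-value rule described for row_build_index_entry_low(): nullable dropped columns become NULL, variable-length NOT NULL columns become the empty string, and fixed-length NOT NULL columns become a run of NUL bytes. The types and make_dummy() are hypothetical.

// Sketch of generating a dummy value for an instantly dropped column.
#include <cstdint>
#include <string>

struct dropped_col {
    bool     nullable;
    uint16_t fixed_len;   // 0 for variable-length columns
};

struct dummy_value {
    bool        is_null;
    std::string data;     // empty string or run of NUL bytes
};

static dummy_value make_dummy(const dropped_col& col)
{
    if (col.nullable)       return {true, {}};                // NULL
    if (col.fixed_len == 0) return {false, {}};                // empty string
    return {false, std::string(col.fixed_len, '\0')};          // NUL padding
}

int main()
{
    dummy_value v = make_dummy({false, 8});   // e.g. a dropped NOT NULL BIGINT
    return (!v.is_null && v.data.size() == 8) ? 0 : 1;
}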
rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record. rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold. rec_is_add_metadata(): Check if a record is MDEV-11369 ADD COLUMN metadata record (not more generic instant ALTER TABLE). rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet. rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple. rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. With mblob=true, process a record with a metadata BLOB. rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field. rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record. row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple. row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted. row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column. Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this. Thanks to Matthias Leich for testing.
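The distinction between the two metadata record flavours can be sketched in terms of info bits alone, mirroring rec_is_add_metadata() and rec_is_alter_metadata(): both carry REC_INFO_MIN_REC_FLAG, and only the generic ALTER TABLE metadata record additionally carries the repurposed delete-mark flag. The bit values are illustrative, and the status bits are ignored here for brevity, which is also why such a check must not be applied to node pointer records.

// Sketch of telling the two metadata record flavours apart by info bits.
#include <cassert>
#include <cstdint>

static const uint8_t REC_INFO_MIN_REC_FLAG = 0x10;
static const uint8_t REC_INFO_DELETED_FLAG = 0x20;

static bool is_add_metadata(uint8_t info_bits)
{
    return (info_bits & (REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG))
        == REC_INFO_MIN_REC_FLAG;
}

static bool is_alter_metadata(uint8_t info_bits)
{
    return (info_bits & (REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG))
        == (REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG);
}

int main()
{
    assert(is_add_metadata(REC_INFO_MIN_REC_FLAG));
    assert(is_alter_metadata(REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG));
    assert(!is_alter_metadata(REC_INFO_MIN_REC_FLAG));
    return 0;
}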
7 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. 
This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. 
Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. 
This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support the infimum and supremum records; they are never supposed to be stored in a dtuple_t, because page creation nowadays uses a lower-level method for initializing them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and the adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
8 years ago
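The PAGE_INSTANT packing described in the message above can be sketched in a few lines of C++. This is an illustrative layout, assuming the 16-bit word keeps the 3 page-direction bits in its least significant positions and stores the 13-bit n_core_fields value in the remaining upper bits; the helper names only loosely follow page_get_instant()/page_set_instant() and are not the actual accessors.

#include <cstdint>
#include <cassert>

// One 16-bit header word: upper 13 bits = PAGE_INSTANT (n_core_fields),
// lower 3 bits = page direction (the only bits of PAGE_DIRECTION in use).
constexpr uint16_t DIRECTION_MASK = 0x0007;

inline uint16_t get_instant(uint16_t word)   { return word >> 3; }
inline uint16_t get_direction(uint16_t word) { return word & DIRECTION_MASK; }

inline uint16_t set_instant(uint16_t word, uint16_t n_core_fields) {
  assert(n_core_fields < (1u << 13));  // PAGE_INSTANT is a 13-bit field
  return uint16_t((n_core_fields << 3) | (word & DIRECTION_MASK));
}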
MDEV-20950 Reduce size of record offsets
offset_t: a new type that represents one record offset; it is an unsigned short int.
A lot of functions: replace ulint with offset_t.
btr_pcur_restore_position_func(), page_validate(), row_ins_scan_sec_index_for_duplicate(), row_upd_clust_rec_by_insert_inherit_func(), row_vers_impl_x_locked_low(), trx_undo_prev_version_build(): allocate record offsets on the stack instead of waiting for rec_get_offsets() to allocate them from a mem_heap_t, thus reducing memory allocations.
RECORD_OFFSET, INDEX_OFFSET: it is now less convenient to store pointers in an offset_t* array, because one pointer now occupies several offset_t slots. These constants are the start indexes into the array at which the pointer values are stored.
REC_OFFS_HEADER_SIZE: adjusted for the new layout.
REC_OFFS_NORMAL_SIZE: increased from 100 to 300, which means fewer heap allocations. sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) is now 600 bytes, smaller than the previous 800 bytes.
REC_OFFS_SEC_INDEX_SIZE: adjusted for the new layout.
rem0rec.h, rem0rec.ic, rem0rec.cc: the types of various arguments, return values, and local variables were changed to fix numerous integer conversion issues.
enum field_type_t: an offset type concept was introduced to replace the old offset flags. As in the earlier version, the 2 upper bits are used to store the offset type, and this enum represents those types.
REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed.
get_type(), set_type(), get_value(), combine(): convenience functions for working with offsets and their types.
rec_offs_base()[0]: still uses the old scheme with the flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL.
rec_offs_base()[i]: these now have type offset_t; the two upper bits contain the type.
6 years ago
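A minimal sketch of the offset encoding described in the MDEV-20950 message above, assuming offset_t is a 16-bit unsigned integer whose two upper bits carry the offset type; the enum values and helpers mirror the get_type()/set_type()/get_value()/combine() idea but are simplified illustrations, not the rem0rec definitions.

#include <cstdint>

typedef uint16_t offset_t;   // one record offset

// Offset types kept in the two upper bits of each offset_t (values are illustrative).
enum field_type_t : uint16_t {
  STORED_OFFPAGE = 0u << 14,   // externally stored field
  SQL_NULL       = 1u << 14,   // SQL NULL field
  DEFAULT        = 2u << 14,   // instantly added column, value comes from the 'default row'
  NORMAL         = 3u << 14    // ordinary stored field
};

constexpr offset_t TYPE_MASK  = offset_t(3u << 14);
constexpr offset_t VALUE_MASK = offset_t(~(3u << 14));

inline field_type_t get_type(offset_t o)  { return field_type_t(o & TYPE_MASK); }
inline offset_t     get_value(offset_t o) { return offset_t(o & VALUE_MASK); }
inline offset_t     combine(offset_t value, field_type_t type) {
  return offset_t(get_value(value) | type);
}
inline void         set_type(offset_t& o, field_type_t type) { o = combine(o, type); }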
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC
Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT will eventually cause row_purge_reset_trx_id() to be invoked, unless DROP TABLE is invoked soon enough.]
The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present in the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0.
The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge.
The InnoDB redo log format is changed in two ways: we remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table.
This also changes the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible to exchange data files with earlier versions of MariaDB via import and export. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0.
When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, a separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK.
Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only.
trx_undo_ptr_t::undo: Renamed from update_undo.
trx_undo_ptr_t::old_insert: Renamed from insert_undo.
trx_rseg_t::undo_list: Renamed from update_undo_list.
trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached.
trx_rseg_t::old_insert_list: Renamed from insert_undo_list.
row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record.
trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log.
ReadView::changes_visible(): Allow id==0 and return true for it; this is what speeds up MVCC.
row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0.
Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type.
MLOG_UNDO_HDR_REUSE: Remove. This changes the redo log format!
innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions.
The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following:
./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680
grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
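To make the DB_TRX_ID=0 fast path concrete, here is a hedged C++ sketch of the idea behind purge resetting the system columns and ReadView::changes_visible() treating a zero transaction ID as always visible. The struct, its member names, and the sorted 'active' list are simplified assumptions, not the actual ReadView interface; the commit message only states that DB_TRX_ID is reset to 0 and DB_ROLL_PTR to 1<<55 to mark a fresh insert.

#include <cstdint>
#include <algorithm>
#include <vector>

typedef uint64_t trx_id_t;

// Values written by purge when the history is removed: DB_TRX_ID becomes 0 and
// DB_ROLL_PTR becomes 1<<55 (the 'insert' flag bit of the 7-byte roll pointer).
constexpr trx_id_t RESET_TRX_ID   = 0;
constexpr uint64_t RESET_ROLL_PTR = 1ULL << 55;

// Simplified stand-in for a read view.
struct read_view {
  trx_id_t low_limit;             // ids at or above this had not started at view creation
  trx_id_t up_limit;              // ids below this had committed before view creation
  std::vector<trx_id_t> active;   // ids active at view creation, sorted ascending

  // Sketch of the changes_visible() idea: DB_TRX_ID == 0 means the history was
  // purged, so the record version is visible without any further lookup.
  bool changes_visible(trx_id_t id) const {
    if (id == RESET_TRX_ID || id < up_limit) {
      return true;                // fast path: no transaction list lookup needed
    }
    if (id >= low_limit) {
      return false;
    }
    return !std::binary_search(active.begin(), active.end(), id);
  }
};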
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. 
This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. 
This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. 
This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. 
This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. 
This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. 
This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. 
This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. 
This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
8 years ago
MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed. [TODO: It appears that the resetting is not taking place as often as it could be. We should test that a simple INSERT should eventually cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is invoked soon enough.] The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR are used by multi-versioning. After the history is no longer needed, these columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert). When a reader sees 0 in the DB_TRX_ID column, it can instantly determine that the record is present the read view. There is no need to acquire the transaction system mutex to check if the transaction exists, because writes can never be conducted by a transaction whose ID is 0. The persistent InnoDB undo log used to be split into two parts: insert_undo and update_undo. The insert_undo log was discarded at transaction commit or rollback, and the update_undo log was processed by the purge subsystem. As part of this change, we will only generate a single undo log for new transactions, and the purge subsystem will reset the DB_TRX_ID whenever a clustered index record is touched. That is, all persistent undo log will be preserved at transaction commit or rollback, to be removed by purge. The InnoDB redo log format is changed in two ways: We remove the redo log record type MLOG_UNDO_HDR_REUSE, and we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table. This is also changing the format of persistent InnoDB data files: undo log and clustered index leaf page records. It will still be possible via import and export to exchange data files with earlier versions of MariaDB. The change to clustered index leaf page records is simple: we allow DB_TRX_ID to be 0. When it comes to the undo log, we must be able to upgrade from earlier MariaDB versions after a clean shutdown (no redo log to apply). While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0) before an upgrade, to empty the undo logs, we cannot assume that this has been done. So, separate insert_undo log may exist for recovered uncommitted transactions. These transactions may be automatically rolled back, or they may be in XA PREPARE state, in which case InnoDB will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK. Upgrade has been tested by starting up MariaDB 10.2 with ./mysql-test-run --manual-gdb innodb.read_only_recovery and then starting up this patched server with and without --innodb-read-only. trx_undo_ptr_t::undo: Renamed from update_undo. trx_undo_ptr_t::old_insert: Renamed from insert_undo. trx_rseg_t::undo_list: Renamed from update_undo_list. trx_rseg_t::undo_cached: Merged from update_undo_cached and insert_undo_cached. trx_rseg_t::old_insert_list: Renamed from insert_undo_list. row_purge_reset_trx_id(): New function to reset the columns. This will be called for all undo processing in purge that does not remove the clustered index record. trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the old DB_TRX_ID of the record to the undo log. ReadView::changes_visible(): Allow id==0. (Return true for it. This is what speeds up the MVCC.) row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read(): Implement a fast path for DB_TRX_ID=0. Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type. MLOG_UNDO_HDR_REUSE: Remove. 
This changes the redo log format! innobase_start_or_create_for_mysql(): Set srv_undo_sources before starting any transactions. The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully tested by running the following: ./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680 grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
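The main payoff of the DB_TRX_ID=0 convention is in the read view check. Below is a minimal sketch (not the server's ReadView class; the member names and struct layout are simplified assumptions) of the fast path that becomes possible: a record whose DB_TRX_ID was reset by purge is visible to every reader without consulting any transaction state.

#include <stdint.h>
#include <set>

struct read_view_sketch {
        uint64_t up_limit_id;          /* trx ids below this were committed before the view */
        uint64_t low_limit_id;         /* trx ids at or above this had not started yet */
        std::set<uint64_t> active_ids; /* transactions active when the view was created */

        bool changes_visible(uint64_t trx_id) const {
                if (trx_id == 0) {
                        /* Reset by purge: the history is gone, so this record
                        version is visible to everyone. No mutex, no lookup. */
                        return true;
                }
                if (trx_id < up_limit_id) {
                        return true;
                }
                if (trx_id >= low_limit_id) {
                        return false;
                }
                return active_ids.find(trx_id) == active_ids.end();
        }
};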
8 years ago
MDEV-20950 Reduce size of record offsets offset_t: this is a type which represents one record offset. It is an unsigned short int. A lot of functions: replace ulint with offset_t. btr_pcur_restore_position_func(), page_validate(), row_ins_scan_sec_index_for_duplicate(), row_upd_clust_rec_by_insert_inherit_func(), row_vers_impl_x_locked_low(), trx_undo_prev_version_build(): allocate record offsets on the stack instead of waiting for rec_get_offsets() to allocate them from mem_heap_t, thus reducing memory allocations. RECORD_OFFSET, INDEX_OFFSET: it is now less convenient to store pointers in an offset_t* array; one pointer now occupies several offset_t elements, and those constants are the start indexes into the array at which the pointer values are stored. REC_OFFS_HEADER_SIZE: adjusted for the new reality. REC_OFFS_NORMAL_SIZE: increase size from 100 to 300, which means fewer heap allocations; sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) is now 600 bytes, smaller than the previous 800 bytes. REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality. rem0rec.h, rem0rec.ic, rem0rec.cc: various argument, return value and local variable types were changed to fix numerous integer conversion issues. enum field_type_t: an offset type concept was introduced, replacing the old offset flags. As in the earlier version, the 2 upper bits are used to store the offset type, and this enum represents those types. REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed. get_type(), set_type(), get_value(), combine(): convenience functions to work with offsets and their types. rec_offs_base()[0]: still uses the old scheme with the flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL. rec_offs_base()[i]: these now have type offset_t; the two upper bits contain the type.
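As a concrete illustration of the offset_t packing described above, here is a minimal sketch, under the stated assumption that the two most significant bits of a 16-bit offset carry its type and the remaining bits carry the value. The helper names mirror the ones mentioned in the message (get_type(), set_type(), get_value(), combine()), but the enumerator names and bodies here are illustrative only, not the rem0rec.h definitions.

#include <stdint.h>

typedef uint16_t offset_t;

enum field_type_t : uint16_t {
        STORED_IN_RECORD = 0 << 14,
        STORED_OFFPAGE   = 1 << 14,  /* externally stored column */
        SQL_NULL_VALUE   = 2 << 14,  /* enumerator names are assumptions */
        DEFAULT_VALUE    = 3 << 14
};

static const offset_t TYPE_MASK  = 0xC000; /* two most significant bits */
static const offset_t VALUE_MASK = 0x3FFF; /* remaining 14 bits: the offset */

inline field_type_t get_type(offset_t n) { return field_type_t(n & TYPE_MASK); }
inline offset_t get_value(offset_t n) { return offset_t(n & VALUE_MASK); }
inline offset_t combine(offset_t value, field_type_t type) { return offset_t(get_value(value) | type); }
inline void set_type(offset_t& n, field_type_t type) { n = combine(n, type); }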
12 years ago
MDEV-6076 Persistent AUTO_INCREMENT for InnoDB This should be functionally equivalent to WL#6204 in MySQL 8.0.0, with the notable difference that the file format changes are limited to repurposing a previously unused data field in B-tree pages. For persistent InnoDB tables, write the last used AUTO_INCREMENT value to the root page of the clustered index, in the previously unused (0) PAGE_MAX_TRX_ID field, now aliased as PAGE_ROOT_AUTO_INC. Unlike some other previously unused InnoDB data fields, this one was actually always zero-initialized, at least since MySQL 3.23.49. The writes to PAGE_ROOT_AUTO_INC are protected by SX or X latch on the root page. The SX latch will allow concurrent read access to the root page. (The field PAGE_ROOT_AUTO_INC will only be read on the first-time call to ha_innobase::open() from the SQL layer. The PAGE_ROOT_AUTO_INC can only be updated when executing SQL, so read/write races are not possible.) During INSERT, the PAGE_ROOT_AUTO_INC is updated by the low-level function btr_cur_search_to_nth_level(), adding no extra page access. [Adaptive hash index lookup will be disabled during INSERT.] If some rare UPDATE modifies an AUTO_INCREMENT column, the PAGE_ROOT_AUTO_INC will be adjusted in a separate mini-transaction in ha_innobase::update_row(). When a page is reorganized, we have to preserve the PAGE_ROOT_AUTO_INC field. During ALTER TABLE, the initial AUTO_INCREMENT value will be copied from the table. ALGORITHM=COPY and online log apply in LOCK=NONE will update PAGE_ROOT_AUTO_INC in real time. innodb_col_no(): Determine the dict_table_t::cols[] element index corresponding to a Field of a non-virtual column. (The MySQL 5.7 implementation of virtual columns breaks the 1:1 relationship between Field::field_index and dict_table_t::cols[]. Virtual columns are omitted from dict_table_t::cols[]. Therefore, we must translate the field_index of AUTO_INCREMENT columns into an index of dict_table_t::cols[].) Upgrade from old data files: By default, the AUTO_INCREMENT sequence in old data files would appear to be reset, because PAGE_MAX_TRX_ID or PAGE_ROOT_AUTO_INC would contain the value 0 in each clustered index page. In new data files, PAGE_ROOT_AUTO_INC can only be 0 if the table is empty or does not contain any AUTO_INCREMENT column. For backward compatibility, we use the old method of SELECT MAX(auto_increment_column) for initializing the sequence. btr_read_autoinc(): Read the AUTO_INCREMENT sequence from a new-format data file. btr_read_autoinc_with_fallback(): A variant of btr_read_autoinc() that will resort to reading MAX(auto_increment_column) for data files that did not use AUTO_INCREMENT yet. It was manually tested that during the execution of innodb.autoinc_persist the compatibility logic is not activated (for new files, PAGE_ROOT_AUTO_INC is never 0 in nonempty clustered index root pages). initialize_auto_increment(): Replaces ha_innobase::innobase_initialize_autoinc(). This initializes the AUTO_INCREMENT metadata. Only called from ha_innobase::open(). ha_innobase::info_low(): Do not try to lazily initialize dict_table_t::autoinc. It must already have been initialized by ha_innobase::open() or ha_innobase::create(). Note: The adjustments to class ha_innopart were not tested, because the source code (native InnoDB partitioning) is not being compiled.
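To make the storage format above concrete, here is a minimal sketch, assuming simplified offset constants, of reading the persisted value back from a clustered index root page frame: PAGE_ROOT_AUTO_INC aliases the 8-byte big-endian PAGE_MAX_TRX_ID field in the page header, and 0 means the value was never set (empty table, no AUTO_INCREMENT column, or a pre-upgrade data file), in which case the server falls back to SELECT MAX(auto_increment_column).

#include <stdint.h>
#include <stddef.h>

/* Offsets are assumptions for this sketch, not definitions from fil0fil.h/page0page.h. */
static const size_t PAGE_HEADER_OFS        = 38;                    /* page header starts after the FIL header */
static const size_t PAGE_ROOT_AUTO_INC_OFS = PAGE_HEADER_OFS + 18;  /* aliases PAGE_MAX_TRX_ID */

inline uint64_t read_root_autoinc(const unsigned char* root_page_frame)
{
        const unsigned char* p = root_page_frame + PAGE_ROOT_AUTO_INC_OFS;
        uint64_t v = 0;
        for (int i = 0; i < 8; i++) {
                v = (v << 8) | p[i];   /* 8-byte big-endian on-page integer */
        }
        return v;                      /* 0 = not set; fall back to MAX(col) */
}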
9 years ago
MDEV-12026: Implement innodb_checksum_algorithm=full_crc32 MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages. Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums. We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, no matter what the parameter innodb_checksum_algorithm is assigned to. For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32. ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them. The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page. In the full_crc32 format, there no longer are separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that is written to the file. We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644. This is joint work with Marko Mäkelä.
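In the full_crc32 format described above, page validation reduces to one computation: a CRC-32C over every byte of the page except the last four, compared with the value stored in those last four bytes. The sketch below shows that check with a plain bit-by-bit CRC-32C (polynomial 0x82F63B78); the server's ut0crc32 code is a hardware-accelerated equivalent, so this is an illustration of the format, not the implementation.

#include <stdint.h>
#include <stddef.h>

/* Generic reflected CRC-32C, bit by bit; slow but self-contained. */
inline uint32_t crc32c_sketch(const unsigned char* data, size_t len)
{
        uint32_t crc = 0xFFFFFFFF;
        for (size_t i = 0; i < len; i++) {
                crc ^= data[i];
                for (int bit = 0; bit < 8; bit++) {
                        crc = (crc >> 1) ^ (0x82F63B78 & (0 - (crc & 1)));
                }
        }
        return crc ^ 0xFFFFFFFF;
}

inline bool full_crc32_page_is_corrupted(const unsigned char* page, size_t physical_size)
{
        const unsigned char* tail = page + physical_size - 4;
        uint32_t stored = uint32_t(tail[0]) << 24 | uint32_t(tail[1]) << 16
                        | uint32_t(tail[2]) << 8 | uint32_t(tail[3]);
        return stored != crc32c_sketch(page, physical_size - 4);
}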
7 years ago
MDEV-17958 Make bug-endian innodb_checksum_algorithm=crc32 optional In MySQL 5.7, it was noticed that files are not portable between big-endian and little-endian processor architectures (such as SPARC and x86), because the original implementation of innodb_checksum_algorithm=crc32 was not byte order agnostic. A byte order agnostic implementation of innodb_checksum_algorithm=crc32 was only added to MySQL 5.7, not backported to 5.6. Consequently, MariaDB Server versions 10.0 and 10.1 only contain the CRC-32C implementation that works incorrectly on big-endian architectures, and MariaDB Server 10.2.2 got the byte-order agnostic CRC-32C implementation from MySQL 5.7. MySQL 5.7 introduced a "legacy crc32" variant that is functionally equivalent to the big-endian version of the original crc32 implementation. Thanks to this variant, old data files can be transferred from big-endian systems to newer versions. Introducing new variants of checksum algorithms (without introducing new names for them, or something on the pages themselves to identify the algorithm) generally is a bad idea, because each checksum algorithm is like a lottery ticket. The more algorithms you try, the more likely it will be for the checksum to match on a corrupted page. So, essentially MySQL 5.7 weakened innodb_checksum_algorithm=crc32, and MariaDB 10.2.2 inherited this weakening. We introduce a build option that together with MDEV-17957 makes innodb_checksum_algorithm=strict_crc32 strict again by only allowing one variant of the checksum to match. WITH_INNODB_BUG_ENDIAN_CRC32: A new cmake option for enabling the bug-compatible "legacy crc32" checksum. This is only enabled on big-endian systems by default, to facilitate an upgrade from MariaDB 10.0 or 10.1. Checked by #ifdef INNODB_BUG_ENDIAN_CRC32. ut_crc32_byte_by_byte: Remove (unused function). legacy_big_endian_checksum: Remove. This variable seems to have unnecessarily complicated the logic. When the weakening is enabled, we must always fall back to the buggy checksum. buf_page_check_crc32(): A helper function to compute one or two CRC-32C variants.
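A minimal sketch of the acceptance logic being discussed: if the stored checksum is compared against more than one computed variant, every additional variant is another chance for a corrupted page to be accepted, which is why the strict setting should admit exactly one. How each variant is computed (the byte-order-agnostic CRC-32C versus the bug-compatible legacy one) is left outside this sketch and passed in as already-computed values.

#include <stdint.h>

inline bool crc32_checksum_accepted(uint32_t stored,
                                    uint32_t crc32c_portable,
                                    bool allow_legacy_variant,
                                    uint32_t crc32c_legacy_big_endian)
{
        if (stored == crc32c_portable) {
                return true;
        }
        /* Accepting a second variant weakens the check; only do it for
        data files that may have been written by the old big-endian code. */
        return allow_legacy_variant && stored == crc32c_legacy_big_endian;
}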
/*****************************************************************************
Copyright (c) 2005, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2012, Facebook Inc.
Copyright (c) 2014, 2019, MariaDB Corporation.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA
*****************************************************************************/
/**************************************************//**
@file page/page0zip.cc
Compressed page interface
Created June 2005 by Marko Makela
*******************************************************/
#include "page0zip.h"
#include "fsp0types.h"
#include "page0page.h"
#include "buf0checksum.h"
#include "ut0crc32.h"
#include "zlib.h"
#ifndef UNIV_INNOCHECKSUM
/** A BLOB field reference full of zero, for use in assertions and tests.
Initially, BLOB field references are set to zero, in
dtuple_convert_big_rec(). */
const byte field_ref_zero[UNIV_PAGE_SIZE_MAX] = { 0, };
#include "mtr0log.h"
#include "dict0dict.h"
#include "btr0cur.h"
#include "log0recv.h"
#include "row0row.h"
#include "btr0sea.h"
#include "dict0boot.h"
#include "lock0lock.h"
#include "srv0srv.h"
#include "buf0lru.h"
#include "srv0mon.h"
#include <map>
#include <algorithm>
/** Statistics on compression, indexed by page_zip_des_t::ssize - 1 */
page_zip_stat_t page_zip_stat[PAGE_ZIP_SSIZE_MAX];
/** Statistics on compression, indexed by index->id */
page_zip_stat_per_index_t page_zip_stat_per_index;
/** Compression level to be used by zlib. Settable by user. */
uint page_zip_level;
/** Whether or not to log compressed page images to avoid possible
compression algorithm changes in zlib. */
my_bool page_zip_log_pages;
/* Please refer to ../include/page0zip.ic for a description of the
compressed page format. */
/* The infimum and supremum records are omitted from the compressed page.
On compress, we compare that the records are there, and on uncompress we
restore the records. */
/** Extra bytes of an infimum record */
static const byte infimum_extra[] = {
0x01, /* info_bits=0, n_owned=1 */
0x00, 0x02 /* heap_no=0, status=2 */
/* ?, ? */ /* next=(first user rec, or supremum) */
};
/** Data bytes of an infimum record */
static const byte infimum_data[] = {
0x69, 0x6e, 0x66, 0x69,
0x6d, 0x75, 0x6d, 0x00 /* "infimum\0" */
};
/** Extra bytes and data bytes of a supremum record */
static const byte supremum_extra_data alignas(4) [] = {
/* 0x0?, */ /* info_bits=0, n_owned=1..8 */
0x00, 0x0b, /* heap_no=1, status=3 */
0x00, 0x00, /* next=0 */
0x73, 0x75, 0x70, 0x72,
0x65, 0x6d, 0x75, 0x6d /* "supremum" */
};
/** Assert that a block of memory is filled with zero bytes.
Compare at most sizeof(field_ref_zero) bytes.
@param b in: memory block
@param s in: size of the memory block, in bytes */
#define ASSERT_ZERO(b, s) \
ut_ad(!memcmp(b, field_ref_zero, \
std::min<size_t>(s, sizeof field_ref_zero)));
/** Assert that a BLOB pointer is filled with zero bytes.
@param b in: BLOB pointer */
#define ASSERT_ZERO_BLOB(b) \
ut_ad(!memcmp(b, field_ref_zero, FIELD_REF_SIZE))
/* Enable some extra debugging output. This code can be enabled
independently of any UNIV_ debugging conditions. */
#if defined UNIV_DEBUG || defined UNIV_ZIP_DEBUG
# include <stdarg.h>
MY_ATTRIBUTE((format (printf, 1, 2)))
/**********************************************************************//**
Report a failure to decompress or compress.
@return number of characters printed */
static
int
page_zip_fail_func(
/*===============*/
const char* fmt, /*!< in: printf(3) format string */
...) /*!< in: arguments corresponding to fmt */
{
int res;
va_list ap;
ut_print_timestamp(stderr);
fputs(" InnoDB: ", stderr);
va_start(ap, fmt);
res = vfprintf(stderr, fmt, ap);
va_end(ap);
return(res);
}
/** Wrapper for page_zip_fail_func()
@param fmt_args in: printf(3) format string and arguments */
# define page_zip_fail(fmt_args) page_zip_fail_func fmt_args
#else /* UNIV_DEBUG || UNIV_ZIP_DEBUG */
/** Dummy wrapper for page_zip_fail_func()
@param fmt_args ignored: printf(3) format string and arguments */
# define page_zip_fail(fmt_args) /* empty */
#endif /* UNIV_DEBUG || UNIV_ZIP_DEBUG */
/**********************************************************************//**
Determine the guaranteed free space on an empty page.
@return minimum payload size on the page */
ulint
page_zip_empty_size(
/*================*/
ulint n_fields, /*!< in: number of columns in the index */
ulint zip_size) /*!< in: compressed page size in bytes */
{
ulint size = zip_size
/* subtract the page header and the longest
uncompressed data needed for one record */
- (PAGE_DATA
+ PAGE_ZIP_CLUST_LEAF_SLOT_SIZE
+ 1/* encoded heap_no==2 in page_zip_write_rec() */
+ 1/* end of modification log */
- REC_N_NEW_EXTRA_BYTES/* omitted bytes */)
/* subtract the space for page_zip_fields_encode() */
- compressBound(static_cast<uLong>(2 * (n_fields + 1)));
return(lint(size) > 0 ? size : 0);
}
/** Check whether a tuple is too big for compressed table
@param[in] index dict index object
@param[in] entry entry for the index
@return true if it's too big, otherwise false */
bool
page_zip_is_too_big(
const dict_index_t* index,
const dtuple_t* entry)
{
const ulint zip_size = index->table->space->zip_size();
/* Estimate the free space of an empty compressed page.
Subtract one byte for the encoded heap_no in the
modification log. */
ulint free_space_zip = page_zip_empty_size(
index->n_fields, zip_size);
ulint n_uniq = dict_index_get_n_unique_in_tree(index);
ut_ad(dict_table_is_comp(index->table));
ut_ad(zip_size);
if (free_space_zip == 0) {
return(true);
}
/* Subtract one byte for the encoded heap_no in the
modification log. */
free_space_zip--;
/* There should be enough room for two node pointer
records on an empty non-leaf page. This prevents
infinite page splits. */
if (entry->n_fields >= n_uniq
&& (REC_NODE_PTR_SIZE
+ rec_get_converted_size_comp_prefix(
index, entry->fields, n_uniq, NULL)
/* On a compressed page, there is
a two-byte entry in the dense
page directory for every record.
But there is no record header. */
- (REC_N_NEW_EXTRA_BYTES - 2)
> free_space_zip / 2)) {
return(true);
}
return(false);
}
/*************************************************************//**
Gets the number of elements in the dense page directory,
including deleted records (the free list).
@return number of elements in the dense page directory */
UNIV_INLINE
ulint
page_zip_dir_elems(
/*===============*/
const page_zip_des_t* page_zip) /*!< in: compressed page */
{
/* Exclude the page infimum and supremum from the record count. */
return ulint(page_dir_get_n_heap(page_zip->data))
- PAGE_HEAP_NO_USER_LOW;
}
/*************************************************************//**
Gets the size of the compressed page trailer (the dense page directory),
including deleted records (the free list).
@return length of dense page directory, in bytes */
UNIV_INLINE
ulint
page_zip_dir_size(
/*==============*/
const page_zip_des_t* page_zip) /*!< in: compressed page */
{
return(PAGE_ZIP_DIR_SLOT_SIZE * page_zip_dir_elems(page_zip));
}
/*************************************************************//**
Gets an offset to the compressed page trailer (the dense page directory),
including deleted records (the free list).
@return offset of the dense page directory */
UNIV_INLINE
ulint
page_zip_dir_start_offs(
/*====================*/
const page_zip_des_t* page_zip, /*!< in: compressed page */
ulint n_dense) /*!< in: directory size */
{
ut_ad(n_dense * PAGE_ZIP_DIR_SLOT_SIZE < page_zip_get_size(page_zip));
return(page_zip_get_size(page_zip) - n_dense * PAGE_ZIP_DIR_SLOT_SIZE);
}
/*************************************************************//**
Gets a pointer to the compressed page trailer (the dense page directory),
including deleted records (the free list).
@param[in] page_zip compressed page
@param[in] n_dense number of entries in the directory
@return pointer to the dense page directory */
#define page_zip_dir_start_low(page_zip, n_dense) \
((page_zip)->data + page_zip_dir_start_offs(page_zip, n_dense))
/*************************************************************//**
Gets a pointer to the compressed page trailer (the dense page directory),
including deleted records (the free list).
@param[in] page_zip compressed page
@return pointer to the dense page directory */
#define page_zip_dir_start(page_zip) \
page_zip_dir_start_low(page_zip, page_zip_dir_elems(page_zip))
/*************************************************************//**
Gets the size of the compressed page trailer (the dense page directory),
only including user records (excluding the free list).
@return length of dense page directory comprising existing records, in bytes */
UNIV_INLINE
ulint
page_zip_dir_user_size(
/*===================*/
const page_zip_des_t* page_zip) /*!< in: compressed page */
{
ulint size = PAGE_ZIP_DIR_SLOT_SIZE
* ulint(page_get_n_recs(page_zip->data));
ut_ad(size <= page_zip_dir_size(page_zip));
return(size);
}
/*************************************************************//**
Find the slot of the given record in the dense page directory.
@return dense directory slot, or NULL if record not found */
UNIV_INLINE
byte*
page_zip_dir_find_low(
/*==================*/
byte* slot, /*!< in: start of records */
byte* end, /*!< in: end of records */
ulint offset) /*!< in: offset of user record */
{
ut_ad(slot <= end);
for (; slot < end; slot += PAGE_ZIP_DIR_SLOT_SIZE) {
if ((mach_read_from_2(slot) & PAGE_ZIP_DIR_SLOT_MASK)
== offset) {
return(slot);
}
}
return(NULL);
}
/*************************************************************//**
Find the slot of the given non-free record in the dense page directory.
@return dense directory slot, or NULL if record not found */
UNIV_INLINE
byte*
page_zip_dir_find(
/*==============*/
page_zip_des_t* page_zip, /*!< in: compressed page */
ulint offset) /*!< in: offset of user record */
{
byte* end = page_zip->data + page_zip_get_size(page_zip);
ut_ad(page_zip_simple_validate(page_zip));
return(page_zip_dir_find_low(end - page_zip_dir_user_size(page_zip),
end,
offset));
}
/*************************************************************//**
Find the slot of the given free record in the dense page directory.
@return dense directory slot, or NULL if record not found */
UNIV_INLINE
byte*
page_zip_dir_find_free(
/*===================*/
page_zip_des_t* page_zip, /*!< in: compressed page */
ulint offset) /*!< in: offset of user record */
{
byte* end = page_zip->data + page_zip_get_size(page_zip);
ut_ad(page_zip_simple_validate(page_zip));
return(page_zip_dir_find_low(end - page_zip_dir_size(page_zip),
end - page_zip_dir_user_size(page_zip),
offset));
}
/*************************************************************//**
Read a given slot in the dense page directory.
@return record offset on the uncompressed page, possibly ORed with
PAGE_ZIP_DIR_SLOT_DEL or PAGE_ZIP_DIR_SLOT_OWNED */
UNIV_INLINE
ulint
page_zip_dir_get(
/*=============*/
const page_zip_des_t* page_zip, /*!< in: compressed page */
ulint slot) /*!< in: slot
(0=first user record) */
{
ut_ad(page_zip_simple_validate(page_zip));
ut_ad(slot < page_zip_dir_size(page_zip) / PAGE_ZIP_DIR_SLOT_SIZE);
return(mach_read_from_2(page_zip->data + page_zip_get_size(page_zip)
- PAGE_ZIP_DIR_SLOT_SIZE * (slot + 1)));
}
/** Write a MLOG_ZIP_PAGE_COMPRESS record of compressing an index page.
@param[in,out] block ROW_FORMAT=COMPRESSED index page
@param[in] index the index that the block belongs to
@param[in,out] mtr mini-transaction */
static void page_zip_compress_write_log(buf_block_t* block,
dict_index_t* index, mtr_t* mtr)
{
byte* log_ptr;
ulint trailer_size;
ut_ad(!dict_index_is_ibuf(index));
log_ptr = mlog_open(mtr, 11 + 2 + 2);
if (!log_ptr) {
return;
}
const page_t* page = block->frame;
const page_zip_des_t* page_zip = &block->page.zip;
/* Read the number of user records. */
trailer_size = ulint(page_dir_get_n_heap(page_zip->data))
- PAGE_HEAP_NO_USER_LOW;
/* Multiply by the uncompressed size stored per record */
if (!page_is_leaf(page)) {
trailer_size *= PAGE_ZIP_DIR_SLOT_SIZE + REC_NODE_PTR_SIZE;
} else if (dict_index_is_clust(index)) {
trailer_size *= PAGE_ZIP_DIR_SLOT_SIZE
+ DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN;
} else {
trailer_size *= PAGE_ZIP_DIR_SLOT_SIZE;
}
/* Add the space occupied by BLOB pointers. */
trailer_size += page_zip->n_blobs * BTR_EXTERN_FIELD_REF_SIZE;
ut_a(page_zip->m_end > PAGE_DATA);
compile_time_assert(FIL_PAGE_DATA <= PAGE_DATA);
ut_a(page_zip->m_end + trailer_size <= page_zip_get_size(page_zip));
log_ptr = mlog_write_initial_log_record_low(MLOG_ZIP_PAGE_COMPRESS,
block->page.id.space(),
block->page.id.page_no(),
log_ptr, mtr);
mach_write_to_2(log_ptr, ulint(page_zip->m_end - FIL_PAGE_TYPE));
log_ptr += 2;
mach_write_to_2(log_ptr, trailer_size);
log_ptr += 2;
mlog_close(mtr, log_ptr);
/* Write FIL_PAGE_PREV and FIL_PAGE_NEXT */
mlog_catenate_string(mtr, page_zip->data + FIL_PAGE_PREV, 4);
mlog_catenate_string(mtr, page_zip->data + FIL_PAGE_NEXT, 4);
/* Write most of the page header, the compressed stream and
the modification log. */
mlog_catenate_string(mtr, page_zip->data + FIL_PAGE_TYPE,
ulint(page_zip->m_end - FIL_PAGE_TYPE));
/* Write the uncompressed trailer of the compressed page. */
mlog_catenate_string(mtr, page_zip->data + page_zip_get_size(page_zip)
- trailer_size, trailer_size);
if (!innodb_log_optimize_ddl) {
block->page.init_on_flush = true;
}
}
/******************************************************//**
Determine how many externally stored columns are contained
in existing records with smaller heap_no than rec. */
static
ulint
page_zip_get_n_prev_extern(
/*=======================*/
const page_zip_des_t* page_zip,/*!< in: dense page directory on
compressed page */
const rec_t* rec, /*!< in: compact physical record
on a B-tree leaf page */
const dict_index_t* index) /*!< in: record descriptor */
{
const page_t* page = page_align(rec);
ulint n_ext = 0;
ulint i;
ulint left;
ulint heap_no;
ulint n_recs = page_get_n_recs(page_zip->data);
ut_ad(page_is_leaf(page));
ut_ad(page_is_comp(page));
ut_ad(dict_table_is_comp(index->table));
ut_ad(dict_index_is_clust(index));
ut_ad(!dict_index_is_ibuf(index));
heap_no = rec_get_heap_no_new(rec);
ut_ad(heap_no >= PAGE_HEAP_NO_USER_LOW);
left = heap_no - PAGE_HEAP_NO_USER_LOW;
if (UNIV_UNLIKELY(!left)) {
return(0);
}
for (i = 0; i < n_recs; i++) {
const rec_t* r = page + (page_zip_dir_get(page_zip, i)
& PAGE_ZIP_DIR_SLOT_MASK);
if (rec_get_heap_no_new(r) < heap_no) {
n_ext += rec_get_n_extern_new(r, index,
ULINT_UNDEFINED);
if (!--left) {
break;
}
}
}
return(n_ext);
}
/**********************************************************************//**
Encode the length of a fixed-length column.
@return buf + length of encoded val */
static
byte*
page_zip_fixed_field_encode(
/*========================*/
byte* buf, /*!< in: pointer to buffer where to write */
ulint val) /*!< in: value to write */
{
ut_ad(val >= 2);
if (UNIV_LIKELY(val < 126)) {
/*
0 = nullable variable field of at most 255 bytes length;
1 = not null variable field of at most 255 bytes length;
126 = nullable variable field with maximum length >255;
127 = not null variable field with maximum length >255
*/
*buf++ = (byte) val;
} else {
*buf++ = (byte) (0x80 | val >> 8);
*buf++ = (byte) val;
}
return(buf);
}
/**********************************************************************//**
Write the index information for the compressed page.
@return used size of buf */
ulint
page_zip_fields_encode(
/*===================*/
ulint n, /*!< in: number of fields
to compress */
const dict_index_t* index, /*!< in: index comprising
at least n fields */
ulint trx_id_pos,
/*!< in: position of the trx_id column
in the index, or ULINT_UNDEFINED if
this is a non-leaf page */
byte* buf) /*!< out: buffer of (n + 1) * 2 bytes */
{
const byte* buf_start = buf;
ulint i;
ulint col;
ulint trx_id_col = 0;
/* sum of lengths of preceding non-nullable fixed fields, or 0 */
ulint fixed_sum = 0;
ut_ad(trx_id_pos == ULINT_UNDEFINED || trx_id_pos < n);
for (i = col = 0; i < n; i++) {
dict_field_t* field = dict_index_get_nth_field(index, i);
ulint val;
if (dict_field_get_col(field)->prtype & DATA_NOT_NULL) {
val = 1; /* set the "not nullable" flag */
} else {
val = 0; /* nullable field */
}
if (!field->fixed_len) {
/* variable-length field */
const dict_col_t* column
= dict_field_get_col(field);
if (DATA_BIG_COL(column)) {
val |= 0x7e; /* max > 255 bytes */
}
if (fixed_sum) {
/* write out the length of any
preceding non-nullable fields */
buf = page_zip_fixed_field_encode(
buf, fixed_sum << 1 | 1);
fixed_sum = 0;
col++;
}
*buf++ = (byte) val;
col++;
} else if (val) {
/* fixed-length non-nullable field */
if (fixed_sum && UNIV_UNLIKELY
(fixed_sum + field->fixed_len
> DICT_MAX_FIXED_COL_LEN)) {
/* Write out the length of the
preceding non-nullable fields,
to avoid exceeding the maximum
length of a fixed-length column. */
buf = page_zip_fixed_field_encode(
buf, fixed_sum << 1 | 1);
fixed_sum = 0;
col++;
}
  507. if (i && UNIV_UNLIKELY(i == trx_id_pos)) {
  508. if (fixed_sum) {
  509. /* Write out the length of any
  510. preceding non-nullable fields,
  511. and start a new trx_id column. */
  512. buf = page_zip_fixed_field_encode(
  513. buf, fixed_sum << 1 | 1);
  514. col++;
  515. }
  516. trx_id_col = col;
  517. fixed_sum = field->fixed_len;
  518. } else {
  519. /* add to the sum */
  520. fixed_sum += field->fixed_len;
  521. }
  522. } else {
  523. /* fixed-length nullable field */
  524. if (fixed_sum) {
  525. /* write out the length of any
  526. preceding non-nullable fields */
  527. buf = page_zip_fixed_field_encode(
  528. buf, fixed_sum << 1 | 1);
  529. fixed_sum = 0;
  530. col++;
  531. }
  532. buf = page_zip_fixed_field_encode(
  533. buf, ulint(field->fixed_len) << 1);
  534. col++;
  535. }
  536. }
  537. if (fixed_sum) {
  538. /* Write out the lengths of last fixed-length columns. */
  539. buf = page_zip_fixed_field_encode(buf, fixed_sum << 1 | 1);
  540. }
  541. if (trx_id_pos != ULINT_UNDEFINED) {
  542. /* Write out the position of the trx_id column */
  543. i = trx_id_col;
  544. } else {
  545. /* Write out the number of nullable fields */
  546. i = index->n_nullable;
  547. }
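/* The last value encoded above tells page_zip_fields_decode()
either where DB_TRX_ID starts (leaf pages; 0 means that there is no
trx_id column, as on secondary index leaf pages) or how many fields
are nullable (node pointer pages). */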
  548. if (i < 128) {
  549. *buf++ = (byte) i;
  550. } else {
  551. *buf++ = (byte) (0x80 | i >> 8);
  552. *buf++ = (byte) i;
  553. }
  554. ut_ad((ulint) (buf - buf_start) <= (n + 2) * 2);
  555. return((ulint) (buf - buf_start));
  556. }
  557. /**********************************************************************//**
  558. Populate the dense page directory from the sparse directory. */
  559. static
  560. void
  561. page_zip_dir_encode(
  562. /*================*/
  563. const page_t* page, /*!< in: compact page */
  564. byte* buf, /*!< in: pointer to dense page directory[-1];
  565. out: dense directory on compressed page */
  566. const rec_t** recs) /*!< in: pointer to an array of 0, or NULL;
  567. out: dense page directory sorted by ascending
  568. address (and heap_no) */
  569. {
  570. const byte* rec;
  571. ulint status;
  572. ulint min_mark;
  573. ulint heap_no;
  574. ulint i;
  575. ulint n_heap;
  576. ulint offs;
  577. min_mark = 0;
  578. if (page_is_leaf(page)) {
  579. status = REC_STATUS_ORDINARY;
  580. } else {
  581. status = REC_STATUS_NODE_PTR;
  582. if (UNIV_UNLIKELY(!page_has_prev(page))) {
  583. min_mark = REC_INFO_MIN_REC_FLAG;
  584. }
  585. }
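/* Each dense directory slot is the 2-byte offset of a record; the
two most significant bits, which cannot be part of an offset covered
by PAGE_ZIP_DIR_SLOT_MASK, are reused below as the
PAGE_ZIP_DIR_SLOT_OWNED and PAGE_ZIP_DIR_SLOT_DEL flags. */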
  586. n_heap = page_dir_get_n_heap(page);
  587. /* Traverse the list of stored records in the collation order,
  588. starting from the first user record. */
  589. rec = page + PAGE_NEW_INFIMUM;
  590. i = 0;
  591. for (;;) {
  592. ulint info_bits;
  593. offs = rec_get_next_offs(rec, TRUE);
  594. if (UNIV_UNLIKELY(offs == PAGE_NEW_SUPREMUM)) {
  595. break;
  596. }
  597. rec = page + offs;
  598. heap_no = rec_get_heap_no_new(rec);
  599. ut_a(heap_no >= PAGE_HEAP_NO_USER_LOW);
  600. ut_a(heap_no < n_heap);
  601. ut_a(offs < srv_page_size - PAGE_DIR);
  602. ut_a(offs >= PAGE_ZIP_START);
  603. compile_time_assert(!(PAGE_ZIP_DIR_SLOT_MASK
  604. & (PAGE_ZIP_DIR_SLOT_MASK + 1)));
  605. compile_time_assert(PAGE_ZIP_DIR_SLOT_MASK
  606. >= UNIV_ZIP_SIZE_MAX - 1);
  607. if (UNIV_UNLIKELY(rec_get_n_owned_new(rec) != 0)) {
  608. offs |= PAGE_ZIP_DIR_SLOT_OWNED;
  609. }
  610. info_bits = rec_get_info_bits(rec, TRUE);
  611. if (info_bits & REC_INFO_DELETED_FLAG) {
  612. info_bits &= ~REC_INFO_DELETED_FLAG;
  613. offs |= PAGE_ZIP_DIR_SLOT_DEL;
  614. }
  615. ut_a(info_bits == min_mark);
  616. /* Only the smallest user record can have
  617. REC_INFO_MIN_REC_FLAG set. */
  618. min_mark = 0;
  619. mach_write_to_2(buf - PAGE_ZIP_DIR_SLOT_SIZE * ++i, offs);
  620. if (UNIV_LIKELY_NULL(recs)) {
  621. /* Ensure that each heap_no occurs at most once. */
  622. ut_a(!recs[heap_no - PAGE_HEAP_NO_USER_LOW]);
  623. /* exclude infimum and supremum */
  624. recs[heap_no - PAGE_HEAP_NO_USER_LOW] = rec;
  625. }
  626. ut_a(ulint(rec_get_status(rec)) == status);
  627. }
  628. offs = page_header_get_field(page, PAGE_FREE);
  629. /* Traverse the free list (of deleted records). */
  630. while (offs) {
  631. ut_ad(!(offs & ~PAGE_ZIP_DIR_SLOT_MASK));
  632. rec = page + offs;
  633. heap_no = rec_get_heap_no_new(rec);
  634. ut_a(heap_no >= PAGE_HEAP_NO_USER_LOW);
  635. ut_a(heap_no < n_heap);
  636. ut_a(!rec[-REC_N_NEW_EXTRA_BYTES]); /* info_bits and n_owned */
  637. ut_a(ulint(rec_get_status(rec)) == status);
  638. mach_write_to_2(buf - PAGE_ZIP_DIR_SLOT_SIZE * ++i, offs);
  639. if (UNIV_LIKELY_NULL(recs)) {
  640. /* Ensure that each heap_no occurs at most once. */
  641. ut_a(!recs[heap_no - PAGE_HEAP_NO_USER_LOW]);
  642. /* exclude infimum and supremum */
  643. recs[heap_no - PAGE_HEAP_NO_USER_LOW] = rec;
  644. }
  645. offs = rec_get_next_offs(rec, TRUE);
  646. }
  647. /* Ensure that each heap no occurs at least once. */
  648. ut_a(i + PAGE_HEAP_NO_USER_LOW == n_heap);
  649. }
  650. extern "C" {
  651. /**********************************************************************//**
  652. Allocate memory for zlib. */
  653. static
  654. void*
  655. page_zip_zalloc(
  656. /*============*/
  657. void* opaque, /*!< in/out: memory heap */
  658. uInt items, /*!< in: number of items to allocate */
  659. uInt size) /*!< in: size of an item in bytes */
  660. {
  661. return(mem_heap_zalloc(static_cast<mem_heap_t*>(opaque), items * size));
  662. }
  663. /**********************************************************************//**
  664. Deallocate memory for zlib. */
  665. static
  666. void
  667. page_zip_free(
  668. /*==========*/
  669. void* opaque MY_ATTRIBUTE((unused)), /*!< in: memory heap */
  670. void* address MY_ATTRIBUTE((unused)))/*!< in: object to free */
  671. {
  672. }
  673. } /* extern "C" */
  674. /**********************************************************************//**
  675. Configure the zlib allocator to use the given memory heap. */
  676. void
  677. page_zip_set_alloc(
  678. /*===============*/
  679. void* stream, /*!< in/out: zlib stream */
  680. mem_heap_t* heap) /*!< in: memory heap to use */
  681. {
  682. z_stream* strm = static_cast<z_stream*>(stream);
  683. strm->zalloc = page_zip_zalloc;
  684. strm->zfree = page_zip_free;
  685. strm->opaque = heap;
  686. }
  687. #if 0 || defined UNIV_DEBUG || defined UNIV_ZIP_DEBUG
  688. /** Symbol for enabling compression and decompression diagnostics */
  689. # define PAGE_ZIP_COMPRESS_DBG
  690. #endif
  691. #ifdef PAGE_ZIP_COMPRESS_DBG
  692. /** Set this variable in a debugger to enable
  693. excessive logging in page_zip_compress(). */
  694. static bool page_zip_compress_dbg;
  695. /** Set this variable in a debugger to enable
  696. binary logging of the data passed to deflate().
  697. When this variable is nonzero, it will act
  698. as a log file name generator. */
  699. static unsigned page_zip_compress_log;
  700. /**********************************************************************//**
  701. Wrapper for deflate(). Log the operation if page_zip_compress_dbg is set.
  702. @return deflate() status: Z_OK, Z_BUF_ERROR, ... */
  703. static
  704. int
  705. page_zip_compress_deflate(
  706. /*======================*/
  707. FILE* logfile,/*!< in: log file, or NULL */
  708. z_streamp strm, /*!< in/out: compressed stream for deflate() */
  709. int flush) /*!< in: deflate() flushing method */
  710. {
  711. int status;
  712. if (UNIV_UNLIKELY(page_zip_compress_dbg)) {
  713. ut_print_buf(stderr, strm->next_in, strm->avail_in);
  714. }
  715. if (UNIV_LIKELY_NULL(logfile)) {
  716. if (fwrite(strm->next_in, 1, strm->avail_in, logfile)
  717. != strm->avail_in) {
  718. perror("fwrite");
  719. }
  720. }
  721. status = deflate(strm, flush);
  722. if (UNIV_UNLIKELY(page_zip_compress_dbg)) {
  723. fprintf(stderr, " -> %d\n", status);
  724. }
  725. return(status);
  726. }
  727. /* Redefine deflate(). */
  728. # undef deflate
  729. /** Debug wrapper for the zlib compression routine deflate().
  730. Log the operation if page_zip_compress_dbg is set.
  731. @param strm in/out: compressed stream
  732. @param flush in: flushing method
  733. @return deflate() status: Z_OK, Z_BUF_ERROR, ... */
  734. # define deflate(strm, flush) page_zip_compress_deflate(logfile, strm, flush)
  735. /** Declaration of the logfile parameter */
  736. # define FILE_LOGFILE FILE* logfile,
  737. /** The logfile parameter */
  738. # define LOGFILE logfile,
  739. #else /* PAGE_ZIP_COMPRESS_DBG */
  740. /** Empty declaration of the logfile parameter */
  741. # define FILE_LOGFILE
  742. /** Missing logfile parameter */
  743. # define LOGFILE
  744. #endif /* PAGE_ZIP_COMPRESS_DBG */
  745. /**********************************************************************//**
  746. Compress the records of a node pointer page.
  747. @return Z_OK, or a zlib error code */
  748. static
  749. int
  750. page_zip_compress_node_ptrs(
  751. /*========================*/
  752. FILE_LOGFILE
  753. z_stream* c_stream, /*!< in/out: compressed page stream */
  754. const rec_t** recs, /*!< in: dense page directory
  755. sorted by address */
  756. ulint n_dense, /*!< in: size of recs[] */
  757. dict_index_t* index, /*!< in: the index of the page */
  758. byte* storage, /*!< in: end of dense page directory */
  759. mem_heap_t* heap) /*!< in: temporary memory heap */
  760. {
  761. int err = Z_OK;
  762. offset_t* offsets = NULL;
  763. do {
  764. const rec_t* rec = *recs++;
  765. offsets = rec_get_offsets(rec, index, offsets, false,
  766. ULINT_UNDEFINED, &heap);
  767. /* Only leaf nodes may contain externally stored columns. */
  768. ut_ad(!rec_offs_any_extern(offsets));
  769. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  770. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  771. rec_offs_extra_size(offsets));
  772. /* Compress the extra bytes. */
  773. c_stream->avail_in = static_cast<uInt>(
  774. rec - REC_N_NEW_EXTRA_BYTES - c_stream->next_in);
  775. if (c_stream->avail_in) {
  776. err = deflate(c_stream, Z_NO_FLUSH);
  777. if (UNIV_UNLIKELY(err != Z_OK)) {
  778. break;
  779. }
  780. }
  781. ut_ad(!c_stream->avail_in);
  782. /* Compress the data bytes, except node_ptr. */
  783. c_stream->next_in = (byte*) rec;
  784. c_stream->avail_in = static_cast<uInt>(
  785. rec_offs_data_size(offsets) - REC_NODE_PTR_SIZE);
  786. if (c_stream->avail_in) {
  787. err = deflate(c_stream, Z_NO_FLUSH);
  788. if (UNIV_UNLIKELY(err != Z_OK)) {
  789. break;
  790. }
  791. }
  792. ut_ad(!c_stream->avail_in);
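/* The child page number itself is never deflated; it is copied to
the "storage" area that grows downwards from the dense directory,
so that page_zip_write_node_ptr() can later update it in place on
the compressed page. */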
  793. memcpy(storage - REC_NODE_PTR_SIZE
  794. * (rec_get_heap_no_new(rec) - 1),
  795. c_stream->next_in, REC_NODE_PTR_SIZE);
  796. c_stream->next_in += REC_NODE_PTR_SIZE;
  797. } while (--n_dense);
  798. return(err);
  799. }
  800. /**********************************************************************//**
  801. Compress the records of a leaf node of a secondary index.
  802. @return Z_OK, or a zlib error code */
  803. static
  804. int
  805. page_zip_compress_sec(
  806. /*==================*/
  807. FILE_LOGFILE
  808. z_stream* c_stream, /*!< in/out: compressed page stream */
  809. const rec_t** recs, /*!< in: dense page directory
  810. sorted by address */
  811. ulint n_dense) /*!< in: size of recs[] */
  812. {
  813. int err = Z_OK;
  814. ut_ad(n_dense > 0);
  815. do {
  816. const rec_t* rec = *recs++;
  817. /* Compress everything up to this record. */
  818. c_stream->avail_in = static_cast<uInt>(
  819. rec - REC_N_NEW_EXTRA_BYTES
  820. - c_stream->next_in);
  821. if (UNIV_LIKELY(c_stream->avail_in != 0)) {
  822. UNIV_MEM_ASSERT_RW(c_stream->next_in,
  823. c_stream->avail_in);
  824. err = deflate(c_stream, Z_NO_FLUSH);
  825. if (UNIV_UNLIKELY(err != Z_OK)) {
  826. break;
  827. }
  828. }
  829. ut_ad(!c_stream->avail_in);
  830. ut_ad(c_stream->next_in == rec - REC_N_NEW_EXTRA_BYTES);
  831. /* Skip the REC_N_NEW_EXTRA_BYTES. */
  832. c_stream->next_in = (byte*) rec;
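/* The REC_N_NEW_EXTRA_BYTES fixed header bytes are never deflated;
page_zip_set_extra_bytes() reconstructs the info bits, n_owned and
next-record pointers from the dense directory when the page is
decompressed. */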
  833. } while (--n_dense);
  834. return(err);
  835. }
  836. /**********************************************************************//**
  837. Compress a record of a leaf node of a clustered index that contains
  838. externally stored columns.
  839. @return Z_OK, or a zlib error code */
  840. static
  841. int
  842. page_zip_compress_clust_ext(
  843. /*========================*/
  844. FILE_LOGFILE
  845. z_stream* c_stream, /*!< in/out: compressed page stream */
  846. const rec_t* rec, /*!< in: record */
  847. const offset_t* offsets, /*!< in: rec_get_offsets(rec) */
848. ulint trx_id_col, /*!< in: position of DB_TRX_ID */

  849. byte* deleted, /*!< in: dense directory entry pointing
  850. to the head of the free list */
  851. byte* storage, /*!< in: end of dense page directory */
  852. byte** externs, /*!< in/out: pointer to the next
  853. available BLOB pointer */
  854. ulint* n_blobs) /*!< in/out: number of
  855. externally stored columns */
  856. {
  857. int err;
  858. ulint i;
  859. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  860. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  861. rec_offs_extra_size(offsets));
  862. for (i = 0; i < rec_offs_n_fields(offsets); i++) {
  863. ulint len;
  864. const byte* src;
  865. if (UNIV_UNLIKELY(i == trx_id_col)) {
  866. ut_ad(!rec_offs_nth_extern(offsets, i));
  867. /* Store trx_id and roll_ptr
  868. in uncompressed form. */
  869. src = rec_get_nth_field(rec, offsets, i, &len);
  870. ut_ad(src + DATA_TRX_ID_LEN
  871. == rec_get_nth_field(rec, offsets,
  872. i + 1, &len));
  873. ut_ad(len == DATA_ROLL_PTR_LEN);
  874. /* Compress any preceding bytes. */
  875. c_stream->avail_in = static_cast<uInt>(
  876. src - c_stream->next_in);
  877. if (c_stream->avail_in) {
  878. err = deflate(c_stream, Z_NO_FLUSH);
  879. if (UNIV_UNLIKELY(err != Z_OK)) {
  880. return(err);
  881. }
  882. }
  883. ut_ad(!c_stream->avail_in);
  884. ut_ad(c_stream->next_in == src);
  885. memcpy(storage
  886. - (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)
  887. * (rec_get_heap_no_new(rec) - 1),
  888. c_stream->next_in,
  889. DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  890. c_stream->next_in
  891. += DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN;
  892. /* Skip also roll_ptr */
  893. i++;
  894. } else if (rec_offs_nth_extern(offsets, i)) {
  895. src = rec_get_nth_field(rec, offsets, i, &len);
  896. ut_ad(len >= BTR_EXTERN_FIELD_REF_SIZE);
  897. src += len - BTR_EXTERN_FIELD_REF_SIZE;
  898. c_stream->avail_in = static_cast<uInt>(
  899. src - c_stream->next_in);
  900. if (UNIV_LIKELY(c_stream->avail_in != 0)) {
  901. err = deflate(c_stream, Z_NO_FLUSH);
  902. if (UNIV_UNLIKELY(err != Z_OK)) {
  903. return(err);
  904. }
  905. }
  906. ut_ad(!c_stream->avail_in);
  907. ut_ad(c_stream->next_in == src);
  908. /* Reserve space for the data at
  909. the end of the space reserved for
  910. the compressed data and the page
  911. modification log. */
  912. if (UNIV_UNLIKELY
  913. (c_stream->avail_out
  914. <= BTR_EXTERN_FIELD_REF_SIZE)) {
  915. /* out of space */
  916. return(Z_BUF_ERROR);
  917. }
  918. ut_ad(*externs == c_stream->next_out
  919. + c_stream->avail_out
  920. + 1/* end of modif. log */);
  921. c_stream->next_in
  922. += BTR_EXTERN_FIELD_REF_SIZE;
  923. /* Skip deleted records. */
  924. if (UNIV_LIKELY_NULL
  925. (page_zip_dir_find_low(
  926. storage, deleted,
  927. page_offset(rec)))) {
  928. continue;
  929. }
  930. (*n_blobs)++;
  931. c_stream->avail_out
  932. -= BTR_EXTERN_FIELD_REF_SIZE;
  933. *externs -= BTR_EXTERN_FIELD_REF_SIZE;
  934. /* Copy the BLOB pointer */
  935. memcpy(*externs, c_stream->next_in
  936. - BTR_EXTERN_FIELD_REF_SIZE,
  937. BTR_EXTERN_FIELD_REF_SIZE);
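/* The 20-byte BTR_EXTERN_FIELD_REF therefore remains uncompressed
in the "externs" area between the modification log and the
uncompressed trx_id/roll_ptr columns, so that
page_zip_write_blob_ptr() can modify it without recompressing. */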
  938. }
  939. }
  940. return(Z_OK);
  941. }
  942. /**********************************************************************//**
  943. Compress the records of a leaf node of a clustered index.
  944. @return Z_OK, or a zlib error code */
  945. static
  946. int
  947. page_zip_compress_clust(
  948. /*====================*/
  949. FILE_LOGFILE
  950. z_stream* c_stream, /*!< in/out: compressed page stream */
  951. const rec_t** recs, /*!< in: dense page directory
  952. sorted by address */
  953. ulint n_dense, /*!< in: size of recs[] */
  954. dict_index_t* index, /*!< in: the index of the page */
  955. ulint* n_blobs, /*!< in: 0; out: number of
  956. externally stored columns */
  957. ulint trx_id_col, /*!< index of the trx_id column */
  958. byte* deleted, /*!< in: dense directory entry pointing
  959. to the head of the free list */
  960. byte* storage, /*!< in: end of dense page directory */
  961. mem_heap_t* heap) /*!< in: temporary memory heap */
  962. {
  963. int err = Z_OK;
  964. offset_t* offsets = NULL;
  965. /* BTR_EXTERN_FIELD_REF storage */
  966. byte* externs = storage - n_dense
  967. * (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  968. ut_ad(*n_blobs == 0);
  969. do {
  970. const rec_t* rec = *recs++;
  971. offsets = rec_get_offsets(rec, index, offsets, true,
  972. ULINT_UNDEFINED, &heap);
  973. ut_ad(rec_offs_n_fields(offsets)
  974. == dict_index_get_n_fields(index));
  975. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  976. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  977. rec_offs_extra_size(offsets));
  978. /* Compress the extra bytes. */
  979. c_stream->avail_in = static_cast<uInt>(
  980. rec - REC_N_NEW_EXTRA_BYTES
  981. - c_stream->next_in);
  982. if (c_stream->avail_in) {
  983. err = deflate(c_stream, Z_NO_FLUSH);
  984. if (UNIV_UNLIKELY(err != Z_OK)) {
  985. goto func_exit;
  986. }
  987. }
  988. ut_ad(!c_stream->avail_in);
  989. ut_ad(c_stream->next_in == rec - REC_N_NEW_EXTRA_BYTES);
  990. /* Compress the data bytes. */
  991. c_stream->next_in = (byte*) rec;
  992. /* Check if there are any externally stored columns.
  993. For each externally stored column, store the
  994. BTR_EXTERN_FIELD_REF separately. */
  995. if (rec_offs_any_extern(offsets)) {
  996. ut_ad(dict_index_is_clust(index));
  997. err = page_zip_compress_clust_ext(
  998. LOGFILE
  999. c_stream, rec, offsets, trx_id_col,
  1000. deleted, storage, &externs, n_blobs);
  1001. if (UNIV_UNLIKELY(err != Z_OK)) {
  1002. goto func_exit;
  1003. }
  1004. } else {
  1005. ulint len;
  1006. const byte* src;
  1007. /* Store trx_id and roll_ptr in uncompressed form. */
  1008. src = rec_get_nth_field(rec, offsets,
  1009. trx_id_col, &len);
  1010. ut_ad(src + DATA_TRX_ID_LEN
  1011. == rec_get_nth_field(rec, offsets,
  1012. trx_id_col + 1, &len));
  1013. ut_ad(len == DATA_ROLL_PTR_LEN);
  1014. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  1015. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  1016. rec_offs_extra_size(offsets));
  1017. /* Compress any preceding bytes. */
  1018. c_stream->avail_in = static_cast<uInt>(
  1019. src - c_stream->next_in);
  1020. if (c_stream->avail_in) {
  1021. err = deflate(c_stream, Z_NO_FLUSH);
  1022. if (UNIV_UNLIKELY(err != Z_OK)) {
  1023. return(err);
  1024. }
  1025. }
  1026. ut_ad(!c_stream->avail_in);
  1027. ut_ad(c_stream->next_in == src);
  1028. memcpy(storage
  1029. - (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)
  1030. * (rec_get_heap_no_new(rec) - 1),
  1031. c_stream->next_in,
  1032. DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  1033. c_stream->next_in
  1034. += DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN;
  1035. /* Skip also roll_ptr */
  1036. ut_ad(trx_id_col + 1 < rec_offs_n_fields(offsets));
  1037. }
  1038. /* Compress the last bytes of the record. */
  1039. c_stream->avail_in = static_cast<uInt>(
  1040. rec + rec_offs_data_size(offsets) - c_stream->next_in);
  1041. if (c_stream->avail_in) {
  1042. err = deflate(c_stream, Z_NO_FLUSH);
  1043. if (UNIV_UNLIKELY(err != Z_OK)) {
  1044. goto func_exit;
  1045. }
  1046. }
  1047. ut_ad(!c_stream->avail_in);
  1048. } while (--n_dense);
  1049. func_exit:
1050. return(err);
}
  1051. /** Attempt to compress a ROW_FORMAT=COMPRESSED page.
  1052. @retval true on success
  1053. @retval false on failure; block->page.zip will be left intact. */
  1054. bool
  1055. page_zip_compress(
  1056. buf_block_t* block, /*!< in/out: buffer block */
  1057. dict_index_t* index, /*!< in: index of the B-tree node */
1058. ulint level, /*!< in: compression level */
  1059. mtr_t* mtr) /*!< in/out: mini-transaction */
  1060. {
  1061. z_stream c_stream;
  1062. int err;
  1063. byte* fields; /*!< index field information */
  1064. byte* buf; /*!< compressed payload of the
  1065. page */
  1066. byte* buf_end; /* end of buf */
  1067. ulint n_dense;
  1068. ulint slot_size; /* amount of uncompressed bytes
  1069. per record */
  1070. const rec_t** recs; /*!< dense page directory,
  1071. sorted by address */
  1072. mem_heap_t* heap;
  1073. ulint trx_id_col = ULINT_UNDEFINED;
  1074. ulint n_blobs = 0;
  1075. byte* storage; /* storage of uncompressed
  1076. columns */
  1077. const ulonglong ns = my_interval_timer();
  1078. #ifdef PAGE_ZIP_COMPRESS_DBG
  1079. FILE* logfile = NULL;
  1080. #endif
  1081. /* A local copy of srv_cmp_per_index_enabled to avoid reading that
  1082. variable multiple times in this function since it can be changed at
1083. any time. */
  1084. my_bool cmp_per_index_enabled;
  1085. cmp_per_index_enabled = srv_cmp_per_index_enabled;
  1086. page_t* page = block->frame;
  1087. page_zip_des_t* page_zip = &block->page.zip;
  1088. ut_a(page_is_comp(page));
  1089. ut_a(fil_page_index_page_check(page));
  1090. ut_ad(page_simple_validate_new((page_t*) page));
  1091. ut_ad(page_zip_simple_validate(page_zip));
  1092. ut_ad(dict_table_is_comp(index->table));
  1093. ut_ad(!dict_index_is_ibuf(index));
  1094. UNIV_MEM_ASSERT_RW(page, srv_page_size);
  1095. /* Check the data that will be omitted. */
  1096. ut_a(!memcmp(page + (PAGE_NEW_INFIMUM - REC_N_NEW_EXTRA_BYTES),
  1097. infimum_extra, sizeof infimum_extra));
  1098. ut_a(!memcmp(page + PAGE_NEW_INFIMUM,
  1099. infimum_data, sizeof infimum_data));
  1100. ut_a(page[PAGE_NEW_SUPREMUM - REC_N_NEW_EXTRA_BYTES]
  1101. /* info_bits == 0, n_owned <= max */
  1102. <= PAGE_DIR_SLOT_MAX_N_OWNED);
  1103. ut_a(!memcmp(page + (PAGE_NEW_SUPREMUM - REC_N_NEW_EXTRA_BYTES + 1),
  1104. supremum_extra_data, sizeof supremum_extra_data));
  1105. if (page_is_empty(page)) {
  1106. ut_a(rec_get_next_offs(page + PAGE_NEW_INFIMUM, TRUE)
  1107. == PAGE_NEW_SUPREMUM);
  1108. }
  1109. const ulint n_fields = page_is_leaf(page)
  1110. ? dict_index_get_n_fields(index)
  1111. : dict_index_get_n_unique_in_tree_nonleaf(index);
  1112. index_id_t ind_id = index->id;
  1113. /* The dense directory excludes the infimum and supremum records. */
  1114. n_dense = ulint(page_dir_get_n_heap(page)) - PAGE_HEAP_NO_USER_LOW;
  1115. #ifdef PAGE_ZIP_COMPRESS_DBG
  1116. if (UNIV_UNLIKELY(page_zip_compress_dbg)) {
  1117. ib::info() << "compress "
  1118. << static_cast<void*>(page_zip) << " "
  1119. << static_cast<const void*>(page) << " "
  1120. << page_is_leaf(page) << " "
  1121. << n_fields << " " << n_dense;
  1122. }
  1123. if (UNIV_UNLIKELY(page_zip_compress_log)) {
  1124. /* Create a log file for every compression attempt. */
  1125. char logfilename[9];
  1126. snprintf(logfilename, sizeof logfilename,
  1127. "%08x", page_zip_compress_log++);
  1128. logfile = fopen(logfilename, "wb");
  1129. if (logfile) {
  1130. /* Write the uncompressed page to the log. */
  1131. if (fwrite(page, 1, srv_page_size, logfile)
  1132. != srv_page_size) {
  1133. perror("fwrite");
  1134. }
  1135. /* Record the compressed size as zero.
  1136. This will be overwritten at successful exit. */
  1137. putc(0, logfile);
  1138. putc(0, logfile);
  1139. putc(0, logfile);
  1140. putc(0, logfile);
  1141. }
  1142. }
  1143. #endif /* PAGE_ZIP_COMPRESS_DBG */
  1144. page_zip_stat[page_zip->ssize - 1].compressed++;
  1145. if (cmp_per_index_enabled) {
  1146. mutex_enter(&page_zip_stat_per_index_mutex);
  1147. page_zip_stat_per_index[ind_id].compressed++;
  1148. mutex_exit(&page_zip_stat_per_index_mutex);
  1149. }
  1150. if (UNIV_UNLIKELY(n_dense * PAGE_ZIP_DIR_SLOT_SIZE
  1151. >= page_zip_get_size(page_zip))) {
  1152. goto err_exit;
  1153. }
  1154. MONITOR_INC(MONITOR_PAGE_COMPRESS);
  1155. /* Simulate a compression failure with a probability determined by
  1156. innodb_simulate_comp_failures, only if the page has 2 or more
  1157. records. */
  1158. if (srv_simulate_comp_failures
  1159. && !dict_index_is_ibuf(index)
  1160. && page_get_n_recs(page) >= 2
  1161. && ((ulint)(rand() % 100) < srv_simulate_comp_failures)
  1162. && strcmp(index->table->name.m_name, "IBUF_DUMMY")) {
  1163. #ifdef UNIV_DEBUG
  1164. ib::error()
  1165. << "Simulating a compression failure"
  1166. << " for table " << index->table->name
  1167. << " index "
  1168. << index->name()
  1169. << " page "
  1170. << block->page.id.page_no()
  1171. << "("
  1172. << (page_is_leaf(page) ? "leaf" : "non-leaf")
  1173. << ")";
  1174. #endif
  1175. goto err_exit;
  1176. }
  1177. heap = mem_heap_create(page_zip_get_size(page_zip)
  1178. + n_fields * (2 + sizeof(ulint))
  1179. + REC_OFFS_HEADER_SIZE
  1180. + n_dense * ((sizeof *recs)
  1181. - PAGE_ZIP_DIR_SLOT_SIZE)
  1182. + srv_page_size * 4
  1183. + (512 << MAX_MEM_LEVEL));
  1184. recs = static_cast<const rec_t**>(
  1185. mem_heap_zalloc(heap, n_dense * sizeof *recs));
  1186. fields = static_cast<byte*>(mem_heap_alloc(heap, (n_fields + 1) * 2));
  1187. buf = static_cast<byte*>(
  1188. mem_heap_alloc(heap, page_zip_get_size(page_zip) - PAGE_DATA));
  1189. buf_end = buf + page_zip_get_size(page_zip) - PAGE_DATA;
  1190. /* Compress the data payload. */
  1191. page_zip_set_alloc(&c_stream, heap);
  1192. err = deflateInit2(&c_stream, static_cast<int>(level),
  1193. Z_DEFLATED, srv_page_size_shift,
  1194. MAX_MEM_LEVEL, Z_DEFAULT_STRATEGY);
  1195. ut_a(err == Z_OK);
  1196. c_stream.next_out = buf;
  1197. /* Subtract the space reserved for uncompressed data. */
  1198. /* Page header and the end marker of the modification log */
  1199. c_stream.avail_out = static_cast<uInt>(buf_end - buf - 1);
  1200. /* Dense page directory and uncompressed columns, if any */
  1201. if (page_is_leaf(page)) {
  1202. if (dict_index_is_clust(index)) {
  1203. trx_id_col = index->db_trx_id();
  1204. slot_size = PAGE_ZIP_DIR_SLOT_SIZE
  1205. + DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN;
  1206. } else {
  1207. /* Signal the absence of trx_id
  1208. in page_zip_fields_encode() */
  1209. trx_id_col = 0;
  1210. slot_size = PAGE_ZIP_DIR_SLOT_SIZE;
  1211. }
  1212. } else {
  1213. slot_size = PAGE_ZIP_DIR_SLOT_SIZE + REC_NODE_PTR_SIZE;
  1214. trx_id_col = ULINT_UNDEFINED;
  1215. }
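/* Per user record, slot_size bytes at the end of the compressed
page are thus kept out of the deflate stream: the 2-byte dense
directory slot plus, depending on the page type, the node pointer
(non-leaf pages) or DB_TRX_ID and DB_ROLL_PTR (clustered index leaf
pages); secondary index leaf pages only reserve the directory slot. */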
  1216. if (UNIV_UNLIKELY(c_stream.avail_out <= n_dense * slot_size
  1217. + 6/* sizeof(zlib header and footer) */)) {
  1218. goto zlib_error;
  1219. }
  1220. c_stream.avail_out -= uInt(n_dense * slot_size);
  1221. c_stream.avail_in = uInt(page_zip_fields_encode(n_fields, index,
  1222. trx_id_col, fields));
  1223. c_stream.next_in = fields;
  1224. if (UNIV_LIKELY(!trx_id_col)) {
  1225. trx_id_col = ULINT_UNDEFINED;
  1226. }
  1227. UNIV_MEM_ASSERT_RW(c_stream.next_in, c_stream.avail_in);
  1228. err = deflate(&c_stream, Z_FULL_FLUSH);
  1229. if (err != Z_OK) {
  1230. goto zlib_error;
  1231. }
  1232. ut_ad(!c_stream.avail_in);
  1233. page_zip_dir_encode(page, buf_end, recs);
  1234. c_stream.next_in = (byte*) page + PAGE_ZIP_START;
  1235. storage = buf_end - n_dense * PAGE_ZIP_DIR_SLOT_SIZE;
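/* Layout of the scratch buffer, from buf_end downwards: the dense
page directory; below it, the uncompressed node pointers or
DB_TRX_ID/DB_ROLL_PTR columns (addressed via "storage"); below
those, the BLOB pointers. The deflate output and the modification
log grow upwards from buf. */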
  1236. /* Compress the records in heap_no order. */
  1237. if (UNIV_UNLIKELY(!n_dense)) {
  1238. } else if (!page_is_leaf(page)) {
  1239. /* This is a node pointer page. */
  1240. err = page_zip_compress_node_ptrs(LOGFILE
  1241. &c_stream, recs, n_dense,
  1242. index, storage, heap);
  1243. if (UNIV_UNLIKELY(err != Z_OK)) {
  1244. goto zlib_error;
  1245. }
  1246. } else if (UNIV_LIKELY(trx_id_col == ULINT_UNDEFINED)) {
  1247. /* This is a leaf page in a secondary index. */
  1248. err = page_zip_compress_sec(LOGFILE
  1249. &c_stream, recs, n_dense);
  1250. if (UNIV_UNLIKELY(err != Z_OK)) {
  1251. goto zlib_error;
  1252. }
  1253. } else {
  1254. /* This is a leaf page in a clustered index. */
  1255. err = page_zip_compress_clust(LOGFILE
  1256. &c_stream, recs, n_dense,
  1257. index, &n_blobs, trx_id_col,
  1258. buf_end - PAGE_ZIP_DIR_SLOT_SIZE
  1259. * page_get_n_recs(page),
  1260. storage, heap);
  1261. if (UNIV_UNLIKELY(err != Z_OK)) {
  1262. goto zlib_error;
  1263. }
  1264. }
  1265. /* Finish the compression. */
  1266. ut_ad(!c_stream.avail_in);
  1267. /* Compress any trailing garbage, in case the last record was
  1268. allocated from an originally longer space on the free list,
  1269. or the data of the last record from page_zip_compress_sec(). */
  1270. c_stream.avail_in = static_cast<uInt>(
  1271. page_header_get_field(page, PAGE_HEAP_TOP)
  1272. - (c_stream.next_in - page));
  1273. ut_a(c_stream.avail_in <= srv_page_size - PAGE_ZIP_START - PAGE_DIR);
  1274. UNIV_MEM_ASSERT_RW(c_stream.next_in, c_stream.avail_in);
  1275. err = deflate(&c_stream, Z_FINISH);
  1276. if (UNIV_UNLIKELY(err != Z_STREAM_END)) {
  1277. zlib_error:
  1278. deflateEnd(&c_stream);
  1279. mem_heap_free(heap);
  1280. err_exit:
  1281. #ifdef PAGE_ZIP_COMPRESS_DBG
  1282. if (logfile) {
  1283. fclose(logfile);
  1284. }
  1285. #endif /* PAGE_ZIP_COMPRESS_DBG */
  1286. if (page_is_leaf(page)) {
  1287. dict_index_zip_failure(index);
  1288. }
  1289. const uint64_t time_diff = (my_interval_timer() - ns) / 1000;
  1290. page_zip_stat[page_zip->ssize - 1].compressed_usec
  1291. += time_diff;
  1292. if (cmp_per_index_enabled) {
  1293. mutex_enter(&page_zip_stat_per_index_mutex);
  1294. page_zip_stat_per_index[ind_id].compressed_usec
  1295. += time_diff;
  1296. mutex_exit(&page_zip_stat_per_index_mutex);
  1297. }
  1298. return false;
  1299. }
  1300. err = deflateEnd(&c_stream);
  1301. ut_a(err == Z_OK);
  1302. ut_ad(buf + c_stream.total_out == c_stream.next_out);
  1303. ut_ad((ulint) (storage - c_stream.next_out) >= c_stream.avail_out);
  1304. /* Valgrind believes that zlib does not initialize some bits
  1305. in the last 7 or 8 bytes of the stream. Make Valgrind happy. */
  1306. UNIV_MEM_VALID(buf, c_stream.total_out);
  1307. /* Zero out the area reserved for the modification log.
  1308. Space for the end marker of the modification log is not
  1309. included in avail_out. */
  1310. memset(c_stream.next_out, 0, c_stream.avail_out + 1/* end marker */);
  1311. #ifdef UNIV_DEBUG
  1312. page_zip->m_start =
  1313. #endif /* UNIV_DEBUG */
  1314. page_zip->m_end = unsigned(PAGE_DATA + c_stream.total_out);
  1315. page_zip->m_nonempty = FALSE;
  1316. page_zip->n_blobs = unsigned(n_blobs);
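/* m_end now marks the end of the deflated data; the modification
log, appended to by page_zip_write_rec() until the page has to be
recompressed, grows from here towards the reserved trailer. */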
  1317. /* Copy those header fields that will not be written
  1318. in buf_flush_init_for_writing() */
  1319. memcpy_aligned<8>(page_zip->data + FIL_PAGE_PREV, page + FIL_PAGE_PREV,
  1320. FIL_PAGE_LSN - FIL_PAGE_PREV);
  1321. memcpy_aligned<2>(page_zip->data + FIL_PAGE_TYPE, page + FIL_PAGE_TYPE,
  1322. 2);
  1323. memcpy_aligned<2>(page_zip->data + FIL_PAGE_DATA, page + FIL_PAGE_DATA,
  1324. PAGE_DATA - FIL_PAGE_DATA);
  1325. /* Copy the rest of the compressed page */
  1326. memcpy_aligned<2>(page_zip->data + PAGE_DATA, buf,
  1327. page_zip_get_size(page_zip) - PAGE_DATA);
  1328. mem_heap_free(heap);
  1329. #ifdef UNIV_ZIP_DEBUG
  1330. ut_a(page_zip_validate(page_zip, page, index));
  1331. #endif /* UNIV_ZIP_DEBUG */
  1332. page_zip_compress_write_log(block, index, mtr);
  1333. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  1334. #ifdef PAGE_ZIP_COMPRESS_DBG
  1335. if (logfile) {
  1336. /* Record the compressed size of the block. */
  1337. byte sz[4];
  1338. mach_write_to_4(sz, c_stream.total_out);
  1339. fseek(logfile, srv_page_size, SEEK_SET);
  1340. if (fwrite(sz, 1, sizeof sz, logfile) != sizeof sz) {
  1341. perror("fwrite");
  1342. }
  1343. fclose(logfile);
  1344. }
  1345. #endif /* PAGE_ZIP_COMPRESS_DBG */
  1346. const uint64_t time_diff = (my_interval_timer() - ns) / 1000;
  1347. page_zip_stat[page_zip->ssize - 1].compressed_ok++;
  1348. page_zip_stat[page_zip->ssize - 1].compressed_usec += time_diff;
  1349. if (cmp_per_index_enabled) {
  1350. mutex_enter(&page_zip_stat_per_index_mutex);
  1351. page_zip_stat_per_index[ind_id].compressed_ok++;
  1352. page_zip_stat_per_index[ind_id].compressed_usec += time_diff;
  1353. mutex_exit(&page_zip_stat_per_index_mutex);
  1354. }
  1355. if (page_is_leaf(page)) {
  1356. dict_index_zip_success(index);
  1357. }
  1358. return true;
  1359. }
  1360. /**********************************************************************//**
  1361. Deallocate the index information initialized by page_zip_fields_decode(). */
  1362. static
  1363. void
  1364. page_zip_fields_free(
  1365. /*=================*/
  1366. dict_index_t* index) /*!< in: dummy index to be freed */
  1367. {
  1368. if (index) {
  1369. dict_table_t* table = index->table;
  1370. mutex_free(&index->zip_pad.mutex);
  1371. mem_heap_free(index->heap);
  1372. dict_mem_table_free(table);
  1373. }
  1374. }
  1375. /**********************************************************************//**
  1376. Read the index information for the compressed page.
  1377. @return own: dummy index describing the page, or NULL on error */
  1378. static
  1379. dict_index_t*
  1380. page_zip_fields_decode(
  1381. /*===================*/
  1382. const byte* buf, /*!< in: index information */
  1383. const byte* end, /*!< in: end of buf */
  1384. ulint* trx_id_col,/*!< in: NULL for non-leaf pages;
  1385. for leaf pages, pointer to where to store
  1386. the position of the trx_id column */
1387. bool is_spatial)/*!< in: is spatial index or not */
  1388. {
  1389. const byte* b;
  1390. ulint n;
  1391. ulint i;
  1392. ulint val;
  1393. dict_table_t* table;
  1394. dict_index_t* index;
  1395. /* Determine the number of fields. */
  1396. for (b = buf, n = 0; b < end; n++) {
  1397. if (*b++ & 0x80) {
  1398. b++; /* skip the second byte */
  1399. }
  1400. }
  1401. n--; /* n_nullable or trx_id */
  1402. if (UNIV_UNLIKELY(n > REC_MAX_N_FIELDS)) {
  1403. page_zip_fail(("page_zip_fields_decode: n = %lu\n",
  1404. (ulong) n));
  1405. return(NULL);
  1406. }
  1407. if (UNIV_UNLIKELY(b > end)) {
  1408. page_zip_fail(("page_zip_fields_decode: %p > %p\n",
  1409. (const void*) b, (const void*) end));
  1410. return(NULL);
  1411. }
  1412. table = dict_mem_table_create("ZIP_DUMMY", NULL, n, 0,
  1413. DICT_TF_COMPACT, 0);
  1414. index = dict_mem_index_create(table, "ZIP_DUMMY", 0, n);
  1415. index->n_uniq = unsigned(n);
  1416. /* avoid ut_ad(index->cached) in dict_index_get_n_unique_in_tree */
  1417. index->cached = TRUE;
  1418. /* Initialize the fields. */
  1419. for (b = buf, i = 0; i < n; i++) {
  1420. ulint mtype;
  1421. ulint len;
  1422. val = *b++;
  1423. if (UNIV_UNLIKELY(val & 0x80)) {
  1424. /* fixed length > 62 bytes */
  1425. val = (val & 0x7f) << 8 | *b++;
  1426. len = val >> 1;
  1427. mtype = DATA_FIXBINARY;
  1428. } else if (UNIV_UNLIKELY(val >= 126)) {
  1429. /* variable length with max > 255 bytes */
  1430. len = 0x7fff;
  1431. mtype = DATA_BINARY;
  1432. } else if (val <= 1) {
  1433. /* variable length with max <= 255 bytes */
  1434. len = 0;
  1435. mtype = DATA_BINARY;
  1436. } else {
1437. /* fixed length <= 62 bytes */
  1438. len = val >> 1;
  1439. mtype = DATA_FIXBINARY;
  1440. }
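/* Mirroring page_zip_fixed_field_encode(): the byte 0x15 decodes to
a 10-byte NOT NULL fixed-length column, and the byte pair 0x81 0x91
to a 200-byte NOT NULL fixed length. */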
  1441. dict_mem_table_add_col(table, NULL, NULL, mtype,
  1442. val & 1 ? DATA_NOT_NULL : 0, len);
  1443. dict_index_add_col(index, table,
  1444. dict_table_get_nth_col(table, i), 0);
  1445. }
  1446. val = *b++;
  1447. if (UNIV_UNLIKELY(val & 0x80)) {
  1448. val = (val & 0x7f) << 8 | *b++;
  1449. }
  1450. /* Decode the position of the trx_id column. */
  1451. if (trx_id_col) {
  1452. if (!val) {
  1453. val = ULINT_UNDEFINED;
  1454. } else if (UNIV_UNLIKELY(val >= n)) {
  1455. page_zip_fields_free(index);
  1456. index = NULL;
  1457. } else {
  1458. index->type = DICT_CLUSTERED;
  1459. }
  1460. *trx_id_col = val;
  1461. } else {
  1462. /* Decode the number of nullable fields. */
  1463. if (UNIV_UNLIKELY(index->n_nullable > val)) {
  1464. page_zip_fields_free(index);
  1465. index = NULL;
  1466. } else {
  1467. index->n_nullable = unsigned(val);
  1468. }
  1469. }
  1470. /* ROW_FORMAT=COMPRESSED does not support instant ADD COLUMN */
  1471. index->n_core_fields = index->n_fields;
  1472. index->n_core_null_bytes
  1473. = UT_BITS_IN_BYTES(unsigned(index->n_nullable));
  1474. ut_ad(b == end);
  1475. if (is_spatial) {
  1476. index->type |= DICT_SPATIAL;
  1477. }
  1478. return(index);
  1479. }
  1480. /**********************************************************************//**
  1481. Populate the sparse page directory from the dense directory.
  1482. @return TRUE on success, FALSE on failure */
  1483. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1484. ibool
  1485. page_zip_dir_decode(
  1486. /*================*/
  1487. const page_zip_des_t* page_zip,/*!< in: dense page directory on
  1488. compressed page */
  1489. page_t* page, /*!< in: compact page with valid header;
  1490. out: trailer and sparse page directory
  1491. filled in */
  1492. rec_t** recs, /*!< out: dense page directory sorted by
  1493. ascending address (and heap_no) */
  1494. ulint n_dense)/*!< in: number of user records, and
  1495. size of recs[] */
  1496. {
  1497. ulint i;
  1498. ulint n_recs;
  1499. byte* slot;
  1500. n_recs = page_get_n_recs(page);
  1501. if (UNIV_UNLIKELY(n_recs > n_dense)) {
  1502. page_zip_fail(("page_zip_dir_decode 1: %lu > %lu\n",
  1503. (ulong) n_recs, (ulong) n_dense));
  1504. return(FALSE);
  1505. }
  1506. /* Traverse the list of stored records in the sorting order,
  1507. starting from the first user record. */
  1508. slot = page + (srv_page_size - PAGE_DIR - PAGE_DIR_SLOT_SIZE);
  1509. UNIV_PREFETCH_RW(slot);
  1510. /* Zero out the page trailer. */
  1511. memset(slot + PAGE_DIR_SLOT_SIZE, 0, PAGE_DIR);
  1512. mach_write_to_2(slot, PAGE_NEW_INFIMUM);
  1513. slot -= PAGE_DIR_SLOT_SIZE;
  1514. UNIV_PREFETCH_RW(slot);
  1515. /* Initialize the sparse directory and copy the dense directory. */
  1516. for (i = 0; i < n_recs; i++) {
  1517. ulint offs = page_zip_dir_get(page_zip, i);
  1518. if (offs & PAGE_ZIP_DIR_SLOT_OWNED) {
  1519. mach_write_to_2(slot, offs & PAGE_ZIP_DIR_SLOT_MASK);
  1520. slot -= PAGE_DIR_SLOT_SIZE;
  1521. UNIV_PREFETCH_RW(slot);
  1522. }
  1523. if (UNIV_UNLIKELY((offs & PAGE_ZIP_DIR_SLOT_MASK)
  1524. < PAGE_ZIP_START + REC_N_NEW_EXTRA_BYTES)) {
  1525. page_zip_fail(("page_zip_dir_decode 2: %u %u %lx\n",
  1526. (unsigned) i, (unsigned) n_recs,
  1527. (ulong) offs));
  1528. return(FALSE);
  1529. }
  1530. recs[i] = page + (offs & PAGE_ZIP_DIR_SLOT_MASK);
  1531. }
  1532. mach_write_to_2(slot, PAGE_NEW_SUPREMUM);
  1533. {
  1534. const page_dir_slot_t* last_slot = page_dir_get_nth_slot(
  1535. page, page_dir_get_n_slots(page) - 1U);
  1536. if (UNIV_UNLIKELY(slot != last_slot)) {
  1537. page_zip_fail(("page_zip_dir_decode 3: %p != %p\n",
  1538. (const void*) slot,
  1539. (const void*) last_slot));
  1540. return(FALSE);
  1541. }
  1542. }
  1543. /* Copy the rest of the dense directory. */
  1544. for (; i < n_dense; i++) {
  1545. ulint offs = page_zip_dir_get(page_zip, i);
  1546. if (UNIV_UNLIKELY(offs & ~PAGE_ZIP_DIR_SLOT_MASK)) {
  1547. page_zip_fail(("page_zip_dir_decode 4: %u %u %lx\n",
  1548. (unsigned) i, (unsigned) n_dense,
  1549. (ulong) offs));
  1550. return(FALSE);
  1551. }
  1552. recs[i] = page + offs;
  1553. }
  1554. std::sort(recs, recs + n_dense);
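/* The dense directory lists the user records in collation order
followed by the free list; sorting by address restores heap_no
order (the heap grows towards higher addresses, and a reused
free-list slot keeps its heap_no), which is the order in which the
records were deflated. */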
  1555. return(TRUE);
  1556. }
  1557. /**********************************************************************//**
  1558. Initialize the REC_N_NEW_EXTRA_BYTES of each record.
  1559. @return TRUE on success, FALSE on failure */
  1560. static
  1561. ibool
  1562. page_zip_set_extra_bytes(
  1563. /*=====================*/
  1564. const page_zip_des_t* page_zip,/*!< in: compressed page */
  1565. page_t* page, /*!< in/out: uncompressed page */
  1566. ulint info_bits)/*!< in: REC_INFO_MIN_REC_FLAG or 0 */
  1567. {
  1568. ulint n;
  1569. ulint i;
  1570. ulint n_owned = 1;
  1571. ulint offs;
  1572. rec_t* rec;
  1573. n = page_get_n_recs(page);
  1574. rec = page + PAGE_NEW_INFIMUM;
  1575. for (i = 0; i < n; i++) {
  1576. offs = page_zip_dir_get(page_zip, i);
  1577. if (offs & PAGE_ZIP_DIR_SLOT_DEL) {
  1578. info_bits |= REC_INFO_DELETED_FLAG;
  1579. }
  1580. if (UNIV_UNLIKELY(offs & PAGE_ZIP_DIR_SLOT_OWNED)) {
  1581. info_bits |= n_owned;
  1582. n_owned = 1;
  1583. } else {
  1584. n_owned++;
  1585. }
  1586. offs &= PAGE_ZIP_DIR_SLOT_MASK;
  1587. if (UNIV_UNLIKELY(offs < PAGE_ZIP_START
  1588. + REC_N_NEW_EXTRA_BYTES)) {
  1589. page_zip_fail(("page_zip_set_extra_bytes 1:"
  1590. " %u %u %lx\n",
  1591. (unsigned) i, (unsigned) n,
  1592. (ulong) offs));
  1593. return(FALSE);
  1594. }
  1595. rec_set_next_offs_new(rec, offs);
  1596. rec = page + offs;
  1597. rec[-REC_N_NEW_EXTRA_BYTES] = (byte) info_bits;
  1598. info_bits = 0;
  1599. }
  1600. /* Set the next pointer of the last user record. */
  1601. rec_set_next_offs_new(rec, PAGE_NEW_SUPREMUM);
  1602. /* Set n_owned of the supremum record. */
  1603. page[PAGE_NEW_SUPREMUM - REC_N_NEW_EXTRA_BYTES] = (byte) n_owned;
  1604. /* The dense directory excludes the infimum and supremum records. */
  1605. n = ulint(page_dir_get_n_heap(page)) - PAGE_HEAP_NO_USER_LOW;
  1606. if (i >= n) {
  1607. if (UNIV_LIKELY(i == n)) {
  1608. return(TRUE);
  1609. }
  1610. page_zip_fail(("page_zip_set_extra_bytes 2: %u != %u\n",
  1611. (unsigned) i, (unsigned) n));
  1612. return(FALSE);
  1613. }
  1614. offs = page_zip_dir_get(page_zip, i);
  1615. /* Set the extra bytes of deleted records on the free list. */
  1616. for (;;) {
  1617. if (UNIV_UNLIKELY(!offs)
  1618. || UNIV_UNLIKELY(offs & ~PAGE_ZIP_DIR_SLOT_MASK)) {
  1619. page_zip_fail(("page_zip_set_extra_bytes 3: %lx\n",
  1620. (ulong) offs));
  1621. return(FALSE);
  1622. }
  1623. rec = page + offs;
  1624. rec[-REC_N_NEW_EXTRA_BYTES] = 0; /* info_bits and n_owned */
  1625. if (++i == n) {
  1626. break;
  1627. }
  1628. offs = page_zip_dir_get(page_zip, i);
  1629. rec_set_next_offs_new(rec, offs);
  1630. }
  1631. /* Terminate the free list. */
  1632. rec[-REC_N_NEW_EXTRA_BYTES] = 0; /* info_bits and n_owned */
  1633. rec_set_next_offs_new(rec, 0);
  1634. return(TRUE);
  1635. }
  1636. /**********************************************************************//**
  1637. Apply the modification log to a record containing externally stored
  1638. columns. Do not copy the fields that are stored separately.
  1639. @return pointer to modification log, or NULL on failure */
  1640. static
  1641. const byte*
  1642. page_zip_apply_log_ext(
  1643. /*===================*/
  1644. rec_t* rec, /*!< in/out: record */
  1645. const offset_t* offsets, /*!< in: rec_get_offsets(rec) */
1646. ulint trx_id_col, /*!< in: position of DB_TRX_ID */
  1647. const byte* data, /*!< in: modification log */
  1648. const byte* end) /*!< in: end of modification log */
  1649. {
  1650. ulint i;
  1651. ulint len;
  1652. byte* next_out = rec;
  1653. /* Check if there are any externally stored columns.
  1654. For each externally stored column, skip the
  1655. BTR_EXTERN_FIELD_REF. */
  1656. for (i = 0; i < rec_offs_n_fields(offsets); i++) {
  1657. byte* dst;
  1658. if (UNIV_UNLIKELY(i == trx_id_col)) {
  1659. /* Skip trx_id and roll_ptr */
  1660. dst = rec_get_nth_field(rec, offsets,
  1661. i, &len);
  1662. if (UNIV_UNLIKELY(dst - next_out >= end - data)
  1663. || UNIV_UNLIKELY
  1664. (len < (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN))
  1665. || rec_offs_nth_extern(offsets, i)) {
  1666. page_zip_fail(("page_zip_apply_log_ext:"
  1667. " trx_id len %lu,"
  1668. " %p - %p >= %p - %p\n",
  1669. (ulong) len,
  1670. (const void*) dst,
  1671. (const void*) next_out,
  1672. (const void*) end,
  1673. (const void*) data));
  1674. return(NULL);
  1675. }
  1676. memcpy(next_out, data, ulint(dst - next_out));
  1677. data += ulint(dst - next_out);
  1678. next_out = dst + (DATA_TRX_ID_LEN
  1679. + DATA_ROLL_PTR_LEN);
  1680. } else if (rec_offs_nth_extern(offsets, i)) {
  1681. dst = rec_get_nth_field(rec, offsets,
  1682. i, &len);
  1683. ut_ad(len
  1684. >= BTR_EXTERN_FIELD_REF_SIZE);
  1685. len += ulint(dst - next_out)
  1686. - BTR_EXTERN_FIELD_REF_SIZE;
  1687. if (UNIV_UNLIKELY(data + len >= end)) {
  1688. page_zip_fail(("page_zip_apply_log_ext:"
  1689. " ext %p+%lu >= %p\n",
  1690. (const void*) data,
  1691. (ulong) len,
  1692. (const void*) end));
  1693. return(NULL);
  1694. }
  1695. memcpy(next_out, data, len);
  1696. data += len;
  1697. next_out += len
  1698. + BTR_EXTERN_FIELD_REF_SIZE;
  1699. }
  1700. }
  1701. /* Copy the last bytes of the record. */
  1702. len = ulint(rec_get_end(rec, offsets) - next_out);
  1703. if (UNIV_UNLIKELY(data + len >= end)) {
  1704. page_zip_fail(("page_zip_apply_log_ext:"
  1705. " last %p+%lu >= %p\n",
  1706. (const void*) data,
  1707. (ulong) len,
  1708. (const void*) end));
  1709. return(NULL);
  1710. }
  1711. memcpy(next_out, data, len);
  1712. data += len;
  1713. return(data);
  1714. }
  1715. /**********************************************************************//**
  1716. Apply the modification log to an uncompressed page.
  1717. Do not copy the fields that are stored separately.
  1718. @return pointer to end of modification log, or NULL on failure */
  1719. static
  1720. const byte*
  1721. page_zip_apply_log(
  1722. /*===============*/
  1723. const byte* data, /*!< in: modification log */
  1724. ulint size, /*!< in: maximum length of the log, in bytes */
  1725. rec_t** recs, /*!< in: dense page directory,
  1726. sorted by address (indexed by
  1727. heap_no - PAGE_HEAP_NO_USER_LOW) */
  1728. ulint n_dense,/*!< in: size of recs[] */
  1729. bool is_leaf,/*!< in: whether this is a leaf page */
  1730. ulint trx_id_col,/*!< in: column number of trx_id in the index,
  1731. or ULINT_UNDEFINED if none */
  1732. ulint heap_status,
  1733. /*!< in: heap_no and status bits for
  1734. the next record to uncompress */
  1735. dict_index_t* index, /*!< in: index of the page */
  1736. offset_t* offsets)/*!< in/out: work area for
  1737. rec_get_offsets_reverse() */
  1738. {
  1739. const byte* const end = data + size;
  1740. for (;;) {
  1741. ulint val;
  1742. rec_t* rec;
  1743. ulint len;
  1744. ulint hs;
  1745. val = *data++;
  1746. if (UNIV_UNLIKELY(!val)) {
  1747. return(data - 1);
  1748. }
  1749. if (val & 0x80) {
  1750. val = (val & 0x7f) << 8 | *data++;
  1751. if (UNIV_UNLIKELY(!val)) {
  1752. page_zip_fail(("page_zip_apply_log:"
  1753. " invalid val %x%x\n",
  1754. data[-2], data[-1]));
  1755. return(NULL);
  1756. }
  1757. }
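/* Each log entry begins with (heap_no - 1) << 1 | del_flag, encoded
in one byte, or in two bytes with the 0x80 bit set in the first byte
when the value does not fit in 7 bits; a single 0 byte terminates
the log. */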
  1758. if (UNIV_UNLIKELY(data >= end)) {
  1759. page_zip_fail(("page_zip_apply_log: %p >= %p\n",
  1760. (const void*) data,
  1761. (const void*) end));
  1762. return(NULL);
  1763. }
  1764. if (UNIV_UNLIKELY((val >> 1) > n_dense)) {
  1765. page_zip_fail(("page_zip_apply_log: %lu>>1 > %lu\n",
  1766. (ulong) val, (ulong) n_dense));
  1767. return(NULL);
  1768. }
  1769. /* Determine the heap number and status bits of the record. */
  1770. rec = recs[(val >> 1) - 1];
  1771. hs = ((val >> 1) + 1) << REC_HEAP_NO_SHIFT;
  1772. hs |= heap_status & ((1 << REC_HEAP_NO_SHIFT) - 1);
  1773. /* This may either be an old record that is being
  1774. overwritten (updated in place, or allocated from
  1775. the free list), or a new record, with the next
1776. available heap_no. */
  1777. if (UNIV_UNLIKELY(hs > heap_status)) {
  1778. page_zip_fail(("page_zip_apply_log: %lu > %lu\n",
  1779. (ulong) hs, (ulong) heap_status));
  1780. return(NULL);
  1781. } else if (hs == heap_status) {
  1782. /* A new record was allocated from the heap. */
  1783. if (UNIV_UNLIKELY(val & 1)) {
  1784. /* Only existing records may be cleared. */
  1785. page_zip_fail(("page_zip_apply_log:"
  1786. " attempting to create"
  1787. " deleted rec %lu\n",
  1788. (ulong) hs));
  1789. return(NULL);
  1790. }
  1791. heap_status += 1 << REC_HEAP_NO_SHIFT;
  1792. }
  1793. mach_write_to_2(rec - REC_NEW_HEAP_NO, hs);
  1794. if (val & 1) {
  1795. /* Clear the data bytes of the record. */
  1796. mem_heap_t* heap = NULL;
  1797. offset_t* offs;
  1798. offs = rec_get_offsets(rec, index, offsets, is_leaf,
  1799. ULINT_UNDEFINED, &heap);
  1800. memset(rec, 0, rec_offs_data_size(offs));
  1801. if (UNIV_LIKELY_NULL(heap)) {
  1802. mem_heap_free(heap);
  1803. }
  1804. continue;
  1805. }
  1806. compile_time_assert(REC_STATUS_NODE_PTR == TRUE);
  1807. rec_get_offsets_reverse(data, index,
  1808. hs & REC_STATUS_NODE_PTR,
  1809. offsets);
  1810. /* Silence a debug assertion in rec_offs_make_valid().
  1811. This will be overwritten in page_zip_set_extra_bytes(),
  1812. called by page_zip_decompress_low(). */
  1813. ut_d(rec[-REC_NEW_INFO_BITS] = 0);
  1814. rec_offs_make_valid(rec, index, is_leaf, offsets);
  1815. /* Copy the extra bytes (backwards). */
  1816. {
  1817. byte* start = rec_get_start(rec, offsets);
  1818. byte* b = rec - REC_N_NEW_EXTRA_BYTES;
  1819. while (b != start) {
  1820. *--b = *data++;
  1821. }
  1822. }
  1823. /* Copy the data bytes. */
  1824. if (UNIV_UNLIKELY(rec_offs_any_extern(offsets))) {
  1825. /* Non-leaf nodes should not contain any
  1826. externally stored columns. */
  1827. if (UNIV_UNLIKELY(hs & REC_STATUS_NODE_PTR)) {
  1828. page_zip_fail(("page_zip_apply_log:"
  1829. " %lu&REC_STATUS_NODE_PTR\n",
  1830. (ulong) hs));
  1831. return(NULL);
  1832. }
  1833. data = page_zip_apply_log_ext(
  1834. rec, offsets, trx_id_col, data, end);
  1835. if (UNIV_UNLIKELY(!data)) {
  1836. return(NULL);
  1837. }
  1838. } else if (UNIV_UNLIKELY(hs & REC_STATUS_NODE_PTR)) {
  1839. len = rec_offs_data_size(offsets)
  1840. - REC_NODE_PTR_SIZE;
  1841. /* Copy the data bytes, except node_ptr. */
  1842. if (UNIV_UNLIKELY(data + len >= end)) {
  1843. page_zip_fail(("page_zip_apply_log:"
  1844. " node_ptr %p+%lu >= %p\n",
  1845. (const void*) data,
  1846. (ulong) len,
  1847. (const void*) end));
  1848. return(NULL);
  1849. }
  1850. memcpy(rec, data, len);
  1851. data += len;
  1852. } else if (UNIV_LIKELY(trx_id_col == ULINT_UNDEFINED)) {
  1853. len = rec_offs_data_size(offsets);
  1854. /* Copy all data bytes of
  1855. a record in a secondary index. */
  1856. if (UNIV_UNLIKELY(data + len >= end)) {
  1857. page_zip_fail(("page_zip_apply_log:"
  1858. " sec %p+%lu >= %p\n",
  1859. (const void*) data,
  1860. (ulong) len,
  1861. (const void*) end));
  1862. return(NULL);
  1863. }
  1864. memcpy(rec, data, len);
  1865. data += len;
  1866. } else {
  1867. /* Skip DB_TRX_ID and DB_ROLL_PTR. */
  1868. ulint l = rec_get_nth_field_offs(offsets,
  1869. trx_id_col, &len);
  1870. byte* b;
  1871. if (UNIV_UNLIKELY(data + l >= end)
  1872. || UNIV_UNLIKELY(len < (DATA_TRX_ID_LEN
  1873. + DATA_ROLL_PTR_LEN))) {
  1874. page_zip_fail(("page_zip_apply_log:"
  1875. " trx_id %p+%lu >= %p\n",
  1876. (const void*) data,
  1877. (ulong) l,
  1878. (const void*) end));
  1879. return(NULL);
  1880. }
  1881. /* Copy any preceding data bytes. */
  1882. memcpy(rec, data, l);
  1883. data += l;
  1884. /* Copy any bytes following DB_TRX_ID, DB_ROLL_PTR. */
  1885. b = rec + l + (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  1886. len = ulint(rec_get_end(rec, offsets) - b);
  1887. if (UNIV_UNLIKELY(data + len >= end)) {
  1888. page_zip_fail(("page_zip_apply_log:"
  1889. " clust %p+%lu >= %p\n",
  1890. (const void*) data,
  1891. (ulong) len,
  1892. (const void*) end));
  1893. return(NULL);
  1894. }
  1895. memcpy(b, data, len);
  1896. data += len;
  1897. }
  1898. }
  1899. }
  1900. /**********************************************************************//**
  1901. Set the heap_no in a record, and skip the fixed-size record header
  1902. that is not included in the d_stream.
  1903. @return TRUE on success, FALSE if d_stream does not end at rec */
  1904. static
  1905. ibool
  1906. page_zip_decompress_heap_no(
  1907. /*========================*/
  1908. z_stream* d_stream, /*!< in/out: compressed page stream */
  1909. rec_t* rec, /*!< in/out: record */
  1910. ulint& heap_status) /*!< in/out: heap_no and status bits */
  1911. {
  1912. if (d_stream->next_out != rec - REC_N_NEW_EXTRA_BYTES) {
  1913. /* n_dense has grown since the page was last compressed. */
  1914. return(FALSE);
  1915. }
  1916. /* Skip the REC_N_NEW_EXTRA_BYTES. */
  1917. d_stream->next_out = rec;
  1918. /* Set heap_no and the status bits. */
  1919. mach_write_to_2(rec - REC_NEW_HEAP_NO, heap_status);
  1920. heap_status += 1 << REC_HEAP_NO_SHIFT;
  1921. return(TRUE);
  1922. }
  1923. /**********************************************************************//**
  1924. Decompress the records of a node pointer page.
  1925. @return TRUE on success, FALSE on failure */
  1926. static
  1927. ibool
  1928. page_zip_decompress_node_ptrs(
  1929. /*==========================*/
  1930. page_zip_des_t* page_zip, /*!< in/out: compressed page */
  1931. z_stream* d_stream, /*!< in/out: compressed page stream */
  1932. rec_t** recs, /*!< in: dense page directory
  1933. sorted by address */
  1934. ulint n_dense, /*!< in: size of recs[] */
  1935. dict_index_t* index, /*!< in: the index of the page */
  1936. offset_t* offsets, /*!< in/out: temporary offsets */
  1937. mem_heap_t* heap) /*!< in: temporary memory heap */
  1938. {
  1939. ulint heap_status = REC_STATUS_NODE_PTR
  1940. | PAGE_HEAP_NO_USER_LOW << REC_HEAP_NO_SHIFT;
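/* heap_status packs the status bits of the records on this page
(node pointer records here) with the heap number of the next record
to be decompressed; user records start at heap number
PAGE_HEAP_NO_USER_LOW == 2, after the infimum and supremum. */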
  1941. ulint slot;
  1942. const byte* storage;
  1943. /* Subtract the space reserved for uncompressed data. */
  1944. d_stream->avail_in -= static_cast<uInt>(
  1945. n_dense * (PAGE_ZIP_DIR_SLOT_SIZE + REC_NODE_PTR_SIZE));
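/* This mirrors the space that page_zip_compress() kept out of the
deflate stream: one dense directory slot and one node pointer per
user record. */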
  1946. /* Decompress the records in heap_no order. */
  1947. for (slot = 0; slot < n_dense; slot++) {
  1948. rec_t* rec = recs[slot];
  1949. d_stream->avail_out = static_cast<uInt>(
  1950. rec - REC_N_NEW_EXTRA_BYTES - d_stream->next_out);
  1951. ut_ad(d_stream->avail_out < srv_page_size
  1952. - PAGE_ZIP_START - PAGE_DIR);
  1953. switch (inflate(d_stream, Z_SYNC_FLUSH)) {
  1954. case Z_STREAM_END:
  1955. page_zip_decompress_heap_no(
  1956. d_stream, rec, heap_status);
  1957. goto zlib_done;
  1958. case Z_OK:
  1959. case Z_BUF_ERROR:
  1960. if (!d_stream->avail_out) {
  1961. break;
  1962. }
  1963. /* fall through */
  1964. default:
  1965. page_zip_fail(("page_zip_decompress_node_ptrs:"
  1966. " 1 inflate(Z_SYNC_FLUSH)=%s\n",
  1967. d_stream->msg));
  1968. goto zlib_error;
  1969. }
  1970. if (!page_zip_decompress_heap_no(
  1971. d_stream, rec, heap_status)) {
  1972. ut_ad(0);
  1973. }
  1974. /* Read the offsets. The status bits are needed here. */
  1975. offsets = rec_get_offsets(rec, index, offsets, false,
  1976. ULINT_UNDEFINED, &heap);
  1977. /* Non-leaf nodes should not have any externally
  1978. stored columns. */
  1979. ut_ad(!rec_offs_any_extern(offsets));
  1980. /* Decompress the data bytes, except node_ptr. */
1981. d_stream->avail_out = static_cast<uInt>(
  1982. rec_offs_data_size(offsets) - REC_NODE_PTR_SIZE);
  1983. switch (inflate(d_stream, Z_SYNC_FLUSH)) {
  1984. case Z_STREAM_END:
  1985. goto zlib_done;
  1986. case Z_OK:
  1987. case Z_BUF_ERROR:
  1988. if (!d_stream->avail_out) {
  1989. break;
  1990. }
  1991. /* fall through */
  1992. default:
  1993. page_zip_fail(("page_zip_decompress_node_ptrs:"
  1994. " 2 inflate(Z_SYNC_FLUSH)=%s\n",
  1995. d_stream->msg));
  1996. goto zlib_error;
  1997. }
  1998. /* Clear the node pointer in case the record
  1999. will be deleted and the space will be reallocated
  2000. to a smaller record. */
  2001. memset(d_stream->next_out, 0, REC_NODE_PTR_SIZE);
  2002. d_stream->next_out += REC_NODE_PTR_SIZE;
  2003. ut_ad(d_stream->next_out == rec_get_end(rec, offsets));
  2004. }
  2005. /* Decompress any trailing garbage, in case the last record was
  2006. allocated from an originally longer space on the free list. */
  2007. d_stream->avail_out = static_cast<uInt>(
  2008. page_header_get_field(page_zip->data, PAGE_HEAP_TOP)
  2009. - page_offset(d_stream->next_out));
  2010. if (UNIV_UNLIKELY(d_stream->avail_out > srv_page_size
  2011. - PAGE_ZIP_START - PAGE_DIR)) {
  2012. page_zip_fail(("page_zip_decompress_node_ptrs:"
  2013. " avail_out = %u\n",
  2014. d_stream->avail_out));
  2015. goto zlib_error;
  2016. }
  2017. if (UNIV_UNLIKELY(inflate(d_stream, Z_FINISH) != Z_STREAM_END)) {
  2018. page_zip_fail(("page_zip_decompress_node_ptrs:"
  2019. " inflate(Z_FINISH)=%s\n",
  2020. d_stream->msg));
  2021. zlib_error:
  2022. inflateEnd(d_stream);
  2023. return(FALSE);
  2024. }
  2025. /* Note that d_stream->avail_out > 0 may hold here
  2026. if the modification log is nonempty. */
  2027. zlib_done:
  2028. if (UNIV_UNLIKELY(inflateEnd(d_stream) != Z_OK)) {
  2029. ut_error;
  2030. }
  2031. {
  2032. page_t* page = page_align(d_stream->next_out);
  2033. /* Clear the unused heap space on the uncompressed page. */
  2034. memset(d_stream->next_out, 0,
  2035. ulint(page_dir_get_nth_slot(page,
  2036. page_dir_get_n_slots(page)
  2037. - 1U)
  2038. - d_stream->next_out));
  2039. }
  2040. #ifdef UNIV_DEBUG
  2041. page_zip->m_start = unsigned(PAGE_DATA + d_stream->total_in);
  2042. #endif /* UNIV_DEBUG */
  2043. /* Apply the modification log. */
  2044. {
  2045. const byte* mod_log_ptr;
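/* One byte is added back here because d_stream->avail_in was
initialized in page_zip_decompress_low() without the space reserved
for the end marker of the modification log; the log region passed to
page_zip_apply_log() must include that terminating zero byte. */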
  2046. mod_log_ptr = page_zip_apply_log(d_stream->next_in,
  2047. d_stream->avail_in + 1,
  2048. recs, n_dense, false,
  2049. ULINT_UNDEFINED, heap_status,
  2050. index, offsets);
  2051. if (UNIV_UNLIKELY(!mod_log_ptr)) {
  2052. return(FALSE);
  2053. }
  2054. page_zip->m_end = unsigned(mod_log_ptr - page_zip->data);
  2055. page_zip->m_nonempty = mod_log_ptr != d_stream->next_in;
  2056. }
  2057. if (UNIV_UNLIKELY
  2058. (page_zip_get_trailer_len(page_zip,
  2059. dict_index_is_clust(index))
  2060. + page_zip->m_end >= page_zip_get_size(page_zip))) {
  2061. page_zip_fail(("page_zip_decompress_node_ptrs:"
  2062. " %lu + %lu >= %lu, %lu\n",
  2063. (ulong) page_zip_get_trailer_len(
  2064. page_zip, dict_index_is_clust(index)),
  2065. (ulong) page_zip->m_end,
  2066. (ulong) page_zip_get_size(page_zip),
  2067. (ulong) dict_index_is_clust(index)));
  2068. return(FALSE);
  2069. }
  2070. /* Restore the uncompressed columns in heap_no order. */
  2071. storage = page_zip_dir_start_low(page_zip, n_dense);
  2072. for (slot = 0; slot < n_dense; slot++) {
  2073. rec_t* rec = recs[slot];
  2074. offsets = rec_get_offsets(rec, index, offsets, false,
  2075. ULINT_UNDEFINED, &heap);
  2076. /* Non-leaf nodes should not have any externally
  2077. stored columns. */
  2078. ut_ad(!rec_offs_any_extern(offsets));
  2079. storage -= REC_NODE_PTR_SIZE;
  2080. memcpy(rec_get_end(rec, offsets) - REC_NODE_PTR_SIZE,
  2081. storage, REC_NODE_PTR_SIZE);
  2082. }
  2083. return(TRUE);
  2084. }
  2085. /**********************************************************************//**
  2086. Decompress the records of a leaf node of a secondary index.
  2087. @return TRUE on success, FALSE on failure */
  2088. static
  2089. ibool
  2090. page_zip_decompress_sec(
  2091. /*====================*/
  2092. page_zip_des_t* page_zip, /*!< in/out: compressed page */
  2093. z_stream* d_stream, /*!< in/out: compressed page stream */
  2094. rec_t** recs, /*!< in: dense page directory
  2095. sorted by address */
  2096. ulint n_dense, /*!< in: size of recs[] */
  2097. dict_index_t* index, /*!< in: the index of the page */
  2098. offset_t* offsets) /*!< in/out: temporary offsets */
  2099. {
  2100. ulint heap_status = REC_STATUS_ORDINARY
  2101. | PAGE_HEAP_NO_USER_LOW << REC_HEAP_NO_SHIFT;
  2102. ulint slot;
  2103. ut_a(!dict_index_is_clust(index));
  2104. /* Subtract the space reserved for uncompressed data. */
  2105. d_stream->avail_in -= static_cast<uint>(
  2106. n_dense * PAGE_ZIP_DIR_SLOT_SIZE);
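/* On a secondary index leaf page only the dense directory slots are
stored uncompressed: there are no node pointers and no DB_TRX_ID,
DB_ROLL_PTR or BLOB pointer columns to reserve space for. */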
  2107. for (slot = 0; slot < n_dense; slot++) {
  2108. rec_t* rec = recs[slot];
  2109. /* Decompress everything up to this record. */
  2110. d_stream->avail_out = static_cast<uint>(
  2111. rec - REC_N_NEW_EXTRA_BYTES - d_stream->next_out);
  2112. if (UNIV_LIKELY(d_stream->avail_out)) {
  2113. switch (inflate(d_stream, Z_SYNC_FLUSH)) {
  2114. case Z_STREAM_END:
  2115. page_zip_decompress_heap_no(
  2116. d_stream, rec, heap_status);
  2117. goto zlib_done;
  2118. case Z_OK:
  2119. case Z_BUF_ERROR:
  2120. if (!d_stream->avail_out) {
  2121. break;
  2122. }
  2123. /* fall through */
  2124. default:
  2125. page_zip_fail(("page_zip_decompress_sec:"
  2126. " inflate(Z_SYNC_FLUSH)=%s\n",
  2127. d_stream->msg));
  2128. goto zlib_error;
  2129. }
  2130. }
  2131. if (!page_zip_decompress_heap_no(
  2132. d_stream, rec, heap_status)) {
  2133. ut_ad(0);
  2134. }
  2135. }
  2136. /* Decompress the data of the last record and any trailing garbage,
  2137. in case the last record was allocated from an originally longer space
  2138. on the free list. */
  2139. d_stream->avail_out = static_cast<uInt>(
  2140. page_header_get_field(page_zip->data, PAGE_HEAP_TOP)
  2141. - page_offset(d_stream->next_out));
  2142. if (UNIV_UNLIKELY(d_stream->avail_out > srv_page_size
  2143. - PAGE_ZIP_START - PAGE_DIR)) {
  2144. page_zip_fail(("page_zip_decompress_sec:"
  2145. " avail_out = %u\n",
  2146. d_stream->avail_out));
  2147. goto zlib_error;
  2148. }
  2149. if (UNIV_UNLIKELY(inflate(d_stream, Z_FINISH) != Z_STREAM_END)) {
  2150. page_zip_fail(("page_zip_decompress_sec:"
  2151. " inflate(Z_FINISH)=%s\n",
  2152. d_stream->msg));
  2153. zlib_error:
  2154. inflateEnd(d_stream);
  2155. return(FALSE);
  2156. }
  2157. /* Note that d_stream->avail_out > 0 may hold here
  2158. if the modification log is nonempty. */
  2159. zlib_done:
  2160. if (UNIV_UNLIKELY(inflateEnd(d_stream) != Z_OK)) {
  2161. ut_error;
  2162. }
  2163. {
  2164. page_t* page = page_align(d_stream->next_out);
  2165. /* Clear the unused heap space on the uncompressed page. */
  2166. memset(d_stream->next_out, 0,
  2167. ulint(page_dir_get_nth_slot(page,
  2168. page_dir_get_n_slots(page)
  2169. - 1U)
  2170. - d_stream->next_out));
  2171. }
  2172. ut_d(page_zip->m_start = unsigned(PAGE_DATA + d_stream->total_in));
  2173. /* Apply the modification log. */
  2174. {
  2175. const byte* mod_log_ptr;
  2176. mod_log_ptr = page_zip_apply_log(d_stream->next_in,
  2177. d_stream->avail_in + 1,
  2178. recs, n_dense, true,
  2179. ULINT_UNDEFINED, heap_status,
  2180. index, offsets);
  2181. if (UNIV_UNLIKELY(!mod_log_ptr)) {
  2182. return(FALSE);
  2183. }
  2184. page_zip->m_end = unsigned(mod_log_ptr - page_zip->data);
  2185. page_zip->m_nonempty = mod_log_ptr != d_stream->next_in;
  2186. }
  2187. if (UNIV_UNLIKELY(page_zip_get_trailer_len(page_zip, FALSE)
  2188. + page_zip->m_end >= page_zip_get_size(page_zip))) {
  2189. page_zip_fail(("page_zip_decompress_sec: %lu + %lu >= %lu\n",
  2190. (ulong) page_zip_get_trailer_len(
  2191. page_zip, FALSE),
  2192. (ulong) page_zip->m_end,
  2193. (ulong) page_zip_get_size(page_zip)));
  2194. return(FALSE);
  2195. }
  2196. /* There are no uncompressed columns on leaf pages of
  2197. secondary indexes. */
  2198. return(TRUE);
  2199. }
  2200. /**********************************************************************//**
  2201. Decompress a record of a leaf node of a clustered index that contains
  2202. externally stored columns.
  2203. @return TRUE on success */
  2204. static
  2205. ibool
  2206. page_zip_decompress_clust_ext(
  2207. /*==========================*/
  2208. z_stream* d_stream, /*!< in/out: compressed page stream */
  2209. rec_t* rec, /*!< in/out: record */
  2210. const offset_t* offsets, /*!< in: rec_get_offsets(rec) */
2211. ulint trx_id_col) /*!< in: position of DB_TRX_ID */
  2212. {
  2213. ulint i;
  2214. for (i = 0; i < rec_offs_n_fields(offsets); i++) {
  2215. ulint len;
  2216. byte* dst;
  2217. if (UNIV_UNLIKELY(i == trx_id_col)) {
  2218. /* Skip trx_id and roll_ptr */
  2219. dst = rec_get_nth_field(rec, offsets, i, &len);
  2220. if (UNIV_UNLIKELY(len < DATA_TRX_ID_LEN
  2221. + DATA_ROLL_PTR_LEN)) {
  2222. page_zip_fail(("page_zip_decompress_clust_ext:"
  2223. " len[%lu] = %lu\n",
  2224. (ulong) i, (ulong) len));
  2225. return(FALSE);
  2226. }
  2227. if (rec_offs_nth_extern(offsets, i)) {
  2228. page_zip_fail(("page_zip_decompress_clust_ext:"
  2229. " DB_TRX_ID at %lu is ext\n",
  2230. (ulong) i));
  2231. return(FALSE);
  2232. }
  2233. d_stream->avail_out = static_cast<uInt>(
  2234. dst - d_stream->next_out);
  2235. switch (inflate(d_stream, Z_SYNC_FLUSH)) {
  2236. case Z_STREAM_END:
  2237. case Z_OK:
  2238. case Z_BUF_ERROR:
  2239. if (!d_stream->avail_out) {
  2240. break;
  2241. }
  2242. /* fall through */
  2243. default:
  2244. page_zip_fail(("page_zip_decompress_clust_ext:"
  2245. " 1 inflate(Z_SYNC_FLUSH)=%s\n",
  2246. d_stream->msg));
  2247. return(FALSE);
  2248. }
  2249. ut_ad(d_stream->next_out == dst);
  2250. /* Clear DB_TRX_ID and DB_ROLL_PTR in order to
  2251. avoid uninitialized bytes in case the record
  2252. is affected by page_zip_apply_log(). */
  2253. memset(dst, 0, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  2254. d_stream->next_out += DATA_TRX_ID_LEN
  2255. + DATA_ROLL_PTR_LEN;
  2256. } else if (rec_offs_nth_extern(offsets, i)) {
  2257. dst = rec_get_nth_field(rec, offsets, i, &len);
  2258. ut_ad(len >= BTR_EXTERN_FIELD_REF_SIZE);
  2259. dst += len - BTR_EXTERN_FIELD_REF_SIZE;
  2260. d_stream->avail_out = static_cast<uInt>(
  2261. dst - d_stream->next_out);
  2262. switch (inflate(d_stream, Z_SYNC_FLUSH)) {
  2263. case Z_STREAM_END:
  2264. case Z_OK:
  2265. case Z_BUF_ERROR:
  2266. if (!d_stream->avail_out) {
  2267. break;
  2268. }
  2269. /* fall through */
  2270. default:
  2271. page_zip_fail(("page_zip_decompress_clust_ext:"
  2272. " 2 inflate(Z_SYNC_FLUSH)=%s\n",
  2273. d_stream->msg));
  2274. return(FALSE);
  2275. }
  2276. ut_ad(d_stream->next_out == dst);
  2277. /* Clear the BLOB pointer in case
  2278. the record will be deleted and the
  2279. space will not be reused. Note that
  2280. the final initialization of the BLOB
  2281. pointers (copying from "externs"
  2282. or clearing) will have to take place
  2283. only after the page modification log
  2284. has been applied. Otherwise, we
  2285. could end up with an uninitialized
  2286. BLOB pointer when a record is deleted,
  2287. reallocated and deleted. */
  2288. memset(d_stream->next_out, 0,
  2289. BTR_EXTERN_FIELD_REF_SIZE);
  2290. d_stream->next_out
  2291. += BTR_EXTERN_FIELD_REF_SIZE;
  2292. }
  2293. }
  2294. return(TRUE);
  2295. }
  2296. /**********************************************************************//**
2297. Decompress the records of a leaf node of a clustered index.
  2298. @return TRUE on success, FALSE on failure */
  2299. static
  2300. ibool
  2301. page_zip_decompress_clust(
  2302. /*======================*/
  2303. page_zip_des_t* page_zip, /*!< in/out: compressed page */
  2304. z_stream* d_stream, /*!< in/out: compressed page stream */
  2305. rec_t** recs, /*!< in: dense page directory
  2306. sorted by address */
  2307. ulint n_dense, /*!< in: size of recs[] */
  2308. dict_index_t* index, /*!< in: the index of the page */
  2309. ulint trx_id_col, /*!< index of the trx_id column */
  2310. offset_t* offsets, /*!< in/out: temporary offsets */
  2311. mem_heap_t* heap) /*!< in: temporary memory heap */
  2312. {
  2313. int err;
  2314. ulint slot;
  2315. ulint heap_status = REC_STATUS_ORDINARY
  2316. | PAGE_HEAP_NO_USER_LOW << REC_HEAP_NO_SHIFT;
  2317. const byte* storage;
  2318. const byte* externs;
  2319. ut_a(dict_index_is_clust(index));
  2320. /* Subtract the space reserved for uncompressed data. */
  2321. d_stream->avail_in -= static_cast<uInt>(n_dense)
  2322. * (PAGE_ZIP_CLUST_LEAF_SLOT_SIZE);
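/* PAGE_ZIP_CLUST_LEAF_SLOT_SIZE covers the dense directory slot plus
the DB_TRX_ID and DB_ROLL_PTR columns that are stored uncompressed in
the trailer for every record of a clustered index leaf page. */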
  2323. /* Decompress the records in heap_no order. */
  2324. for (slot = 0; slot < n_dense; slot++) {
  2325. rec_t* rec = recs[slot];
2326. d_stream->avail_out = static_cast<uInt>(
  2327. rec - REC_N_NEW_EXTRA_BYTES - d_stream->next_out);
  2328. ut_ad(d_stream->avail_out < srv_page_size
  2329. - PAGE_ZIP_START - PAGE_DIR);
  2330. err = inflate(d_stream, Z_SYNC_FLUSH);
  2331. switch (err) {
  2332. case Z_STREAM_END:
  2333. page_zip_decompress_heap_no(
  2334. d_stream, rec, heap_status);
  2335. goto zlib_done;
  2336. case Z_OK:
  2337. case Z_BUF_ERROR:
  2338. if (UNIV_LIKELY(!d_stream->avail_out)) {
  2339. break;
  2340. }
  2341. /* fall through */
  2342. default:
  2343. page_zip_fail(("page_zip_decompress_clust:"
  2344. " 1 inflate(Z_SYNC_FLUSH)=%s\n",
  2345. d_stream->msg));
  2346. goto zlib_error;
  2347. }
  2348. if (!page_zip_decompress_heap_no(
  2349. d_stream, rec, heap_status)) {
  2350. ut_ad(0);
  2351. }
  2352. /* Read the offsets. The status bits are needed here. */
  2353. offsets = rec_get_offsets(rec, index, offsets, true,
  2354. ULINT_UNDEFINED, &heap);
  2355. /* This is a leaf page in a clustered index. */
  2356. /* Check if there are any externally stored columns.
  2357. For each externally stored column, restore the
  2358. BTR_EXTERN_FIELD_REF separately. */
  2359. if (rec_offs_any_extern(offsets)) {
  2360. if (UNIV_UNLIKELY
  2361. (!page_zip_decompress_clust_ext(
  2362. d_stream, rec, offsets, trx_id_col))) {
  2363. goto zlib_error;
  2364. }
  2365. } else {
  2366. /* Skip trx_id and roll_ptr */
  2367. ulint len;
  2368. byte* dst = rec_get_nth_field(rec, offsets,
  2369. trx_id_col, &len);
  2370. if (UNIV_UNLIKELY(len < DATA_TRX_ID_LEN
  2371. + DATA_ROLL_PTR_LEN)) {
  2372. page_zip_fail(("page_zip_decompress_clust:"
  2373. " len = %lu\n", (ulong) len));
  2374. goto zlib_error;
  2375. }
  2376. d_stream->avail_out = static_cast<uInt>(
  2377. dst - d_stream->next_out);
  2378. switch (inflate(d_stream, Z_SYNC_FLUSH)) {
  2379. case Z_STREAM_END:
  2380. case Z_OK:
  2381. case Z_BUF_ERROR:
  2382. if (!d_stream->avail_out) {
  2383. break;
  2384. }
  2385. /* fall through */
  2386. default:
  2387. page_zip_fail(("page_zip_decompress_clust:"
  2388. " 2 inflate(Z_SYNC_FLUSH)=%s\n",
  2389. d_stream->msg));
  2390. goto zlib_error;
  2391. }
  2392. ut_ad(d_stream->next_out == dst);
  2393. /* Clear DB_TRX_ID and DB_ROLL_PTR in order to
  2394. avoid uninitialized bytes in case the record
  2395. is affected by page_zip_apply_log(). */
  2396. memset(dst, 0, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  2397. d_stream->next_out += DATA_TRX_ID_LEN
  2398. + DATA_ROLL_PTR_LEN;
  2399. }
  2400. /* Decompress the last bytes of the record. */
  2401. d_stream->avail_out = static_cast<uInt>(
  2402. rec_get_end(rec, offsets) - d_stream->next_out);
  2403. switch (inflate(d_stream, Z_SYNC_FLUSH)) {
  2404. case Z_STREAM_END:
  2405. case Z_OK:
  2406. case Z_BUF_ERROR:
  2407. if (!d_stream->avail_out) {
  2408. break;
  2409. }
  2410. /* fall through */
  2411. default:
  2412. page_zip_fail(("page_zip_decompress_clust:"
  2413. " 3 inflate(Z_SYNC_FLUSH)=%s\n",
  2414. d_stream->msg));
  2415. goto zlib_error;
  2416. }
  2417. }
  2418. /* Decompress any trailing garbage, in case the last record was
  2419. allocated from an originally longer space on the free list. */
  2420. d_stream->avail_out = static_cast<uInt>(
  2421. page_header_get_field(page_zip->data, PAGE_HEAP_TOP)
  2422. - page_offset(d_stream->next_out));
  2423. if (UNIV_UNLIKELY(d_stream->avail_out > srv_page_size
  2424. - PAGE_ZIP_START - PAGE_DIR)) {
  2425. page_zip_fail(("page_zip_decompress_clust:"
  2426. " avail_out = %u\n",
  2427. d_stream->avail_out));
  2428. goto zlib_error;
  2429. }
  2430. if (UNIV_UNLIKELY(inflate(d_stream, Z_FINISH) != Z_STREAM_END)) {
  2431. page_zip_fail(("page_zip_decompress_clust:"
  2432. " inflate(Z_FINISH)=%s\n",
  2433. d_stream->msg));
  2434. zlib_error:
  2435. inflateEnd(d_stream);
  2436. return(FALSE);
  2437. }
  2438. /* Note that d_stream->avail_out > 0 may hold here
  2439. if the modification log is nonempty. */
  2440. zlib_done:
  2441. if (UNIV_UNLIKELY(inflateEnd(d_stream) != Z_OK)) {
  2442. ut_error;
  2443. }
  2444. {
  2445. page_t* page = page_align(d_stream->next_out);
  2446. /* Clear the unused heap space on the uncompressed page. */
  2447. memset(d_stream->next_out, 0,
  2448. ulint(page_dir_get_nth_slot(page,
  2449. page_dir_get_n_slots(page)
  2450. - 1U)
  2451. - d_stream->next_out));
  2452. }
  2453. ut_d(page_zip->m_start = unsigned(PAGE_DATA + d_stream->total_in));
  2454. /* Apply the modification log. */
  2455. {
  2456. const byte* mod_log_ptr;
  2457. mod_log_ptr = page_zip_apply_log(d_stream->next_in,
  2458. d_stream->avail_in + 1,
  2459. recs, n_dense, true,
  2460. trx_id_col, heap_status,
  2461. index, offsets);
  2462. if (UNIV_UNLIKELY(!mod_log_ptr)) {
  2463. return(FALSE);
  2464. }
  2465. page_zip->m_end = unsigned(mod_log_ptr - page_zip->data);
  2466. page_zip->m_nonempty = mod_log_ptr != d_stream->next_in;
  2467. }
  2468. if (UNIV_UNLIKELY(page_zip_get_trailer_len(page_zip, TRUE)
  2469. + page_zip->m_end >= page_zip_get_size(page_zip))) {
  2470. page_zip_fail(("page_zip_decompress_clust: %lu + %lu >= %lu\n",
  2471. (ulong) page_zip_get_trailer_len(
  2472. page_zip, TRUE),
  2473. (ulong) page_zip->m_end,
  2474. (ulong) page_zip_get_size(page_zip)));
  2475. return(FALSE);
  2476. }
  2477. storage = page_zip_dir_start_low(page_zip, n_dense);
  2478. externs = storage - n_dense
  2479. * (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
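/* The trailer grows downwards from the end of page_zip->data: the
dense page directory, then one DB_TRX_ID,DB_ROLL_PTR pair per record
("storage"), and below that the BLOB pointers ("externs"), which are
likewise filled in decreasing address order. */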
  2480. /* Restore the uncompressed columns in heap_no order. */
  2481. for (slot = 0; slot < n_dense; slot++) {
  2482. ulint i;
  2483. ulint len;
  2484. byte* dst;
  2485. rec_t* rec = recs[slot];
  2486. bool exists = !page_zip_dir_find_free(
  2487. page_zip, page_offset(rec));
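/* If page_zip_dir_find_free() finds the record in the free part of
the dense directory, the record is deleted and its BLOB pointers were
not stored in the trailer; they will be cleared instead of restored
below. */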
  2488. offsets = rec_get_offsets(rec, index, offsets, true,
  2489. ULINT_UNDEFINED, &heap);
  2490. dst = rec_get_nth_field(rec, offsets,
  2491. trx_id_col, &len);
  2492. ut_ad(len >= DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  2493. storage -= DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN;
  2494. memcpy(dst, storage,
  2495. DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  2496. /* Check if there are any externally stored
  2497. columns in this record. For each externally
  2498. stored column, restore or clear the
  2499. BTR_EXTERN_FIELD_REF. */
  2500. if (!rec_offs_any_extern(offsets)) {
  2501. continue;
  2502. }
  2503. for (i = 0; i < rec_offs_n_fields(offsets); i++) {
  2504. if (!rec_offs_nth_extern(offsets, i)) {
  2505. continue;
  2506. }
  2507. dst = rec_get_nth_field(rec, offsets, i, &len);
  2508. if (UNIV_UNLIKELY(len < BTR_EXTERN_FIELD_REF_SIZE)) {
  2509. page_zip_fail(("page_zip_decompress_clust:"
  2510. " %lu < 20\n",
  2511. (ulong) len));
  2512. return(FALSE);
  2513. }
  2514. dst += len - BTR_EXTERN_FIELD_REF_SIZE;
  2515. if (UNIV_LIKELY(exists)) {
  2516. /* Existing record:
  2517. restore the BLOB pointer */
  2518. externs -= BTR_EXTERN_FIELD_REF_SIZE;
  2519. if (UNIV_UNLIKELY
  2520. (externs < page_zip->data
  2521. + page_zip->m_end)) {
  2522. page_zip_fail(("page_zip_"
  2523. "decompress_clust:"
  2524. " %p < %p + %lu\n",
  2525. (const void*) externs,
  2526. (const void*)
  2527. page_zip->data,
  2528. (ulong)
  2529. page_zip->m_end));
  2530. return(FALSE);
  2531. }
  2532. memcpy(dst, externs,
  2533. BTR_EXTERN_FIELD_REF_SIZE);
  2534. page_zip->n_blobs++;
  2535. } else {
  2536. /* Deleted record:
  2537. clear the BLOB pointer */
  2538. memset(dst, 0,
  2539. BTR_EXTERN_FIELD_REF_SIZE);
  2540. }
  2541. }
  2542. }
  2543. return(TRUE);
  2544. }
  2545. /**********************************************************************//**
  2546. Decompress a page. This function should tolerate errors on the compressed
  2547. page. Instead of letting assertions fail, it will return FALSE if an
  2548. inconsistency is detected.
  2549. @return TRUE on success, FALSE on failure */
  2550. static
  2551. ibool
  2552. page_zip_decompress_low(
  2553. /*====================*/
  2554. page_zip_des_t* page_zip,/*!< in: data, ssize;
  2555. out: m_start, m_end, m_nonempty, n_blobs */
  2556. page_t* page, /*!< out: uncompressed page, may be trashed */
  2557. ibool all) /*!< in: TRUE=decompress the whole page;
  2558. FALSE=verify but do not copy some
  2559. page header fields that should not change
  2560. after page creation */
  2561. {
  2562. z_stream d_stream;
  2563. dict_index_t* index = NULL;
  2564. rec_t** recs; /*!< dense page directory, sorted by address */
  2565. ulint n_dense;/* number of user records on the page */
  2566. ulint trx_id_col = ULINT_UNDEFINED;
  2567. mem_heap_t* heap;
  2568. offset_t* offsets;
  2569. ut_ad(page_zip_simple_validate(page_zip));
  2570. UNIV_MEM_ASSERT_W(page, srv_page_size);
  2571. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  2572. /* The dense directory excludes the infimum and supremum records. */
  2573. n_dense = page_dir_get_n_heap(page_zip->data) - PAGE_HEAP_NO_USER_LOW;
  2574. if (UNIV_UNLIKELY(n_dense * PAGE_ZIP_DIR_SLOT_SIZE
  2575. >= page_zip_get_size(page_zip))) {
  2576. page_zip_fail(("page_zip_decompress 1: %lu %lu\n",
  2577. (ulong) n_dense,
  2578. (ulong) page_zip_get_size(page_zip)));
  2579. return(FALSE);
  2580. }
  2581. heap = mem_heap_create(n_dense * (3 * sizeof *recs) + srv_page_size);
  2582. recs = static_cast<rec_t**>(
  2583. mem_heap_alloc(heap, n_dense * sizeof *recs));
  2584. if (all) {
  2585. /* Copy the page header. */
  2586. memcpy_aligned<2>(page, page_zip->data, PAGE_DATA);
  2587. } else {
  2588. /* Check that the bytes that we skip are identical. */
  2589. #if defined UNIV_DEBUG || defined UNIV_ZIP_DEBUG
  2590. ut_a(!memcmp(FIL_PAGE_TYPE + page,
  2591. FIL_PAGE_TYPE + page_zip->data,
  2592. PAGE_HEADER - FIL_PAGE_TYPE));
  2593. ut_a(!memcmp(PAGE_HEADER + PAGE_LEVEL + page,
  2594. PAGE_HEADER + PAGE_LEVEL + page_zip->data,
  2595. PAGE_DATA - (PAGE_HEADER + PAGE_LEVEL)));
  2596. #endif /* UNIV_DEBUG || UNIV_ZIP_DEBUG */
  2597. /* Copy the mutable parts of the page header. */
  2598. memcpy_aligned<8>(page, page_zip->data, FIL_PAGE_TYPE);
  2599. memcpy_aligned<2>(PAGE_HEADER + page,
  2600. PAGE_HEADER + page_zip->data,
  2601. PAGE_LEVEL - PAGE_N_DIR_SLOTS);
  2602. #if defined UNIV_DEBUG || defined UNIV_ZIP_DEBUG
  2603. /* Check that the page headers match after copying. */
  2604. ut_a(!memcmp(page, page_zip->data, PAGE_DATA));
  2605. #endif /* UNIV_DEBUG || UNIV_ZIP_DEBUG */
  2606. }
  2607. #ifdef UNIV_ZIP_DEBUG
  2608. /* Clear the uncompressed page, except the header. */
  2609. memset(PAGE_DATA + page, 0x55, srv_page_size - PAGE_DATA);
  2610. #endif /* UNIV_ZIP_DEBUG */
  2611. UNIV_MEM_INVALID(PAGE_DATA + page, srv_page_size - PAGE_DATA);
  2612. /* Copy the page directory. */
  2613. if (UNIV_UNLIKELY(!page_zip_dir_decode(page_zip, page, recs,
  2614. n_dense))) {
  2615. zlib_error:
  2616. mem_heap_free(heap);
  2617. return(FALSE);
  2618. }
  2619. /* Copy the infimum and supremum records. */
  2620. memcpy(page + (PAGE_NEW_INFIMUM - REC_N_NEW_EXTRA_BYTES),
  2621. infimum_extra, sizeof infimum_extra);
  2622. if (page_is_empty(page)) {
  2623. rec_set_next_offs_new(page + PAGE_NEW_INFIMUM,
  2624. PAGE_NEW_SUPREMUM);
  2625. } else {
  2626. rec_set_next_offs_new(page + PAGE_NEW_INFIMUM,
  2627. page_zip_dir_get(page_zip, 0)
  2628. & PAGE_ZIP_DIR_SLOT_MASK);
  2629. }
  2630. memcpy(page + PAGE_NEW_INFIMUM, infimum_data, sizeof infimum_data);
  2631. memcpy_aligned<4>(PAGE_NEW_SUPREMUM - REC_N_NEW_EXTRA_BYTES + 1
  2632. + page, supremum_extra_data,
  2633. sizeof supremum_extra_data);
  2634. page_zip_set_alloc(&d_stream, heap);
  2635. d_stream.next_in = page_zip->data + PAGE_DATA;
  2636. /* Subtract the space reserved for
  2637. the page header and the end marker of the modification log. */
  2638. d_stream.avail_in = static_cast<uInt>(
  2639. page_zip_get_size(page_zip) - (PAGE_DATA + 1));
  2640. d_stream.next_out = page + PAGE_ZIP_START;
  2641. d_stream.avail_out = uInt(srv_page_size - PAGE_ZIP_START);
  2642. if (UNIV_UNLIKELY(inflateInit2(&d_stream, srv_page_size_shift)
  2643. != Z_OK)) {
  2644. ut_error;
  2645. }
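/* The window size (windowBits = srv_page_size_shift) matches the
deflateInit2() parameters used by page_zip_compress(), so inflate()
can resolve back-references anywhere within the uncompressed page. */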
  2646. /* Decode the zlib header and the index information. */
  2647. if (UNIV_UNLIKELY(inflate(&d_stream, Z_BLOCK) != Z_OK)) {
  2648. page_zip_fail(("page_zip_decompress:"
  2649. " 1 inflate(Z_BLOCK)=%s\n", d_stream.msg));
  2650. goto zlib_error;
  2651. }
  2652. if (UNIV_UNLIKELY(inflate(&d_stream, Z_BLOCK) != Z_OK)) {
  2653. page_zip_fail(("page_zip_decompress:"
  2654. " 2 inflate(Z_BLOCK)=%s\n", d_stream.msg));
  2655. goto zlib_error;
  2656. }
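/* With Z_BLOCK, the first call above stops after the zlib stream
header and the second one after the next deflate block boundary,
which covers the serialized index field information that
page_zip_fields_decode() parses below. */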
  2657. index = page_zip_fields_decode(
  2658. page + PAGE_ZIP_START, d_stream.next_out,
  2659. page_is_leaf(page) ? &trx_id_col : NULL,
  2660. fil_page_get_type(page) == FIL_PAGE_RTREE);
  2661. if (UNIV_UNLIKELY(!index)) {
  2662. goto zlib_error;
  2663. }
  2664. /* Decompress the user records. */
  2665. page_zip->n_blobs = 0;
  2666. d_stream.next_out = page + PAGE_ZIP_START;
  2667. {
  2668. /* Pre-allocate the offsets for rec_get_offsets_reverse(). */
  2669. ulint n = 1 + 1/* node ptr */ + REC_OFFS_HEADER_SIZE
  2670. + dict_index_get_n_fields(index);
  2671. offsets = static_cast<offset_t*>(
  2672. mem_heap_alloc(heap, n * sizeof(ulint)));
  2673. rec_offs_set_n_alloc(offsets, n);
  2674. }
  2675. /* Decompress the records in heap_no order. */
  2676. if (!page_is_leaf(page)) {
  2677. /* This is a node pointer page. */
  2678. ulint info_bits;
  2679. if (UNIV_UNLIKELY
  2680. (!page_zip_decompress_node_ptrs(page_zip, &d_stream,
  2681. recs, n_dense, index,
  2682. offsets, heap))) {
  2683. goto err_exit;
  2684. }
  2685. info_bits = page_has_prev(page) ? 0 : REC_INFO_MIN_REC_FLAG;
  2686. if (UNIV_UNLIKELY(!page_zip_set_extra_bytes(page_zip, page,
  2687. info_bits))) {
  2688. goto err_exit;
  2689. }
  2690. } else if (UNIV_LIKELY(trx_id_col == ULINT_UNDEFINED)) {
  2691. /* This is a leaf page in a secondary index. */
  2692. if (UNIV_UNLIKELY(!page_zip_decompress_sec(page_zip, &d_stream,
  2693. recs, n_dense,
  2694. index, offsets))) {
  2695. goto err_exit;
  2696. }
  2697. if (UNIV_UNLIKELY(!page_zip_set_extra_bytes(page_zip,
  2698. page, 0))) {
  2699. err_exit:
  2700. page_zip_fields_free(index);
  2701. mem_heap_free(heap);
  2702. return(FALSE);
  2703. }
  2704. } else {
  2705. /* This is a leaf page in a clustered index. */
  2706. if (UNIV_UNLIKELY(!page_zip_decompress_clust(page_zip,
  2707. &d_stream, recs,
  2708. n_dense, index,
  2709. trx_id_col,
  2710. offsets, heap))) {
  2711. goto err_exit;
  2712. }
  2713. if (UNIV_UNLIKELY(!page_zip_set_extra_bytes(page_zip,
  2714. page, 0))) {
  2715. goto err_exit;
  2716. }
  2717. }
  2718. ut_a(page_is_comp(page));
  2719. UNIV_MEM_ASSERT_RW(page, srv_page_size);
  2720. page_zip_fields_free(index);
  2721. mem_heap_free(heap);
  2722. return(TRUE);
  2723. }
  2724. /**********************************************************************//**
  2725. Decompress a page. This function should tolerate errors on the compressed
  2726. page. Instead of letting assertions fail, it will return FALSE if an
  2727. inconsistency is detected.
  2728. @return TRUE on success, FALSE on failure */
  2729. ibool
  2730. page_zip_decompress(
  2731. /*================*/
  2732. page_zip_des_t* page_zip,/*!< in: data, ssize;
  2733. out: m_start, m_end, m_nonempty, n_blobs */
  2734. page_t* page, /*!< out: uncompressed page, may be trashed */
  2735. ibool all) /*!< in: TRUE=decompress the whole page;
  2736. FALSE=verify but do not copy some
  2737. page header fields that should not change
  2738. after page creation */
  2739. {
  2740. const ulonglong ns = my_interval_timer();
  2741. if (!page_zip_decompress_low(page_zip, page, all)) {
  2742. return(FALSE);
  2743. }
  2744. const uint64_t time_diff = (my_interval_timer() - ns) / 1000;
  2745. page_zip_stat[page_zip->ssize - 1].decompressed++;
  2746. page_zip_stat[page_zip->ssize - 1].decompressed_usec += time_diff;
  2747. index_id_t index_id = btr_page_get_index_id(page);
  2748. if (srv_cmp_per_index_enabled) {
  2749. mutex_enter(&page_zip_stat_per_index_mutex);
  2750. page_zip_stat_per_index[index_id].decompressed++;
  2751. page_zip_stat_per_index[index_id].decompressed_usec += time_diff;
  2752. mutex_exit(&page_zip_stat_per_index_mutex);
  2753. }
  2754. /* Update the stat counter for LRU policy. */
  2755. buf_LRU_stat_inc_unzip();
  2756. MONITOR_INC(MONITOR_PAGE_DECOMPRESS);
  2757. return(TRUE);
  2758. }
  2759. #ifdef UNIV_ZIP_DEBUG
  2760. /**********************************************************************//**
  2761. Dump a block of memory on the standard error stream. */
  2762. static
  2763. void
  2764. page_zip_hexdump_func(
  2765. /*==================*/
  2766. const char* name, /*!< in: name of the data structure */
  2767. const void* buf, /*!< in: data */
  2768. ulint size) /*!< in: length of the data, in bytes */
  2769. {
  2770. const byte* s = static_cast<const byte*>(buf);
  2771. ulint addr;
  2772. const ulint width = 32; /* bytes per line */
  2773. fprintf(stderr, "%s:\n", name);
  2774. for (addr = 0; addr < size; addr += width) {
  2775. ulint i;
  2776. fprintf(stderr, "%04lx ", (ulong) addr);
  2777. i = ut_min(width, size - addr);
  2778. while (i--) {
  2779. fprintf(stderr, "%02x", *s++);
  2780. }
  2781. putc('\n', stderr);
  2782. }
  2783. }
  2784. /** Dump a block of memory on the standard error stream.
  2785. @param buf in: data
  2786. @param size in: length of the data, in bytes */
  2787. #define page_zip_hexdump(buf, size) page_zip_hexdump_func(#buf, buf, size)
  2788. /** Flag: make page_zip_validate() compare page headers only */
  2789. bool page_zip_validate_header_only;
  2790. /**********************************************************************//**
  2791. Check that the compressed and decompressed pages match.
  2792. @return TRUE if valid, FALSE if not */
  2793. ibool
  2794. page_zip_validate_low(
  2795. /*==================*/
  2796. const page_zip_des_t* page_zip,/*!< in: compressed page */
  2797. const page_t* page, /*!< in: uncompressed page */
  2798. const dict_index_t* index, /*!< in: index of the page, if known */
  2799. ibool sloppy) /*!< in: FALSE=strict,
  2800. TRUE=ignore the MIN_REC_FLAG */
  2801. {
  2802. page_zip_des_t temp_page_zip;
  2803. ibool valid;
  2804. if (memcmp(page_zip->data + FIL_PAGE_PREV, page + FIL_PAGE_PREV,
  2805. FIL_PAGE_LSN - FIL_PAGE_PREV)
  2806. || memcmp(page_zip->data + FIL_PAGE_TYPE, page + FIL_PAGE_TYPE, 2)
  2807. || memcmp(page_zip->data + FIL_PAGE_DATA, page + FIL_PAGE_DATA,
  2808. PAGE_DATA - FIL_PAGE_DATA)) {
  2809. page_zip_fail(("page_zip_validate: page header\n"));
  2810. page_zip_hexdump(page_zip, sizeof *page_zip);
  2811. page_zip_hexdump(page_zip->data, page_zip_get_size(page_zip));
  2812. page_zip_hexdump(page, srv_page_size);
  2813. return(FALSE);
  2814. }
  2815. ut_a(page_is_comp(page));
  2816. if (page_zip_validate_header_only) {
  2817. return(TRUE);
  2818. }
  2819. /* page_zip_decompress() expects the uncompressed page to be
  2820. srv_page_size aligned. */
  2821. page_t* temp_page = static_cast<byte*>(aligned_malloc(srv_page_size,
  2822. srv_page_size));
  2823. UNIV_MEM_ASSERT_RW(page, srv_page_size);
  2824. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  2825. temp_page_zip = *page_zip;
  2826. valid = page_zip_decompress_low(&temp_page_zip, temp_page, TRUE);
  2827. if (!valid) {
  2828. fputs("page_zip_validate(): failed to decompress\n", stderr);
  2829. goto func_exit;
  2830. }
  2831. if (page_zip->n_blobs != temp_page_zip.n_blobs) {
  2832. page_zip_fail(("page_zip_validate: n_blobs: %u!=%u\n",
  2833. page_zip->n_blobs, temp_page_zip.n_blobs));
  2834. valid = FALSE;
  2835. }
  2836. #ifdef UNIV_DEBUG
  2837. if (page_zip->m_start != temp_page_zip.m_start) {
  2838. page_zip_fail(("page_zip_validate: m_start: %u!=%u\n",
  2839. page_zip->m_start, temp_page_zip.m_start));
  2840. valid = FALSE;
  2841. }
  2842. #endif /* UNIV_DEBUG */
  2843. if (page_zip->m_end != temp_page_zip.m_end) {
  2844. page_zip_fail(("page_zip_validate: m_end: %u!=%u\n",
  2845. page_zip->m_end, temp_page_zip.m_end));
  2846. valid = FALSE;
  2847. }
  2848. if (page_zip->m_nonempty != temp_page_zip.m_nonempty) {
  2849. page_zip_fail(("page_zip_validate(): m_nonempty: %u!=%u\n",
  2850. page_zip->m_nonempty,
  2851. temp_page_zip.m_nonempty));
  2852. valid = FALSE;
  2853. }
  2854. if (memcmp(page + PAGE_HEADER, temp_page + PAGE_HEADER,
  2855. srv_page_size - PAGE_HEADER - FIL_PAGE_DATA_END)) {
  2856. /* In crash recovery, the "minimum record" flag may be
  2857. set incorrectly until the mini-transaction is
  2858. committed. Let us tolerate that difference when we
  2859. are performing a sloppy validation. */
  2860. offset_t* offsets;
  2861. mem_heap_t* heap;
  2862. const rec_t* rec;
  2863. const rec_t* trec;
  2864. byte info_bits_diff;
  2865. ulint offset
  2866. = rec_get_next_offs(page + PAGE_NEW_INFIMUM, TRUE);
  2867. ut_a(offset >= PAGE_NEW_SUPREMUM);
  2868. offset -= 5/*REC_NEW_INFO_BITS*/;
  2869. info_bits_diff = page[offset] ^ temp_page[offset];
  2870. if (info_bits_diff == REC_INFO_MIN_REC_FLAG) {
  2871. temp_page[offset] = page[offset];
  2872. if (!memcmp(page + PAGE_HEADER,
  2873. temp_page + PAGE_HEADER,
  2874. srv_page_size - PAGE_HEADER
  2875. - FIL_PAGE_DATA_END)) {
  2876. /* Only the minimum record flag
  2877. differed. Let us ignore it. */
  2878. page_zip_fail(("page_zip_validate:"
  2879. " min_rec_flag"
  2880. " (%s%lu,%lu,0x%02lx)\n",
  2881. sloppy ? "ignored, " : "",
  2882. page_get_space_id(page),
  2883. page_get_page_no(page),
  2884. (ulong) page[offset]));
  2885. /* We don't check for spatial index, since
  2886. the "minimum record" could be deleted when
  2887. doing rtr_update_mbr_field.
  2888. GIS_FIXME: need to validate why
2889. rtr_update_mbr_field() could affect this */
  2890. if (index && dict_index_is_spatial(index)) {
  2891. valid = true;
  2892. } else {
  2893. valid = sloppy;
  2894. }
  2895. goto func_exit;
  2896. }
  2897. }
  2898. /* Compare the pointers in the PAGE_FREE list. */
  2899. rec = page_header_get_ptr(page, PAGE_FREE);
  2900. trec = page_header_get_ptr(temp_page, PAGE_FREE);
  2901. while (rec || trec) {
  2902. if (page_offset(rec) != page_offset(trec)) {
  2903. page_zip_fail(("page_zip_validate:"
  2904. " PAGE_FREE list: %u!=%u\n",
  2905. (unsigned) page_offset(rec),
  2906. (unsigned) page_offset(trec)));
  2907. valid = FALSE;
  2908. goto func_exit;
  2909. }
  2910. rec = page_rec_get_next_low(rec, TRUE);
  2911. trec = page_rec_get_next_low(trec, TRUE);
  2912. }
  2913. /* Compare the records. */
  2914. heap = NULL;
  2915. offsets = NULL;
  2916. rec = page_rec_get_next_low(
  2917. page + PAGE_NEW_INFIMUM, TRUE);
  2918. trec = page_rec_get_next_low(
  2919. temp_page + PAGE_NEW_INFIMUM, TRUE);
  2920. const bool is_leaf = page_is_leaf(page);
  2921. do {
  2922. if (page_offset(rec) != page_offset(trec)) {
  2923. page_zip_fail(("page_zip_validate:"
  2924. " record list: 0x%02x!=0x%02x\n",
  2925. (unsigned) page_offset(rec),
  2926. (unsigned) page_offset(trec)));
  2927. valid = FALSE;
  2928. break;
  2929. }
  2930. if (index) {
  2931. /* Compare the data. */
  2932. offsets = rec_get_offsets(
  2933. rec, index, offsets, is_leaf,
  2934. ULINT_UNDEFINED, &heap);
  2935. if (memcmp(rec - rec_offs_extra_size(offsets),
  2936. trec - rec_offs_extra_size(offsets),
  2937. rec_offs_size(offsets))) {
  2938. page_zip_fail(
  2939. ("page_zip_validate:"
  2940. " record content: 0x%02x",
  2941. (unsigned) page_offset(rec)));
  2942. valid = FALSE;
  2943. break;
  2944. }
  2945. }
  2946. rec = page_rec_get_next_low(rec, TRUE);
  2947. trec = page_rec_get_next_low(trec, TRUE);
  2948. } while (rec || trec);
  2949. if (heap) {
  2950. mem_heap_free(heap);
  2951. }
  2952. }
  2953. func_exit:
  2954. if (!valid) {
  2955. page_zip_hexdump(page_zip, sizeof *page_zip);
  2956. page_zip_hexdump(page_zip->data, page_zip_get_size(page_zip));
  2957. page_zip_hexdump(page, srv_page_size);
  2958. page_zip_hexdump(temp_page, srv_page_size);
  2959. }
  2960. aligned_free(temp_page);
  2961. return(valid);
  2962. }
  2963. /**********************************************************************//**
  2964. Check that the compressed and decompressed pages match.
  2965. @return TRUE if valid, FALSE if not */
  2966. ibool
  2967. page_zip_validate(
  2968. /*==============*/
  2969. const page_zip_des_t* page_zip,/*!< in: compressed page */
  2970. const page_t* page, /*!< in: uncompressed page */
  2971. const dict_index_t* index) /*!< in: index of the page, if known */
  2972. {
  2973. return(page_zip_validate_low(page_zip, page, index,
  2974. recv_recovery_is_on()));
  2975. }
  2976. #endif /* UNIV_ZIP_DEBUG */
  2977. #ifdef UNIV_DEBUG
  2978. /**********************************************************************//**
  2979. Assert that the compressed and decompressed page headers match.
  2980. @return TRUE */
  2981. static
  2982. ibool
  2983. page_zip_header_cmp(
  2984. /*================*/
  2985. const page_zip_des_t* page_zip,/*!< in: compressed page */
  2986. const byte* page) /*!< in: uncompressed page */
  2987. {
  2988. ut_ad(!memcmp(page_zip->data + FIL_PAGE_PREV, page + FIL_PAGE_PREV,
  2989. FIL_PAGE_LSN - FIL_PAGE_PREV));
  2990. ut_ad(!memcmp(page_zip->data + FIL_PAGE_TYPE, page + FIL_PAGE_TYPE,
  2991. 2));
  2992. ut_ad(!memcmp(page_zip->data + FIL_PAGE_DATA, page + FIL_PAGE_DATA,
  2993. PAGE_DATA - FIL_PAGE_DATA));
  2994. return(TRUE);
  2995. }
  2996. #endif /* UNIV_DEBUG */
  2997. /**********************************************************************//**
  2998. Write a record on the compressed page that contains externally stored
  2999. columns. The data must already have been written to the uncompressed page.
  3000. @return end of modification log */
  3001. static
  3002. byte*
  3003. page_zip_write_rec_ext(
  3004. /*===================*/
  3005. page_zip_des_t* page_zip, /*!< in/out: compressed page */
  3006. const page_t* page, /*!< in: page containing rec */
  3007. const byte* rec, /*!< in: record being written */
  3008. dict_index_t* index, /*!< in: record descriptor */
  3009. const offset_t* offsets, /*!< in: rec_get_offsets(rec, index) */
  3010. ulint create, /*!< in: nonzero=insert, zero=update */
  3011. ulint trx_id_col, /*!< in: position of DB_TRX_ID */
  3012. ulint heap_no, /*!< in: heap number of rec */
  3013. byte* storage, /*!< in: end of dense page directory */
  3014. byte* data) /*!< in: end of modification log */
  3015. {
  3016. const byte* start = rec;
  3017. ulint i;
  3018. ulint len;
  3019. byte* externs = storage;
  3020. ulint n_ext = rec_offs_n_extern(offsets);
  3021. ut_ad(rec_offs_validate(rec, index, offsets));
  3022. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  3023. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  3024. rec_offs_extra_size(offsets));
  3025. externs -= (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)
  3026. * (page_dir_get_n_heap(page) - PAGE_HEAP_NO_USER_LOW);
  3027. /* Note that this will not take into account
  3028. the BLOB columns of rec if create==TRUE. */
  3029. ut_ad(data + rec_offs_data_size(offsets)
  3030. - (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)
  3031. - n_ext * BTR_EXTERN_FIELD_REF_SIZE
  3032. < externs - BTR_EXTERN_FIELD_REF_SIZE * page_zip->n_blobs);
  3033. {
  3034. ulint blob_no = page_zip_get_n_prev_extern(
  3035. page_zip, rec, index);
  3036. byte* ext_end = externs - page_zip->n_blobs
  3037. * BTR_EXTERN_FIELD_REF_SIZE;
  3038. ut_ad(blob_no <= page_zip->n_blobs);
  3039. externs -= blob_no * BTR_EXTERN_FIELD_REF_SIZE;
  3040. if (create) {
  3041. page_zip->n_blobs += static_cast<unsigned>(n_ext);
  3042. ASSERT_ZERO_BLOB(ext_end - n_ext
  3043. * BTR_EXTERN_FIELD_REF_SIZE);
  3044. memmove(ext_end - n_ext
  3045. * BTR_EXTERN_FIELD_REF_SIZE,
  3046. ext_end,
  3047. ulint(externs - ext_end));
  3048. }
  3049. ut_a(blob_no + n_ext <= page_zip->n_blobs);
  3050. }
  3051. for (i = 0; i < rec_offs_n_fields(offsets); i++) {
  3052. const byte* src;
  3053. if (UNIV_UNLIKELY(i == trx_id_col)) {
  3054. ut_ad(!rec_offs_nth_extern(offsets,
  3055. i));
  3056. ut_ad(!rec_offs_nth_extern(offsets,
  3057. i + 1));
  3058. /* Locate trx_id and roll_ptr. */
  3059. src = rec_get_nth_field(rec, offsets,
  3060. i, &len);
  3061. ut_ad(len == DATA_TRX_ID_LEN);
  3062. ut_ad(src + DATA_TRX_ID_LEN
  3063. == rec_get_nth_field(
  3064. rec, offsets,
  3065. i + 1, &len));
  3066. ut_ad(len == DATA_ROLL_PTR_LEN);
  3067. /* Log the preceding fields. */
  3068. ASSERT_ZERO(data, src - start);
  3069. memcpy(data, start, ulint(src - start));
  3070. data += src - start;
  3071. start = src + (DATA_TRX_ID_LEN
  3072. + DATA_ROLL_PTR_LEN);
  3073. /* Store trx_id and roll_ptr. */
  3074. memcpy(storage - (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)
  3075. * (heap_no - 1),
  3076. src, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3077. i++; /* skip also roll_ptr */
  3078. } else if (rec_offs_nth_extern(offsets, i)) {
  3079. src = rec_get_nth_field(rec, offsets,
  3080. i, &len);
  3081. ut_ad(dict_index_is_clust(index));
  3082. ut_ad(len
  3083. >= BTR_EXTERN_FIELD_REF_SIZE);
  3084. src += len - BTR_EXTERN_FIELD_REF_SIZE;
  3085. ASSERT_ZERO(data, src - start);
  3086. memcpy(data, start, ulint(src - start));
  3087. data += src - start;
  3088. start = src + BTR_EXTERN_FIELD_REF_SIZE;
  3089. /* Store the BLOB pointer. */
  3090. externs -= BTR_EXTERN_FIELD_REF_SIZE;
  3091. ut_ad(data < externs);
  3092. memcpy(externs, src, BTR_EXTERN_FIELD_REF_SIZE);
  3093. }
  3094. }
  3095. /* Log the last bytes of the record. */
  3096. len = rec_offs_data_size(offsets) - ulint(start - rec);
  3097. ASSERT_ZERO(data, len);
  3098. memcpy(data, start, len);
  3099. data += len;
  3100. return(data);
  3101. }
  3102. /**********************************************************************//**
  3103. Write an entire record on the compressed page. The data must already
  3104. have been written to the uncompressed page. */
  3105. void
  3106. page_zip_write_rec(
  3107. /*===============*/
  3108. page_zip_des_t* page_zip,/*!< in/out: compressed page */
  3109. const byte* rec, /*!< in: record being written */
  3110. dict_index_t* index, /*!< in: the index the record belongs to */
  3111. const offset_t* offsets,/*!< in: rec_get_offsets(rec, index) */
  3112. ulint create) /*!< in: nonzero=insert, zero=update */
  3113. {
  3114. const page_t* page;
  3115. byte* data;
  3116. byte* storage;
  3117. ulint heap_no;
  3118. byte* slot;
  3119. ut_ad(page_zip_simple_validate(page_zip));
  3120. ut_ad(page_zip_get_size(page_zip)
  3121. > PAGE_DATA + page_zip_dir_size(page_zip));
  3122. ut_ad(rec_offs_comp(offsets));
  3123. ut_ad(rec_offs_validate(rec, index, offsets));
  3124. ut_ad(page_zip->m_start >= PAGE_DATA);
  3125. page = page_align(rec);
  3126. ut_ad(page_zip_header_cmp(page_zip, page));
  3127. ut_ad(page_simple_validate_new((page_t*) page));
  3128. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3129. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  3130. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  3131. rec_offs_extra_size(offsets));
  3132. slot = page_zip_dir_find(page_zip, page_offset(rec));
  3133. ut_a(slot);
  3134. /* Copy the delete mark. */
  3135. if (rec_get_deleted_flag(rec, TRUE)) {
  3136. /* In delete-marked records, DB_TRX_ID must
  3137. always refer to an existing undo log record.
  3138. On non-leaf pages, the delete-mark flag is garbage. */
  3139. ut_ad(!index->is_primary() || !page_is_leaf(page)
  3140. || row_get_rec_trx_id(rec, index, offsets));
  3141. *slot |= PAGE_ZIP_DIR_SLOT_DEL >> 8;
  3142. } else {
  3143. *slot &= ~(PAGE_ZIP_DIR_SLOT_DEL >> 8);
  3144. }
  3145. ut_ad(rec_get_start((rec_t*) rec, offsets) >= page + PAGE_ZIP_START);
  3146. ut_ad(rec_get_end((rec_t*) rec, offsets) <= page + srv_page_size
  3147. - PAGE_DIR - PAGE_DIR_SLOT_SIZE
  3148. * page_dir_get_n_slots(page));
  3149. heap_no = rec_get_heap_no_new(rec);
  3150. ut_ad(heap_no >= PAGE_HEAP_NO_USER_LOW); /* not infimum or supremum */
  3151. ut_ad(heap_no < page_dir_get_n_heap(page));
  3152. /* Append to the modification log. */
  3153. data = page_zip->data + page_zip->m_end;
  3154. ut_ad(!*data);
  3155. /* Identify the record by writing its heap number - 1.
  3156. 0 is reserved to indicate the end of the modification log. */
  3157. if (UNIV_UNLIKELY(heap_no - 1 >= 64)) {
  3158. *data++ = (byte) (0x80 | (heap_no - 1) >> 7);
  3159. ut_ad(!*data);
  3160. }
  3161. *data++ = (byte) ((heap_no - 1) << 1);
  3162. ut_ad(!*data);
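/* The single-byte form must keep the 0x80 flag clear, which is why
it can only encode heap_no - 1 < 64 (the value is shifted left by
one). Larger heap numbers use two bytes: 0x80 | (heap_no - 1) >> 7,
followed by the low bits of (heap_no - 1) << 1. A zero byte marks
the end of the modification log. */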
  3163. {
  3164. const byte* start = rec - rec_offs_extra_size(offsets);
  3165. const byte* b = rec - REC_N_NEW_EXTRA_BYTES;
  3166. /* Write the extra bytes backwards, so that
  3167. rec_offs_extra_size() can be easily computed in
  3168. page_zip_apply_log() by invoking
  3169. rec_get_offsets_reverse(). */
  3170. while (b != start) {
  3171. *data++ = *--b;
  3172. ut_ad(!*data);
  3173. }
  3174. }
  3175. /* Write the data bytes. Store the uncompressed bytes separately. */
  3176. storage = page_zip_dir_start(page_zip);
  3177. if (page_is_leaf(page)) {
  3178. ulint len;
  3179. if (dict_index_is_clust(index)) {
  3180. /* Store separately trx_id, roll_ptr and
  3181. the BTR_EXTERN_FIELD_REF of each BLOB column. */
  3182. if (rec_offs_any_extern(offsets)) {
  3183. data = page_zip_write_rec_ext(
  3184. page_zip, page,
  3185. rec, index, offsets, create,
  3186. index->db_trx_id(), heap_no,
  3187. storage, data);
  3188. } else {
  3189. /* Locate trx_id and roll_ptr. */
  3190. const byte* src
  3191. = rec_get_nth_field(rec, offsets,
  3192. index->db_trx_id(),
  3193. &len);
  3194. ut_ad(len == DATA_TRX_ID_LEN);
  3195. ut_ad(src + DATA_TRX_ID_LEN
  3196. == rec_get_nth_field(
  3197. rec, offsets,
  3198. index->db_roll_ptr(), &len));
  3199. ut_ad(len == DATA_ROLL_PTR_LEN);
  3200. /* Log the preceding fields. */
  3201. ASSERT_ZERO(data, src - rec);
  3202. memcpy(data, rec, ulint(src - rec));
  3203. data += src - rec;
  3204. /* Store trx_id and roll_ptr. */
  3205. memcpy(storage
  3206. - (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)
  3207. * (heap_no - 1),
  3208. src,
  3209. DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3210. src += DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN;
  3211. /* Log the last bytes of the record. */
  3212. len = rec_offs_data_size(offsets)
  3213. - ulint(src - rec);
  3214. ASSERT_ZERO(data, len);
  3215. memcpy(data, src, len);
  3216. data += len;
  3217. }
  3218. } else {
  3219. /* Leaf page of a secondary index:
  3220. no externally stored columns */
  3221. ut_ad(!rec_offs_any_extern(offsets));
  3222. /* Log the entire record. */
  3223. len = rec_offs_data_size(offsets);
  3224. ASSERT_ZERO(data, len);
  3225. memcpy(data, rec, len);
  3226. data += len;
  3227. }
  3228. } else {
  3229. /* This is a node pointer page. */
  3230. ulint len;
  3231. /* Non-leaf nodes should not have any externally
  3232. stored columns. */
  3233. ut_ad(!rec_offs_any_extern(offsets));
  3234. /* Copy the data bytes, except node_ptr. */
  3235. len = rec_offs_data_size(offsets) - REC_NODE_PTR_SIZE;
  3236. ut_ad(data + len < storage - REC_NODE_PTR_SIZE
  3237. * (page_dir_get_n_heap(page) - PAGE_HEAP_NO_USER_LOW));
  3238. ASSERT_ZERO(data, len);
  3239. memcpy(data, rec, len);
  3240. data += len;
  3241. /* Copy the node pointer to the uncompressed area. */
  3242. memcpy(storage - REC_NODE_PTR_SIZE
  3243. * (heap_no - 1),
  3244. rec + len,
  3245. REC_NODE_PTR_SIZE);
  3246. }
  3247. ut_a(!*data);
  3248. ut_ad((ulint) (data - page_zip->data) < page_zip_get_size(page_zip));
  3249. page_zip->m_end = unsigned(data - page_zip->data);
  3250. page_zip->m_nonempty = TRUE;
  3251. #ifdef UNIV_ZIP_DEBUG
  3252. ut_a(page_zip_validate(page_zip, page_align(rec), index));
  3253. #endif /* UNIV_ZIP_DEBUG */
  3254. }
  3255. /***********************************************************//**
  3256. Parses a log record of writing a BLOB pointer of a record.
  3257. @return end of log record or NULL */
  3258. const byte*
  3259. page_zip_parse_write_blob_ptr(
  3260. /*==========================*/
  3261. const byte* ptr, /*!< in: redo log buffer */
  3262. const byte* end_ptr,/*!< in: redo log buffer end */
  3263. page_t* page, /*!< in/out: uncompressed page */
  3264. page_zip_des_t* page_zip)/*!< in/out: compressed page */
  3265. {
  3266. ulint offset;
  3267. ulint z_offset;
  3268. ut_ad(ptr != NULL);
  3269. ut_ad(end_ptr != NULL);
  3270. ut_ad(!page == !page_zip);
  3271. if (UNIV_UNLIKELY
  3272. (end_ptr < ptr + (2 + 2 + BTR_EXTERN_FIELD_REF_SIZE))) {
  3273. return(NULL);
  3274. }
  3275. offset = mach_read_from_2(ptr);
  3276. z_offset = mach_read_from_2(ptr + 2);
  3277. if (offset < PAGE_ZIP_START
  3278. || offset >= srv_page_size
  3279. || z_offset >= srv_page_size) {
  3280. corrupt:
  3281. recv_sys.found_corrupt_log = TRUE;
  3282. return(NULL);
  3283. }
  3284. if (page) {
  3285. if (!page_zip || !page_is_leaf(page)) {
  3286. goto corrupt;
  3287. }
  3288. #ifdef UNIV_ZIP_DEBUG
  3289. ut_a(page_zip_validate(page_zip, page, NULL));
  3290. #endif /* UNIV_ZIP_DEBUG */
  3291. memcpy(page + offset,
  3292. ptr + 4, BTR_EXTERN_FIELD_REF_SIZE);
  3293. memcpy(page_zip->data + z_offset,
  3294. ptr + 4, BTR_EXTERN_FIELD_REF_SIZE);
  3295. #ifdef UNIV_ZIP_DEBUG
  3296. ut_a(page_zip_validate(page_zip, page, NULL));
  3297. #endif /* UNIV_ZIP_DEBUG */
  3298. }
  3299. return(ptr + (2 + 2 + BTR_EXTERN_FIELD_REF_SIZE));
  3300. }
  3301. /**********************************************************************//**
  3302. Write a BLOB pointer of a record on the leaf page of a clustered index.
  3303. The information must already have been updated on the uncompressed page. */
  3304. void
  3305. page_zip_write_blob_ptr(
  3306. /*====================*/
  3307. buf_block_t* block, /*!< in/out: ROW_FORMAT=COMPRESSED page */
  3308. const byte* rec, /*!< in/out: record whose data is being
  3309. written */
  3310. dict_index_t* index, /*!< in: index of the page */
  3311. const offset_t* offsets,/*!< in: rec_get_offsets(rec, index) */
  3312. ulint n, /*!< in: column index */
  3313. mtr_t* mtr) /*!< in/out: mini-transaction */
  3314. {
  3315. const byte* field;
  3316. byte* externs;
  3317. const page_t* const page = block->frame;
  3318. page_zip_des_t* const page_zip = &block->page.zip;
  3319. ulint blob_no;
  3320. ulint len;
  3321. ut_ad(page_align(rec) == page);
  3322. ut_ad(index != NULL);
  3323. ut_ad(offsets != NULL);
  3324. ut_ad(page_simple_validate_new((page_t*) page));
  3325. ut_ad(page_zip_simple_validate(page_zip));
  3326. ut_ad(page_zip_get_size(page_zip)
  3327. > PAGE_DATA + page_zip_dir_size(page_zip));
  3328. ut_ad(rec_offs_comp(offsets));
  3329. ut_ad(rec_offs_validate(rec, NULL, offsets));
  3330. ut_ad(rec_offs_any_extern(offsets));
  3331. ut_ad(rec_offs_nth_extern(offsets, n));
  3332. ut_ad(page_zip->m_start >= PAGE_DATA);
  3333. ut_ad(page_zip_header_cmp(page_zip, page));
  3334. ut_ad(page_is_leaf(page));
  3335. ut_ad(dict_index_is_clust(index));
  3336. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3337. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  3338. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  3339. rec_offs_extra_size(offsets));
  3340. blob_no = page_zip_get_n_prev_extern(page_zip, rec, index)
  3341. + rec_get_n_extern_new(rec, index, n);
  3342. ut_a(blob_no < page_zip->n_blobs);
  3343. externs = page_zip->data + page_zip_get_size(page_zip)
  3344. - (page_dir_get_n_heap(page) - PAGE_HEAP_NO_USER_LOW)
  3345. * PAGE_ZIP_CLUST_LEAF_SLOT_SIZE;
  3346. field = rec_get_nth_field(rec, offsets, n, &len);
  3347. externs -= (blob_no + 1) * BTR_EXTERN_FIELD_REF_SIZE;
  3348. field += len - BTR_EXTERN_FIELD_REF_SIZE;
  3349. memcpy(externs, field, BTR_EXTERN_FIELD_REF_SIZE);
  3350. #ifdef UNIV_ZIP_DEBUG
  3351. ut_a(page_zip_validate(page_zip, page, index));
  3352. #endif /* UNIV_ZIP_DEBUG */
  3353. if (byte* log_ptr = mlog_open(mtr, 11 + 2 + 2 + FIELD_REF_SIZE)) {
  3354. log_ptr = mlog_write_initial_log_record_low(
  3355. MLOG_ZIP_WRITE_BLOB_PTR,
  3356. block->page.id.space(), block->page.id.page_no(),
  3357. log_ptr, mtr);
  3358. mach_write_to_2(log_ptr, page_offset(field));
  3359. log_ptr += 2;
  3360. mach_write_to_2(log_ptr, ulint(externs - page_zip->data));
  3361. log_ptr += 2;
  3362. memcpy(log_ptr, externs, BTR_EXTERN_FIELD_REF_SIZE);
  3363. log_ptr += BTR_EXTERN_FIELD_REF_SIZE;
  3364. mlog_close(mtr, log_ptr);
  3365. }
  3366. }
  3367. /***********************************************************//**
  3368. Parses a log record of writing the node pointer of a record.
  3369. @return end of log record or NULL */
  3370. const byte*
  3371. page_zip_parse_write_node_ptr(
  3372. /*==========================*/
  3373. const byte* ptr, /*!< in: redo log buffer */
  3374. const byte* end_ptr,/*!< in: redo log buffer end */
  3375. page_t* page, /*!< in/out: uncompressed page */
  3376. page_zip_des_t* page_zip)/*!< in/out: compressed page */
  3377. {
  3378. ulint offset;
  3379. ulint z_offset;
  3380. ut_ad(ptr != NULL);
3381. ut_ad(end_ptr != NULL);
  3382. ut_ad(!page == !page_zip);
  3383. if (UNIV_UNLIKELY(end_ptr < ptr + (2 + 2 + REC_NODE_PTR_SIZE))) {
  3384. return(NULL);
  3385. }
  3386. offset = mach_read_from_2(ptr);
  3387. z_offset = mach_read_from_2(ptr + 2);
  3388. if (offset < PAGE_ZIP_START
  3389. || offset >= srv_page_size
  3390. || z_offset >= srv_page_size) {
  3391. corrupt:
  3392. recv_sys.found_corrupt_log = TRUE;
  3393. return(NULL);
  3394. }
  3395. if (page) {
  3396. byte* storage_end;
  3397. byte* field;
  3398. byte* storage;
  3399. ulint heap_no;
  3400. if (!page_zip || page_is_leaf(page)) {
  3401. goto corrupt;
  3402. }
  3403. #ifdef UNIV_ZIP_DEBUG
  3404. ut_a(page_zip_validate(page_zip, page, NULL));
  3405. #endif /* UNIV_ZIP_DEBUG */
  3406. field = page + offset;
  3407. storage = page_zip->data + z_offset;
  3408. storage_end = page_zip_dir_start(page_zip);
  3409. heap_no = 1 + ulint(storage_end - storage) / REC_NODE_PTR_SIZE;
  3410. if (UNIV_UNLIKELY((storage_end - storage) % REC_NODE_PTR_SIZE)
  3411. || UNIV_UNLIKELY(heap_no < PAGE_HEAP_NO_USER_LOW)
  3412. || UNIV_UNLIKELY(heap_no >= page_dir_get_n_heap(page))) {
  3413. goto corrupt;
  3414. }
  3415. memcpy(field, ptr + 4, REC_NODE_PTR_SIZE);
  3416. memcpy(storage, ptr + 4, REC_NODE_PTR_SIZE);
  3417. #ifdef UNIV_ZIP_DEBUG
  3418. ut_a(page_zip_validate(page_zip, page, NULL));
  3419. #endif /* UNIV_ZIP_DEBUG */
  3420. }
  3421. return(ptr + (2 + 2 + REC_NODE_PTR_SIZE));
  3422. }
  3423. /**********************************************************************//**
  3424. Write the node pointer of a record on a non-leaf compressed page. */
  3425. void
  3426. page_zip_write_node_ptr(
  3427. /*====================*/
  3428. buf_block_t* block, /*!< in/out: compressed page */
  3429. byte* rec, /*!< in/out: record */
  3430. ulint size, /*!< in: data size of rec */
  3431. ulint ptr, /*!< in: node pointer */
  3432. mtr_t* mtr) /*!< in/out: mini-transaction */
  3433. {
  3434. byte* field;
  3435. byte* storage;
  3436. page_zip_des_t* const page_zip = &block->page.zip;
  3437. ut_d(const page_t* const page = block->frame);
  3438. ut_ad(page_simple_validate_new(page));
  3439. ut_ad(page_zip_simple_validate(page_zip));
  3440. ut_ad(page_zip_get_size(page_zip)
  3441. > PAGE_DATA + page_zip_dir_size(page_zip));
  3442. ut_ad(page_rec_is_comp(rec));
  3443. ut_ad(page_zip->m_start >= PAGE_DATA);
  3444. ut_ad(page_zip_header_cmp(page_zip, page));
  3445. ut_ad(!page_is_leaf(page));
  3446. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3447. UNIV_MEM_ASSERT_RW(rec, size);
  3448. storage = page_zip_dir_start(page_zip)
  3449. - (rec_get_heap_no_new(rec) - 1) * REC_NODE_PTR_SIZE;
  3450. field = rec + size - REC_NODE_PTR_SIZE;
  3451. #if defined UNIV_DEBUG || defined UNIV_ZIP_DEBUG
  3452. ut_a(!memcmp(storage, field, REC_NODE_PTR_SIZE));
  3453. #endif /* UNIV_DEBUG || UNIV_ZIP_DEBUG */
  3454. compile_time_assert(REC_NODE_PTR_SIZE == 4);
  3455. mach_write_to_4(field, ptr);
  3456. memcpy(storage, field, REC_NODE_PTR_SIZE);
  3457. if (byte* log_ptr = mlog_open(mtr, 11 + 2 + 2 + REC_NODE_PTR_SIZE)) {
  3458. log_ptr = mlog_write_initial_log_record_low(
  3459. MLOG_ZIP_WRITE_NODE_PTR,
  3460. block->page.id.space(), block->page.id.page_no(),
  3461. log_ptr, mtr);
  3462. mach_write_to_2(log_ptr, page_offset(field));
  3463. log_ptr += 2;
  3464. mach_write_to_2(log_ptr, ulint(storage - page_zip->data));
  3465. log_ptr += 2;
  3466. memcpy(log_ptr, field, REC_NODE_PTR_SIZE);
  3467. log_ptr += REC_NODE_PTR_SIZE;
  3468. mlog_close(mtr, log_ptr);
  3469. }
  3470. }
  3471. /** Write the DB_TRX_ID,DB_ROLL_PTR into a clustered index leaf page record.
  3472. @param[in,out] page_zip compressed page
  3473. @param[in,out] rec record
  3474. @param[in] offsets rec_get_offsets(rec, index)
3475. @param[in] trx_id_col column number of DB_TRX_ID (number of PK fields)
  3476. @param[in] trx_id DB_TRX_ID value (transaction identifier)
  3477. @param[in] roll_ptr DB_ROLL_PTR value (undo log pointer)
  3478. @param[in,out] mtr mini-transaction, or NULL to skip logging */
  3479. void
  3480. page_zip_write_trx_id_and_roll_ptr(
  3481. page_zip_des_t* page_zip,
  3482. byte* rec,
  3483. const offset_t* offsets,
  3484. ulint trx_id_col,
  3485. trx_id_t trx_id,
  3486. roll_ptr_t roll_ptr,
  3487. mtr_t* mtr)
  3488. {
  3489. byte* field;
  3490. byte* storage;
  3491. #ifdef UNIV_DEBUG
  3492. page_t* page = page_align(rec);
  3493. #endif /* UNIV_DEBUG */
  3494. ulint len;
  3495. ut_ad(page_simple_validate_new(page));
  3496. ut_ad(page_zip_simple_validate(page_zip));
  3497. ut_ad(page_zip_get_size(page_zip)
  3498. > PAGE_DATA + page_zip_dir_size(page_zip));
  3499. ut_ad(rec_offs_validate(rec, NULL, offsets));
  3500. ut_ad(rec_offs_comp(offsets));
  3501. ut_ad(page_zip->m_start >= PAGE_DATA);
  3502. ut_ad(page_zip_header_cmp(page_zip, page));
  3503. ut_ad(page_is_leaf(page));
  3504. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3505. storage = page_zip_dir_start(page_zip)
  3506. - (rec_get_heap_no_new(rec) - 1)
  3507. * (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3508. compile_time_assert(DATA_TRX_ID + 1 == DATA_ROLL_PTR);
  3509. field = rec_get_nth_field(rec, offsets, trx_id_col, &len);
  3510. ut_ad(len == DATA_TRX_ID_LEN);
  3511. ut_ad(field + DATA_TRX_ID_LEN
  3512. == rec_get_nth_field(rec, offsets, trx_id_col + 1, &len));
  3513. ut_ad(len == DATA_ROLL_PTR_LEN);
  3514. #if defined UNIV_DEBUG || defined UNIV_ZIP_DEBUG
  3515. ut_a(!memcmp(storage, field, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN));
  3516. #endif /* UNIV_DEBUG || UNIV_ZIP_DEBUG */
  3517. compile_time_assert(DATA_TRX_ID_LEN == 6);
  3518. mach_write_to_6(field, trx_id);
  3519. compile_time_assert(DATA_ROLL_PTR_LEN == 7);
  3520. mach_write_to_7(field + DATA_TRX_ID_LEN, roll_ptr);
  3521. memcpy(storage, field, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3522. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  3523. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  3524. rec_offs_extra_size(offsets));
  3525. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3526. if (mtr) {
  3527. byte* log_ptr = mlog_open(
  3528. mtr, 11 + 2 + 2 + DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3529. if (UNIV_UNLIKELY(!log_ptr)) {
  3530. return;
  3531. }
  3532. log_ptr = mlog_write_initial_log_record_fast(
  3533. (byte*) field, MLOG_ZIP_WRITE_TRX_ID, log_ptr, mtr);
  3534. mach_write_to_2(log_ptr, page_offset(field));
  3535. log_ptr += 2;
  3536. mach_write_to_2(log_ptr, ulint(storage - page_zip->data));
  3537. log_ptr += 2;
  3538. memcpy(log_ptr, field, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3539. log_ptr += DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN;
  3540. mlog_close(mtr, log_ptr);
  3541. }
  3542. }
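/* Analogously to node pointers, the DB_TRX_ID and DB_ROLL_PTR columns of a
record with heap number h are mirrored in the uncompressed storage area at
page_zip_dir_start(page_zip) - (h - 1) * (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN).
The MLOG_ZIP_WRITE_TRX_ID record, written only when mtr != NULL, logs the
2-byte page offset of the columns, the 2-byte offset of their copy within
page_zip->data, and the DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN bytes of data. */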
  3543. /** Parse a MLOG_ZIP_WRITE_TRX_ID record.
  3544. @param[in] ptr redo log buffer
  3545. @param[in] end_ptr end of redo log buffer
  3546. @param[in,out] page uncompressed page
  3547. @param[in,out] page_zip compressed page
  3548. @return end of log record
  3549. @retval NULL if the log record is incomplete */
  3550. const byte*
  3551. page_zip_parse_write_trx_id(
  3552. const byte* ptr,
  3553. const byte* end_ptr,
  3554. page_t* page,
  3555. page_zip_des_t* page_zip)
  3556. {
  3557. const byte* const end = 2 + 2 + DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN
  3558. + ptr;
  3559. if (UNIV_UNLIKELY(end_ptr < end)) {
  3560. return(NULL);
  3561. }
  3562. uint offset = mach_read_from_2(ptr);
  3563. uint z_offset = mach_read_from_2(ptr + 2);
  3564. if (offset < PAGE_ZIP_START
  3565. || offset >= srv_page_size
  3566. || z_offset >= srv_page_size) {
  3567. corrupt:
  3568. recv_sys.found_corrupt_log = TRUE;
  3569. return(NULL);
  3570. }
  3571. if (page) {
  3572. if (!page_zip || !page_is_leaf(page)) {
  3573. goto corrupt;
  3574. }
  3575. #ifdef UNIV_ZIP_DEBUG
  3576. ut_a(page_zip_validate(page_zip, page, NULL));
  3577. #endif /* UNIV_ZIP_DEBUG */
  3578. byte* field = page + offset;
  3579. byte* storage = page_zip->data + z_offset;
  3580. if (storage >= page_zip_dir_start(page_zip)) {
  3581. goto corrupt;
  3582. }
  3583. memcpy(field, ptr + 4, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3584. memcpy(storage, ptr + 4, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3585. #ifdef UNIV_ZIP_DEBUG
  3586. ut_a(page_zip_validate(page_zip, page, NULL));
  3587. #endif /* UNIV_ZIP_DEBUG */
  3588. }
  3589. return end;
  3590. }
  3591. /**********************************************************************//**
  3592. Clear an area on the uncompressed and compressed page.
  3593. Do not clear the data payload, as that would grow the modification log. */
  3594. static
  3595. void
  3596. page_zip_clear_rec(
  3597. /*===============*/
  3598. page_zip_des_t* page_zip, /*!< in/out: compressed page */
  3599. byte* rec, /*!< in: record to clear */
  3600. const dict_index_t* index, /*!< in: index of rec */
  3601. const offset_t* offsets) /*!< in: rec_get_offsets(rec, index) */
  3602. {
  3603. ulint heap_no;
  3604. page_t* page = page_align(rec);
  3605. byte* storage;
  3606. byte* field;
  3607. ulint len;
  3608. /* page_zip_validate() would fail here if a record
  3609. containing externally stored columns is being deleted. */
  3610. ut_ad(rec_offs_validate(rec, index, offsets));
  3611. ut_ad(!page_zip_dir_find(page_zip, page_offset(rec)));
  3612. ut_ad(page_zip_dir_find_free(page_zip, page_offset(rec)));
  3613. ut_ad(page_zip_header_cmp(page_zip, page));
  3614. heap_no = rec_get_heap_no_new(rec);
  3615. ut_ad(heap_no >= PAGE_HEAP_NO_USER_LOW);
  3616. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3617. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  3618. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  3619. rec_offs_extra_size(offsets));
  3620. if (!page_is_leaf(page)) {
  3621. /* Clear node_ptr. On the compressed page,
  3622. there is an array of node_ptr immediately before the
  3623. dense page directory, at the very end of the page. */
  3624. storage = page_zip_dir_start(page_zip);
  3625. ut_ad(dict_index_get_n_unique_in_tree_nonleaf(index) ==
  3626. rec_offs_n_fields(offsets) - 1);
  3627. field = rec_get_nth_field(rec, offsets,
  3628. rec_offs_n_fields(offsets) - 1,
  3629. &len);
  3630. ut_ad(len == REC_NODE_PTR_SIZE);
  3631. ut_ad(!rec_offs_any_extern(offsets));
  3632. memset(field, 0, REC_NODE_PTR_SIZE);
  3633. memset(storage - (heap_no - 1) * REC_NODE_PTR_SIZE,
  3634. 0, REC_NODE_PTR_SIZE);
  3635. } else if (dict_index_is_clust(index)) {
  3636. /* Clear trx_id and roll_ptr. On the compressed page,
  3637. there is an array of these fields immediately before the
  3638. dense page directory, at the very end of the page. */
  3639. const ulint trx_id_pos
  3640. = dict_col_get_clust_pos(
  3641. dict_table_get_sys_col(
  3642. index->table, DATA_TRX_ID), index);
  3643. storage = page_zip_dir_start(page_zip);
  3644. field = rec_get_nth_field(rec, offsets, trx_id_pos, &len);
  3645. ut_ad(len == DATA_TRX_ID_LEN);
  3646. memset(field, 0, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3647. memset(storage - (heap_no - 1)
  3648. * (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN),
  3649. 0, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3650. if (rec_offs_any_extern(offsets)) {
  3651. ulint i;
  3652. for (i = rec_offs_n_fields(offsets); i--; ) {
  3653. /* Clear all BLOB pointers in order to make
  3654. page_zip_validate() pass. */
  3655. if (rec_offs_nth_extern(offsets, i)) {
  3656. field = rec_get_nth_field(
  3657. rec, offsets, i, &len);
  3658. ut_ad(len
  3659. == BTR_EXTERN_FIELD_REF_SIZE);
  3660. memset(field + len
  3661. - BTR_EXTERN_FIELD_REF_SIZE,
  3662. 0, BTR_EXTERN_FIELD_REF_SIZE);
  3663. }
  3664. }
  3665. }
  3666. } else {
  3667. ut_ad(!rec_offs_any_extern(offsets));
  3668. }
  3669. }
  3670. /**********************************************************************//**
  3671. Write the "deleted" flag of a record on a compressed page. The flag must
  3672. already have been written on the uncompressed page. */
  3673. void
  3674. page_zip_rec_set_deleted(
  3675. /*=====================*/
  3676. page_zip_des_t* page_zip,/*!< in/out: compressed page */
  3677. const byte* rec, /*!< in: record on the uncompressed page */
  3678. ulint flag) /*!< in: the deleted flag (nonzero=TRUE) */
  3679. {
  3680. byte* slot = page_zip_dir_find(page_zip, page_offset(rec));
  3681. ut_a(slot);
  3682. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3683. if (flag) {
  3684. *slot |= (PAGE_ZIP_DIR_SLOT_DEL >> 8);
  3685. } else {
  3686. *slot &= ~(PAGE_ZIP_DIR_SLOT_DEL >> 8);
  3687. }
  3688. #ifdef UNIV_ZIP_DEBUG
  3689. ut_a(page_zip_validate(page_zip, page_align(rec), NULL));
  3690. #endif /* UNIV_ZIP_DEBUG */
  3691. }
  3692. /**********************************************************************//**
  3693. Write the "owned" flag of a record on a compressed page. The n_owned field
  3694. must already have been written on the uncompressed page. */
  3695. void
  3696. page_zip_rec_set_owned(
  3697. /*===================*/
  3698. page_zip_des_t* page_zip,/*!< in/out: compressed page */
  3699. const byte* rec, /*!< in: record on the uncompressed page */
  3700. ulint flag) /*!< in: the owned flag (nonzero=TRUE) */
  3701. {
  3702. byte* slot = page_zip_dir_find(page_zip, page_offset(rec));
  3703. ut_a(slot);
  3704. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3705. if (flag) {
  3706. *slot |= (PAGE_ZIP_DIR_SLOT_OWNED >> 8);
  3707. } else {
  3708. *slot &= ~(PAGE_ZIP_DIR_SLOT_OWNED >> 8);
  3709. }
  3710. }
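/* Each dense directory slot is a 2-byte big-endian value holding the record
offset, with the "deleted" and "owned" flags kept in its high-order bits;
that is why the two functions above only touch the first byte of the slot
and shift PAGE_ZIP_DIR_SLOT_DEL and PAGE_ZIP_DIR_SLOT_OWNED right by 8. */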
  3711. /**********************************************************************//**
  3712. Insert a record to the dense page directory. */
  3713. void
  3714. page_zip_dir_insert(
  3715. /*================*/
  3716. page_cur_t* cursor, /*!< in/out: page cursor */
  3717. const byte* free_rec,/*!< in: record from which rec was
  3718. allocated, or NULL */
  3719. byte* rec) /*!< in: record to insert */
  3720. {
  3721. ut_ad(page_align(cursor->rec) == cursor->block->frame);
  3722. ut_ad(page_align(rec) == cursor->block->frame);
  3723. page_zip_des_t *const page_zip= &cursor->block->page.zip;
  3724. ulint n_dense;
  3725. byte* slot_rec;
  3726. byte* slot_free;
  3727. ut_ad(cursor->rec != rec);
  3728. ut_ad(page_rec_get_next_const(cursor->rec) == rec);
  3729. ut_ad(page_zip_simple_validate(page_zip));
  3730. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3731. if (page_rec_is_infimum(cursor->rec)) {
  3732. /* Use the first slot. */
  3733. slot_rec = page_zip->data + page_zip_get_size(page_zip);
  3734. } else {
  3735. byte* end = page_zip->data + page_zip_get_size(page_zip);
  3736. byte* start = end - page_zip_dir_user_size(page_zip);
  3737. if (UNIV_LIKELY(!free_rec)) {
  3738. /* PAGE_N_RECS was already incremented
  3739. in page_cur_insert_rec_zip(), but the
  3740. dense directory slot at that position
  3741. contains garbage. Skip it. */
  3742. start += PAGE_ZIP_DIR_SLOT_SIZE;
  3743. }
  3744. slot_rec = page_zip_dir_find_low(start, end,
  3745. page_offset(cursor->rec));
  3746. ut_a(slot_rec);
  3747. }
  3748. /* Read the old n_dense (n_heap may have been incremented). */
  3749. n_dense = page_dir_get_n_heap(page_zip->data)
  3750. - (PAGE_HEAP_NO_USER_LOW + 1U);
  3751. if (UNIV_LIKELY_NULL(free_rec)) {
  3752. /* The record was allocated from the free list.
  3753. Shift the dense directory only up to that slot.
  3754. Note that in this case, n_dense is actually
  3755. off by one, because page_cur_insert_rec_zip()
  3756. did not increment n_heap. */
  3757. ut_ad(rec_get_heap_no_new(rec) < n_dense + 1
  3758. + PAGE_HEAP_NO_USER_LOW);
  3759. ut_ad(rec >= free_rec);
  3760. slot_free = page_zip_dir_find(page_zip, page_offset(free_rec));
  3761. ut_ad(slot_free);
  3762. slot_free += PAGE_ZIP_DIR_SLOT_SIZE;
  3763. } else {
  3764. /* The record was allocated from the heap.
  3765. Shift the entire dense directory. */
  3766. ut_ad(rec_get_heap_no_new(rec) == n_dense
  3767. + PAGE_HEAP_NO_USER_LOW);
  3768. /* Shift to the end of the dense page directory. */
  3769. slot_free = page_zip->data + page_zip_get_size(page_zip)
  3770. - PAGE_ZIP_DIR_SLOT_SIZE * n_dense;
  3771. }
  3772. /* Shift the dense directory to allocate place for rec. */
  3773. memmove_aligned<2>(slot_free - PAGE_ZIP_DIR_SLOT_SIZE, slot_free,
  3774. ulint(slot_rec - slot_free));
  3775. /* Write the entry for the inserted record.
  3776. The "owned" and "deleted" flags must be zero. */
  3777. mach_write_to_2(slot_rec - PAGE_ZIP_DIR_SLOT_SIZE, page_offset(rec));
  3778. }
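/* The memmove above makes room for exactly one slot: every entry in
[slot_free, slot_rec) moves one PAGE_ZIP_DIR_SLOT_SIZE towards lower
addresses, and page_offset(rec) is written into the vacated slot just below
the slot of cursor->rec. When the record reuses space from the free list,
only the slots between the reused record's slot and the slot of cursor->rec
take part in the shift; otherwise the entire dense directory is shifted. */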
  3779. /**********************************************************************//**
  3780. Shift the dense page directory and the array of BLOB pointers
  3781. when a record is deleted. */
  3782. void
  3783. page_zip_dir_delete(
  3784. /*================*/
  3785. page_zip_des_t* page_zip, /*!< in/out: compressed page */
  3786. byte* rec, /*!< in: deleted record */
  3787. const dict_index_t* index, /*!< in: index of rec */
  3788. const offset_t* offsets, /*!< in: rec_get_offsets(rec) */
  3789. const byte* free) /*!< in: previous start of
  3790. the free list */
  3791. {
  3792. byte* slot_rec;
  3793. byte* slot_free;
  3794. ulint n_ext;
  3795. page_t* page = page_align(rec);
  3796. ut_ad(rec_offs_validate(rec, index, offsets));
  3797. ut_ad(rec_offs_comp(offsets));
  3798. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3799. UNIV_MEM_ASSERT_RW(rec, rec_offs_data_size(offsets));
  3800. UNIV_MEM_ASSERT_RW(rec - rec_offs_extra_size(offsets),
  3801. rec_offs_extra_size(offsets));
  3802. slot_rec = page_zip_dir_find(page_zip, page_offset(rec));
  3803. ut_a(slot_rec);
  3804. uint16_t n_recs = page_get_n_recs(page);
  3805. ut_ad(n_recs);
  3806. ut_ad(n_recs > 1 || page_get_page_no(page) == index->page);
  3807. /* This could not be done before page_zip_dir_find(). */
  3808. page_header_set_field(page, page_zip, PAGE_N_RECS,
  3809. n_recs - 1);
  3810. if (UNIV_UNLIKELY(!free)) {
  3811. /* Make the last slot the start of the free list. */
  3812. slot_free = page_zip->data + page_zip_get_size(page_zip)
  3813. - PAGE_ZIP_DIR_SLOT_SIZE
  3814. * (page_dir_get_n_heap(page_zip->data)
  3815. - PAGE_HEAP_NO_USER_LOW);
  3816. } else {
  3817. slot_free = page_zip_dir_find_free(page_zip,
  3818. page_offset(free));
  3819. ut_a(slot_free < slot_rec);
  3820. /* Grow the free list by one slot by moving the start. */
  3821. slot_free += PAGE_ZIP_DIR_SLOT_SIZE;
  3822. }
  3823. if (UNIV_LIKELY(slot_rec > slot_free)) {
  3824. memmove_aligned<2>(slot_free + PAGE_ZIP_DIR_SLOT_SIZE,
  3825. slot_free, ulint(slot_rec - slot_free));
  3826. }
  3827. /* Write the entry for the deleted record.
  3828. The "owned" and "deleted" flags will be cleared. */
  3829. mach_write_to_2(slot_free, page_offset(rec));
  3830. if (!page_is_leaf(page) || !dict_index_is_clust(index)) {
  3831. ut_ad(!rec_offs_any_extern(offsets));
  3832. goto skip_blobs;
  3833. }
  3834. n_ext = rec_offs_n_extern(offsets);
  3835. if (UNIV_UNLIKELY(n_ext != 0)) {
  3836. /* Shift and zero fill the array of BLOB pointers. */
  3837. ulint blob_no;
  3838. byte* externs;
  3839. byte* ext_end;
  3840. blob_no = page_zip_get_n_prev_extern(page_zip, rec, index);
  3841. ut_a(blob_no + n_ext <= page_zip->n_blobs);
  3842. externs = page_zip->data + page_zip_get_size(page_zip)
  3843. - (page_dir_get_n_heap(page) - PAGE_HEAP_NO_USER_LOW)
  3844. * PAGE_ZIP_CLUST_LEAF_SLOT_SIZE;
  3845. ext_end = externs - page_zip->n_blobs
  3846. * BTR_EXTERN_FIELD_REF_SIZE;
  3847. externs -= blob_no * BTR_EXTERN_FIELD_REF_SIZE;
  3848. page_zip->n_blobs -= static_cast<unsigned>(n_ext);
  3849. /* Shift and zero fill the array. */
  3850. memmove(ext_end + n_ext * BTR_EXTERN_FIELD_REF_SIZE, ext_end,
  3851. ulint(page_zip->n_blobs - blob_no)
  3852. * BTR_EXTERN_FIELD_REF_SIZE);
  3853. memset(ext_end, 0, n_ext * BTR_EXTERN_FIELD_REF_SIZE);
  3854. }
  3855. skip_blobs:
  3856. /* The compression algorithm expects info_bits and n_owned
  3857. to be 0 for deleted records. */
  3858. rec[-REC_N_NEW_EXTRA_BYTES] = 0; /* info_bits and n_owned */
  3859. page_zip_clear_rec(page_zip, rec, index, offsets);
  3860. }
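/* This undoes page_zip_dir_insert(): the in-use slots between the start of
the free-record slots and the deleted record's slot are shifted one
PAGE_ZIP_DIR_SLOT_SIZE towards higher addresses, the deleted record's offset
becomes the new start of the free-record slots, and on clustered index leaf
pages the record's entries in the BLOB pointer array are shifted out and
zero-filled. Finally the record's info_bits and n_owned are cleared and
page_zip_clear_rec() zeroes the mirrored columns in the trailer. */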
  3861. /**********************************************************************//**
  3862. Add a slot to the dense page directory. */
  3863. void
  3864. page_zip_dir_add_slot(
  3865. /*==================*/
  3866. page_zip_des_t* page_zip, /*!< in/out: compressed page */
  3867. ulint is_clustered) /*!< in: nonzero for clustered index,
  3868. zero for others */
  3869. {
  3870. ulint n_dense;
  3871. byte* dir;
  3872. byte* stored;
  3873. ut_ad(page_is_comp(page_zip->data));
  3874. UNIV_MEM_ASSERT_RW(page_zip->data, page_zip_get_size(page_zip));
  3875. /* Read the old n_dense (n_heap has already been incremented). */
  3876. n_dense = page_dir_get_n_heap(page_zip->data)
  3877. - (PAGE_HEAP_NO_USER_LOW + 1U);
  3878. dir = page_zip->data + page_zip_get_size(page_zip)
  3879. - PAGE_ZIP_DIR_SLOT_SIZE * n_dense;
  3880. if (!page_is_leaf(page_zip->data)) {
  3881. ut_ad(!page_zip->n_blobs);
  3882. stored = dir - n_dense * REC_NODE_PTR_SIZE;
  3883. } else if (is_clustered) {
  3884. /* Move the BLOB pointer array backwards to make space for the
  3885. roll_ptr and trx_id columns and the dense directory slot. */
  3886. byte* externs;
  3887. stored = dir - n_dense
  3888. * (DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN);
  3889. externs = stored
  3890. - page_zip->n_blobs * BTR_EXTERN_FIELD_REF_SIZE;
  3891. ASSERT_ZERO(externs - PAGE_ZIP_CLUST_LEAF_SLOT_SIZE,
  3892. PAGE_ZIP_CLUST_LEAF_SLOT_SIZE);
  3893. memmove(externs - PAGE_ZIP_CLUST_LEAF_SLOT_SIZE,
  3894. externs, ulint(stored - externs));
  3895. } else {
  3896. stored = dir
  3897. - page_zip->n_blobs * BTR_EXTERN_FIELD_REF_SIZE;
  3898. ASSERT_ZERO(stored - PAGE_ZIP_DIR_SLOT_SIZE,
  3899. static_cast<size_t>(PAGE_ZIP_DIR_SLOT_SIZE));
  3900. }
  3901. /* Move the uncompressed area backwards to make space
  3902. for one directory slot. */
  3903. memmove(stored - PAGE_ZIP_DIR_SLOT_SIZE, stored, ulint(dir - stored));
  3904. }
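/* Layout of the area handled above, from the end of page_zip->data
downwards: the dense page directory; then one node pointer per user record
(non-leaf pages) or one DB_TRX_ID/DB_ROLL_PTR pair per user record
(clustered index leaf pages); then the BLOB pointer array (clustered index
leaf pages only). Adding a heap slot shifts these areas towards lower
addresses to make room for one more directory slot and, where applicable,
one more storage entry. */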
  3905. /***********************************************************//**
  3906. Parses a log record of writing to the header of a page.
  3907. @return end of log record or NULL */
  3908. const byte*
  3909. page_zip_parse_write_header(
  3910. /*========================*/
  3911. const byte* ptr, /*!< in: redo log buffer */
  3912. const byte* end_ptr,/*!< in: redo log buffer end */
  3913. page_t* page, /*!< in/out: uncompressed page */
  3914. page_zip_des_t* page_zip)/*!< in/out: compressed page */
  3915. {
  3916. ulint offset;
  3917. ulint len;
  3918. ut_ad(ptr != NULL);
3919. ut_ad(end_ptr != NULL);
  3920. ut_ad(!page == !page_zip);
  3921. if (UNIV_UNLIKELY(end_ptr < ptr + (1 + 1))) {
  3922. return(NULL);
  3923. }
  3924. offset = (ulint) *ptr++;
  3925. len = (ulint) *ptr++;
  3926. if (len == 0 || offset + len >= PAGE_DATA) {
  3927. corrupt:
  3928. recv_sys.found_corrupt_log = TRUE;
  3929. return(NULL);
  3930. }
  3931. if (end_ptr < ptr + len) {
  3932. return(NULL);
  3933. }
  3934. if (page) {
  3935. if (!page_zip) {
  3936. goto corrupt;
  3937. }
  3938. #ifdef UNIV_ZIP_DEBUG
  3939. ut_a(page_zip_validate(page_zip, page, NULL));
  3940. #endif /* UNIV_ZIP_DEBUG */
  3941. memcpy(page + offset, ptr, len);
  3942. memcpy(page_zip->data + offset, ptr, len);
  3943. #ifdef UNIV_ZIP_DEBUG
  3944. ut_a(page_zip_validate(page_zip, page, NULL));
  3945. #endif /* UNIV_ZIP_DEBUG */
  3946. }
  3947. return(ptr + len);
  3948. }
  3949. /**********************************************************************//**
  3950. Write a log record of writing to the uncompressed header portion of a page. */
  3951. void
  3952. page_zip_write_header_log(
  3953. /*======================*/
  3954. const byte* data, /*!< in: data on the uncompressed page */
  3955. ulint length, /*!< in: length of the data */
  3956. mtr_t* mtr) /*!< in: mini-transaction */
  3957. {
  3958. byte* log_ptr = mlog_open(mtr, 11 + 1 + 1);
  3959. ulint offset = page_offset(data);
  3960. ut_ad(offset < PAGE_DATA);
  3961. ut_ad(offset + length < PAGE_DATA);
  3962. compile_time_assert(PAGE_DATA < 256U);
  3963. ut_ad(length > 0);
  3964. ut_ad(length < 256);
  3965. /* If no logging is requested, we may return now */
  3966. if (UNIV_UNLIKELY(!log_ptr)) {
  3967. return;
  3968. }
  3969. log_ptr = mlog_write_initial_log_record_fast(
  3970. (byte*) data, MLOG_ZIP_WRITE_HEADER, log_ptr, mtr);
  3971. *log_ptr++ = (byte) offset;
  3972. *log_ptr++ = (byte) length;
  3973. mlog_close(mtr, log_ptr);
  3974. mlog_catenate_string(mtr, data, length);
  3975. }
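/* A MLOG_ZIP_WRITE_HEADER record is thus the initial log header followed by
a 1-byte offset within the page header, a 1-byte length, and the logged
bytes themselves; both values fit in one byte because PAGE_DATA < 256, as
asserted above. */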
  3976. /**********************************************************************//**
  3977. Reorganize and compress a page. This is a low-level operation for
  3978. compressed pages, to be used when page_zip_compress() fails.
  3979. On success, a redo log entry MLOG_ZIP_PAGE_COMPRESS will be written.
  3980. The function btr_page_reorganize() should be preferred whenever possible.
  3981. IMPORTANT: if page_zip_reorganize() is invoked on a leaf page of a
  3982. non-clustered index, the caller must update the insert buffer free
  3983. bits in the same mini-transaction in such a way that the modification
  3984. will be redo-logged.
  3985. @return TRUE on success, FALSE on failure; page_zip will be left
  3986. intact on failure, but page will be overwritten. */
  3987. ibool
  3988. page_zip_reorganize(
  3989. /*================*/
  3990. buf_block_t* block, /*!< in/out: page with compressed page;
  3991. on the compressed page, in: size;
  3992. out: data, n_blobs,
  3993. m_start, m_end, m_nonempty */
  3994. dict_index_t* index, /*!< in: index of the B-tree node */
  3995. mtr_t* mtr) /*!< in: mini-transaction */
  3996. {
  3997. buf_pool_t* buf_pool = buf_pool_from_block(block);
  3998. page_t* page = buf_block_get_frame(block);
  3999. buf_block_t* temp_block;
  4000. page_t* temp_page;
  4001. ut_ad(mtr_memo_contains(mtr, block, MTR_MEMO_PAGE_X_FIX));
  4002. ut_ad(page_is_comp(page));
  4003. ut_ad(!dict_index_is_ibuf(index));
  4004. ut_ad(!index->table->is_temporary());
  4005. /* Note that page_zip_validate(page_zip, page, index) may fail here. */
  4006. UNIV_MEM_ASSERT_RW(page, srv_page_size);
  4007. UNIV_MEM_ASSERT_RW(buf_block_get_page_zip(block)->data,
  4008. page_zip_get_size(buf_block_get_page_zip(block)));
  4009. /* Disable logging */
  4010. mtr_log_t log_mode = mtr_set_log_mode(mtr, MTR_LOG_NONE);
  4011. temp_block = buf_block_alloc(buf_pool);
  4012. btr_search_drop_page_hash_index(block);
  4013. temp_page = temp_block->frame;
  4014. /* Copy the old page to temporary space */
  4015. memcpy_aligned<UNIV_PAGE_SIZE_MIN>(temp_block->frame, block->frame,
  4016. srv_page_size);
  4017. /* Recreate the page: note that global data on page (possible
  4018. segment headers, next page-field, etc.) is preserved intact */
  4019. page_create(block, mtr, TRUE, dict_index_is_spatial(index));
  4020. /* Copy the records from the temporary space to the recreated page;
  4021. do not copy the lock bits yet */
  4022. page_copy_rec_list_end_no_locks(block, temp_block,
  4023. page_get_infimum_rec(temp_page),
  4024. index, mtr);
  4025. /* Copy the PAGE_MAX_TRX_ID or PAGE_ROOT_AUTO_INC. */
  4026. memcpy_aligned<8>(page + (PAGE_HEADER + PAGE_MAX_TRX_ID),
  4027. temp_page + (PAGE_HEADER + PAGE_MAX_TRX_ID), 8);
  4028. /* PAGE_MAX_TRX_ID must be set on secondary index leaf pages. */
  4029. ut_ad(dict_index_is_clust(index) || !page_is_leaf(temp_page)
  4030. || page_get_max_trx_id(page) != 0);
  4031. /* PAGE_MAX_TRX_ID must be zero on non-leaf pages other than
  4032. clustered index root pages. */
  4033. ut_ad(page_get_max_trx_id(page) == 0
  4034. || (dict_index_is_clust(index)
  4035. ? !page_has_siblings(temp_page)
  4036. : page_is_leaf(temp_page)));
  4037. /* Restore logging. */
  4038. mtr_set_log_mode(mtr, log_mode);
  4039. if (!page_zip_compress(block, index, page_zip_level, mtr)) {
  4040. buf_block_free(temp_block);
  4041. return(FALSE);
  4042. }
  4043. lock_move_reorganize_page(block, temp_block);
  4044. buf_block_free(temp_block);
  4045. return(TRUE);
  4046. }
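/* Logging is disabled while the records are copied back because the whole
reorganization is redo-logged as a single MLOG_ZIP_PAGE_COMPRESS record once
the page has been successfully recompressed. */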
  4047. /**********************************************************************//**
  4048. Copy the records of a page byte for byte. Do not copy the page header
  4049. or trailer, except those B-tree header fields that are directly
  4050. related to the storage of records. Also copy PAGE_MAX_TRX_ID.
  4051. NOTE: The caller must update the lock table and the adaptive hash index. */
  4052. void
  4053. page_zip_copy_recs(
  4054. buf_block_t* block, /*!< in/out: buffer block */
  4055. const page_zip_des_t* src_zip, /*!< in: compressed page */
  4056. const page_t* src, /*!< in: page */
  4057. dict_index_t* index, /*!< in: index of the B-tree */
  4058. mtr_t* mtr) /*!< in: mini-transaction */
  4059. {
  4060. page_t* page = block->frame;
  4061. page_zip_des_t* page_zip = &block->page.zip;
  4062. ut_ad(mtr_memo_contains_page(mtr, page, MTR_MEMO_PAGE_X_FIX));
  4063. ut_ad(mtr_memo_contains_page(mtr, src, MTR_MEMO_PAGE_X_FIX));
  4064. ut_ad(!dict_index_is_ibuf(index));
  4065. ut_ad(!index->table->is_temporary());
  4066. #ifdef UNIV_ZIP_DEBUG
  4067. /* The B-tree operations that call this function may set
  4068. FIL_PAGE_PREV or PAGE_LEVEL, causing a temporary min_rec_flag
  4069. mismatch. A strict page_zip_validate() will be executed later
  4070. during the B-tree operations. */
  4071. ut_a(page_zip_validate_low(src_zip, src, index, TRUE));
  4072. #endif /* UNIV_ZIP_DEBUG */
  4073. ut_a(page_zip_get_size(page_zip) == page_zip_get_size(src_zip));
  4074. if (UNIV_UNLIKELY(src_zip->n_blobs)) {
  4075. ut_a(page_is_leaf(src));
  4076. ut_a(dict_index_is_clust(index));
  4077. }
  4078. UNIV_MEM_ASSERT_W(page, srv_page_size);
  4079. UNIV_MEM_ASSERT_W(page_zip->data, page_zip_get_size(page_zip));
  4080. UNIV_MEM_ASSERT_RW(src, srv_page_size);
  4081. UNIV_MEM_ASSERT_RW(src_zip->data, page_zip_get_size(page_zip));
  4082. /* Copy those B-tree page header fields that are related to
  4083. the records stored in the page. Also copy the field
  4084. PAGE_MAX_TRX_ID. Skip the rest of the page header and
  4085. trailer. On the compressed page, there is no trailer. */
  4086. compile_time_assert(PAGE_MAX_TRX_ID + 8 == PAGE_HEADER_PRIV_END);
  4087. memcpy_aligned<2>(PAGE_HEADER + page, PAGE_HEADER + src,
  4088. PAGE_HEADER_PRIV_END);
  4089. memcpy_aligned<2>(PAGE_DATA + page, PAGE_DATA + src,
  4090. srv_page_size - (PAGE_DATA + FIL_PAGE_DATA_END));
  4091. memcpy_aligned<2>(PAGE_HEADER + page_zip->data,
  4092. PAGE_HEADER + src_zip->data,
  4093. PAGE_HEADER_PRIV_END);
  4094. memcpy_aligned<2>(PAGE_DATA + page_zip->data,
  4095. PAGE_DATA + src_zip->data,
  4096. page_zip_get_size(page_zip) - PAGE_DATA);
  4097. if (dict_index_is_clust(index)) {
  4098. /* Reset the PAGE_ROOT_AUTO_INC field when copying
  4099. from a root page. */
  4100. memset_aligned<8>(PAGE_HEADER + PAGE_ROOT_AUTO_INC
  4101. + page, 0, 8);
  4102. memset_aligned<8>(PAGE_HEADER + PAGE_ROOT_AUTO_INC
  4103. + page_zip->data, 0, 8);
  4104. } else {
  4105. /* The PAGE_MAX_TRX_ID must be nonzero on leaf pages
  4106. of secondary indexes, and 0 on others. */
  4107. ut_ad(!page_is_leaf(src) == !page_get_max_trx_id(src));
  4108. }
  4109. /* Copy all fields of src_zip to page_zip, except the pointer
  4110. to the compressed data page. */
  4111. {
  4112. page_zip_t* data = page_zip->data;
  4113. memcpy(page_zip, src_zip, sizeof *page_zip);
  4114. page_zip->data = data;
  4115. }
  4116. ut_ad(page_zip_get_trailer_len(page_zip, dict_index_is_clust(index))
  4117. + page_zip->m_end < page_zip_get_size(page_zip));
  4118. if (!page_is_leaf(src)
  4119. && UNIV_UNLIKELY(!page_has_prev(src))
  4120. && UNIV_LIKELY(page_has_prev(page))) {
  4121. /* Clear the REC_INFO_MIN_REC_FLAG of the first user record. */
  4122. ulint offs = rec_get_next_offs(page + PAGE_NEW_INFIMUM,
  4123. TRUE);
  4124. if (UNIV_LIKELY(offs != PAGE_NEW_SUPREMUM)) {
  4125. rec_t* rec = page + offs;
  4126. ut_a(rec[-REC_N_NEW_EXTRA_BYTES]
  4127. & REC_INFO_MIN_REC_FLAG);
  4128. rec[-REC_N_NEW_EXTRA_BYTES] &= ~ REC_INFO_MIN_REC_FLAG;
  4129. }
  4130. }
  4131. #ifdef UNIV_ZIP_DEBUG
  4132. ut_a(page_zip_validate(page_zip, page, index));
  4133. #endif /* UNIV_ZIP_DEBUG */
  4134. page_zip_compress_write_log(block, index, mtr);
  4135. }
  4136. /** Parse and optionally apply MLOG_ZIP_PAGE_COMPRESS.
  4137. @param[in] ptr log record
  4138. @param[in] end_ptr end of log
  4139. @param[in,out] block ROW_FORMAT=COMPRESSED block, or NULL for parsing only
  4140. @return end of log record
  4141. @retval NULL if the log record is incomplete */
  4142. const byte* page_zip_parse_compress(const byte* ptr, const byte* end_ptr,
  4143. buf_block_t* block)
  4144. {
  4145. ulint size;
  4146. ulint trailer_size;
  4147. ut_ad(ptr != NULL);
4148. ut_ad(end_ptr != NULL);
  4149. if (UNIV_UNLIKELY(ptr + (2 + 2) > end_ptr)) {
  4150. return(NULL);
  4151. }
  4152. size = mach_read_from_2(ptr);
  4153. ptr += 2;
  4154. trailer_size = mach_read_from_2(ptr);
  4155. ptr += 2;
  4156. if (UNIV_UNLIKELY(ptr + 8 + size + trailer_size > end_ptr)) {
  4157. return(NULL);
  4158. }
  4159. if (block) {
  4160. ut_ad(buf_block_get_state(block) == BUF_BLOCK_FILE_PAGE);
  4161. page_zip_des_t* page_zip = buf_block_get_page_zip(block);
  4162. if (!page_zip || page_zip_get_size(page_zip) < size
  4163. || block->page.id.page_no() < 3) {
  4164. corrupt:
  4165. recv_sys.found_corrupt_log = TRUE;
  4166. return(NULL);
  4167. }
  4168. memset(page_zip->data, 0, page_zip_get_size(page_zip));
  4169. mach_write_to_4(FIL_PAGE_OFFSET
  4170. + page_zip->data, block->page.id.page_no());
  4171. mach_write_to_4(FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID
  4172. + page_zip->data, block->page.id.space());
  4173. memcpy(page_zip->data + FIL_PAGE_PREV, ptr, 4);
  4174. memcpy(page_zip->data + FIL_PAGE_NEXT, ptr + 4, 4);
  4175. memcpy(page_zip->data + FIL_PAGE_TYPE, ptr + 8, size);
  4176. memset(page_zip->data + FIL_PAGE_TYPE + size, 0,
  4177. page_zip_get_size(page_zip) - trailer_size
  4178. - (FIL_PAGE_TYPE + size));
  4179. memcpy(page_zip->data + page_zip_get_size(page_zip)
  4180. - trailer_size, ptr + 8 + size, trailer_size);
  4181. if (UNIV_UNLIKELY(!page_zip_decompress(page_zip, block->frame,
  4182. TRUE))) {
  4183. goto corrupt;
  4184. }
  4185. }
  4186. return(const_cast<byte*>(ptr) + 8 + size + trailer_size);
  4187. }
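/* The MLOG_ZIP_PAGE_COMPRESS payload parsed above is: 2 bytes of compressed
data size, 2 bytes of trailer size, the 8 bytes of FIL_PAGE_PREV and
FIL_PAGE_NEXT, "size" bytes of compressed data starting at FIL_PAGE_TYPE,
and "trailer_size" bytes that are placed at the end of page_zip->data.
FIL_PAGE_OFFSET and the space identifier are not logged; they are
reconstructed from block->page.id during recovery. */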
  4188. #endif /* !UNIV_INNOCHECKSUM */
  4189. /** Calculate the compressed page checksum.
  4190. @param[in] data compressed page
  4191. @param[in] size size of compressed page
  4192. @param[in] algo algorithm to use
  4193. @return page checksum */
  4194. uint32_t
  4195. page_zip_calc_checksum(
  4196. const void* data,
  4197. ulint size,
  4198. srv_checksum_algorithm_t algo)
  4199. {
  4200. uLong adler;
  4201. const Bytef* s = static_cast<const byte*>(data);
  4202. /* Exclude FIL_PAGE_SPACE_OR_CHKSUM, FIL_PAGE_LSN,
  4203. and FIL_PAGE_FILE_FLUSH_LSN from the checksum. */
  4204. switch (algo) {
  4205. case SRV_CHECKSUM_ALGORITHM_FULL_CRC32:
  4206. case SRV_CHECKSUM_ALGORITHM_STRICT_FULL_CRC32:
  4207. case SRV_CHECKSUM_ALGORITHM_CRC32:
  4208. case SRV_CHECKSUM_ALGORITHM_STRICT_CRC32:
  4209. ut_ad(size > FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID);
  4210. return ut_crc32(s + FIL_PAGE_OFFSET,
  4211. FIL_PAGE_LSN - FIL_PAGE_OFFSET)
  4212. ^ ut_crc32(s + FIL_PAGE_TYPE, 2)
  4213. ^ ut_crc32(s + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID,
  4214. size - FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID);
  4215. case SRV_CHECKSUM_ALGORITHM_INNODB:
  4216. case SRV_CHECKSUM_ALGORITHM_STRICT_INNODB:
  4217. ut_ad(size > FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID);
  4218. adler = adler32(0L, s + FIL_PAGE_OFFSET,
  4219. FIL_PAGE_LSN - FIL_PAGE_OFFSET);
  4220. adler = adler32(adler, s + FIL_PAGE_TYPE, 2);
  4221. adler = adler32(
  4222. adler, s + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID,
  4223. static_cast<uInt>(size)
  4224. - FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID);
  4225. return(uint32_t(adler));
  4226. case SRV_CHECKSUM_ALGORITHM_NONE:
  4227. case SRV_CHECKSUM_ALGORITHM_STRICT_NONE:
  4228. return(BUF_NO_CHECKSUM_MAGIC);
  4229. /* no default so the compiler will emit a warning if new enum
  4230. is added and not handled here */
  4231. }
  4232. ut_error;
  4233. return(0);
  4234. }
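/* For the crc32 and innodb algorithms the checksum above covers three byte
ranges of the compressed page: [FIL_PAGE_OFFSET, FIL_PAGE_LSN), the two
bytes at FIL_PAGE_TYPE, and [FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID, size),
leaving out the stored checksum, FIL_PAGE_LSN and FIL_PAGE_FILE_FLUSH_LSN. */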
  4235. /** Verify a compressed page's checksum.
  4236. @param[in] data compressed page
  4237. @param[in] size size of compressed page
  4238. @return whether the stored checksum is valid according to the value of
  4239. innodb_checksum_algorithm */
  4240. bool page_zip_verify_checksum(const void* data, ulint size)
  4241. {
  4242. const uint32_t stored = mach_read_from_4(
  4243. static_cast<const byte*>(data) + FIL_PAGE_SPACE_OR_CHKSUM);
  4244. compile_time_assert(!(FIL_PAGE_LSN % 8));
  4245. /* Check if page is empty */
  4246. if (stored == 0
  4247. && *reinterpret_cast<const ib_uint64_t*>(static_cast<const char*>(
  4248. data)
  4249. + FIL_PAGE_LSN) == 0) {
  4250. /* make sure that the page is really empty */
  4251. #ifdef UNIV_INNOCHECKSUM
  4252. ulint i;
  4253. for (i = 0; i < size; i++) {
  4254. if (*((const char*) data + i) != 0)
  4255. break;
  4256. }
  4257. if (i >= size) {
  4258. if (log_file) {
  4259. fprintf(log_file, "Page::%llu is empty and"
  4260. " uncorrupted\n", cur_page_num);
  4261. }
  4262. return(TRUE);
  4263. }
  4264. #else
  4265. for (ulint i = 0; i < size; i++) {
  4266. if (*((const char*) data + i) != 0) {
  4267. return(FALSE);
  4268. }
  4269. }
  4270. /* Empty page */
  4271. return(TRUE);
  4272. #endif /* UNIV_INNOCHECKSUM */
  4273. }
  4274. const srv_checksum_algorithm_t curr_algo =
  4275. static_cast<srv_checksum_algorithm_t>(srv_checksum_algorithm);
  4276. if (curr_algo == SRV_CHECKSUM_ALGORITHM_NONE) {
  4277. return(TRUE);
  4278. }
  4279. uint32_t calc = page_zip_calc_checksum(data, size, curr_algo);
  4280. #ifdef UNIV_INNOCHECKSUM
  4281. if (log_file) {
  4282. fprintf(log_file, "page::%llu;"
  4283. " %s checksum: calculated = %u;"
  4284. " recorded = %u\n", cur_page_num,
  4285. buf_checksum_algorithm_name(
  4286. static_cast<srv_checksum_algorithm_t>(
  4287. srv_checksum_algorithm)),
  4288. calc, stored);
  4289. }
  4290. if (!strict_verify) {
  4291. const uint32_t crc32 = page_zip_calc_checksum(
  4292. data, size, SRV_CHECKSUM_ALGORITHM_CRC32);
  4293. if (log_file) {
  4294. fprintf(log_file, "page::%llu: crc32 checksum:"
  4295. " calculated = %u; recorded = %u\n",
  4296. cur_page_num, crc32, stored);
  4297. fprintf(log_file, "page::%llu: none checksum:"
  4298. " calculated = %lu; recorded = %u\n",
  4299. cur_page_num, BUF_NO_CHECKSUM_MAGIC, stored);
  4300. }
  4301. }
  4302. #endif /* UNIV_INNOCHECKSUM */
  4303. if (stored == calc) {
  4304. return(TRUE);
  4305. }
  4306. switch (curr_algo) {
  4307. case SRV_CHECKSUM_ALGORITHM_STRICT_FULL_CRC32:
  4308. case SRV_CHECKSUM_ALGORITHM_STRICT_CRC32:
  4309. case SRV_CHECKSUM_ALGORITHM_STRICT_INNODB:
  4310. case SRV_CHECKSUM_ALGORITHM_STRICT_NONE:
  4311. return FALSE;
  4312. case SRV_CHECKSUM_ALGORITHM_FULL_CRC32:
  4313. case SRV_CHECKSUM_ALGORITHM_CRC32:
  4314. if (stored == BUF_NO_CHECKSUM_MAGIC) {
  4315. return(TRUE);
  4316. }
  4317. return stored == page_zip_calc_checksum(
  4318. data, size, SRV_CHECKSUM_ALGORITHM_INNODB);
  4319. case SRV_CHECKSUM_ALGORITHM_INNODB:
  4320. if (stored == BUF_NO_CHECKSUM_MAGIC) {
  4321. return TRUE;
  4322. }
  4323. return stored == page_zip_calc_checksum(
  4324. data, size, SRV_CHECKSUM_ALGORITHM_CRC32);
  4325. case SRV_CHECKSUM_ALGORITHM_NONE:
  4326. return TRUE;
  4327. }
  4328. return FALSE;
  4329. }
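/* Summary of the fallback rules above: with a strict_* setting only an
exact match (or an all-zero page) is accepted; with crc32 or full_crc32 a
stored BUF_NO_CHECKSUM_MAGIC or a matching innodb checksum also passes; with
innodb a stored BUF_NO_CHECKSUM_MAGIC or a matching crc32 checksum also
passes; and with none any stored value is accepted. */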