MDEV-13637 InnoDB change buffer housekeeping can cause redo log overrun and possibly deadlocks

The function ibuf_remove_free_page() may be called while the caller is holding several mutexes or rw-locks. Because of this, this housekeeping loop may cause performance glitches for operations that involve tables that are stored in the InnoDB system tablespace. Also deadlocks might be possible. The worst impact of all is that due to the mutexes being held, calls to log_free_check() had to be skipped during this housekeeping. This means that the cyclic InnoDB redo log may be overwritten. If the system crashes during this, it would be unable to recover.

The entry point to the problematic code is ibuf_free_excess_pages(). It would make sense to call it before acquiring any mutexes or rw-locks, in any 'pessimistic' operation that involves the system tablespace.

fseg_create_general(), fseg_alloc_free_page_general(): Do not call ibuf_free_excess_pages() while potentially holding some latches.

ibuf_remove_free_page(): Do call log_free_check(), like every operation that is about to generate redo log should do.

ibuf_free_excess_pages(): Remove some assertions that are replaced by stricter assertions in the log_free_check() that is now called by ibuf_remove_free_page().

row_mtr_start(): New function, to perform necessary preparations when starting a mini-transaction for row operations. For pessimistic operations on secondary indexes that are located in the system tablespace, this includes calling ibuf_free_excess_pages().

row_undo_ins_remove_sec_low(), row_undo_mod_del_mark_or_remove_sec_low(), row_undo_mod_del_unmark_sec_and_undo_update(): Call row_mtr_start().

row_ins_sec_index_entry(): Call ibuf_free_excess_pages() if the operation may involve allocating pages and change buffering in the system tablespace.

row_upd_sec_index_entry(): Slightly refactor the code. The delete-marking of the old entry is done in-place. It could be change-buffered, but the old code was unlikely to have invoked ibuf_free_excess_pages() in this case.

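A minimal standalone sketch (simplified stand-in types, not the actual InnoDB code) of the ordering rule this fix establishes: redo-generating housekeeping such as ibuf_free_excess_pages() must run before the mini-transaction acquires any latches, so that log_free_check() can safely wait for a checkpoint:

    // Standalone model of the ordering rule; mtr_t and the latch bookkeeping
    // here are illustrative, not InnoDB's real definitions.
    #include <cassert>

    struct mtr_t { bool started = false; void start() { started = true; } };

    bool latches_held = false;   // stand-in for the caller's latch state

    void log_free_check() { assert(!latches_held); /* may wait for a checkpoint */ }
    void ibuf_free_excess_pages() { log_free_check(); /* frees pages, writes redo */ }

    // Modeled after the described row_mtr_start(); the parameters are
    // illustrative, not the real signature.
    void row_mtr_start(mtr_t* mtr, bool sys_tablespace, bool pessimistic)
    {
        if (sys_tablespace && pessimistic)
            ibuf_free_excess_pages();   // safe: no latches are held yet
        log_free_check();
        mtr->start();                   // latches are acquired only after this
    }
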
MDEV-14407 Assertion failure during rollback

Rollback attempted to dereference DB_ROLL_PTR=0, which cannot possibly be a valid undo log pointer. A safer canonical value would be roll_ptr_t(1) << ROLL_PTR_INSERT_FLAG_POS, which is what was chosen in MDEV-12288, corresponding to reset_trx_id.

No deterministic test case for the bug was found. The simplest test cases may be related to MDEV-11415, which suppresses undo logging for ALGORITHM=COPY operations. In those operations, in the spirit of MDEV-12288, we should actually have written reset_trx_id instead of using the transaction identifier of the current transaction (and a bogus value of DB_ROLL_PTR=0). However, thanks to MySQL Bug#28432, which I had fixed in MySQL 5.6.8 as part of WL#6255, access to the rebuilt table by earlier-started transactions should actually have been refused with ER_TABLE_DEF_CHANGED.

reset_trx_id: Move the definition to data0type.cc and the declaration to data0type.h.

btr_cur_ins_lock_and_undo(): When undo logging is disabled, use the safe value that corresponds to reset_trx_id.

btr_cur_optimistic_insert(): Validate the DB_TRX_ID,DB_ROLL_PTR before inserting into a clustered index leaf page.

ins_node_t::sys_buf[]: Replaces row_id_buf and trx_id_buf and some heap usage.

row_ins_alloc_sys_fields(): Init ins_node_t::sys_buf[] to reset_trx_id.

row_ins_buf(): Only if undo logging is enabled, copy trx->id to node->sys_buf. Otherwise, rely on the initialization in row_ins_alloc_sys_fields().

row_purge_reset_trx_id(): Invoke mlog_write_string() with reset_trx_id directly. (No functional change.)

trx_undo_page_report_modify(): Assert that the DB_ROLL_PTR is not 0.

trx_undo_get_undo_rec_low(): Assert that the roll_ptr is valid before trying to dereference it.

dict_index_t::is_primary(): Check if the index is the primary key.

PageConverter::adjust_cluster_record(): Fix MDEV-15249 Crash in MVCC read after IMPORT TABLESPACE by resetting the system fields to reset_trx_id instead of writing the current transaction ID (which will be committed at the end of the IMPORT TABLESPACE) and DB_ROLL_PTR=0. This can partially be viewed as a follow-up fix of MDEV-12288, because IMPORT should already then have written DB_TRX_ID=0 and DB_ROLL_PTR=1<<55 to prevent unnecessary DB_TRX_ID lookups in subsequent accesses to the table.

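As a worked illustration of the canonical value (the 55-bit position matches the DB_ROLL_PTR=1<<55 mentioned at the end of this message; everything else is a standalone stand-in), here is a check that the reset value is nonzero and carries the insert flag:

    #include <cassert>
    #include <cstdint>

    typedef uint64_t roll_ptr_t;
    const unsigned ROLL_PTR_INSERT_FLAG_POS = 55;  // top bit of the 7-byte DB_ROLL_PTR

    // The canonical 'no undo log record' value: insert flag set, all other
    // fields (rollback segment id, page number, offset) zero.
    const roll_ptr_t reset_roll_ptr = roll_ptr_t(1) << ROLL_PTR_INSERT_FLAG_POS;

    int main()
    {
        assert(reset_roll_ptr != 0);                             // unlike the bogus DB_ROLL_PTR=0
        assert(reset_roll_ptr >> ROLL_PTR_INSERT_FLAG_POS & 1);  // rollback treats it as a no-op
    }
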
MDEV-13485 MTR tests fail massively with --innodb-sync-debug

The parameter --innodb-sync-debug, which is disabled by default, aims to find potential deadlocks in InnoDB. When the parameter was enabled, many tests failed, mostly due to bogus diagnostics. As part of this fix, we are also fixing a bug in error handling code, removing dead code, and fixing cases where an uninitialized mutex was being locked and unlocked.

dict_create_foreign_constraints_low(): Remove an extraneous mutex_exit() call that could cause corruption in an error handling path. Also, do not unnecessarily acquire dict_foreign_err_mutex. Its only purpose is to control concurrent access to dict_foreign_err_file.

row_ins_foreign_trx_print(): Replace a redundant condition with a debug assertion.

srv_dict_tmpfile, srv_dict_tmpfile_mutex: Remove. The temporary file is never written to or read from.

log_free_check(): Allow SYNC_FTS_CACHE (fts_cache_t::lock) to be held.

ha_innobase::inplace_alter_table(), row_merge_insert_index_tuples(): Assert that no unexpected latches are being held.

sync_latch_meta_init(): Properly initialize dict_operation_lock_key at SYNC_DICT_OPERATION. dict_sys->mutex is SYNC_DICT, and the now-removed SRV_DICT_TMPFILE was wrongly registered at SYNC_DICT_OPERATION.

buf_block_init(): Correctly register buf_block_t::debug_latch. It was previously misleadingly reported as LATCH_ID_DICT_FOREIGN_ERR.

latch_level_t: Correct the relative latching order of SYNC_IBUF_PESS_INSERT_MUTEX,SYNC_INDEX_TREE and SYNC_FILE_FORMAT_TAG,SYNC_DICT_OPERATION to avoid bogus failures.

row_drop_table_for_mysql(): Avoid accessing btr_defragment_mutex if the defragmentation thread has not been started. This is the case during fts_drop_orphaned_tables() in recv_recovery_rollback_active().

fil_space_destroy_crypt_data(): Avoid acquiring fil_crypt_threads_mutex when it is uninitialized. We may have created crypt_data before the mutex was created, and the mutex creation would be skipped if InnoDB startup failed or --innodb-read-only was specified.

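For readers unfamiliar with what --innodb-sync-debug checks, here is an illustrative standalone model of latch-order validation (the real rules and levels live in InnoDB's latch_level_t and are more nuanced; this sketch only captures the core idea that a latch may be acquired only at a strictly lower level than any latch the thread already holds):

    #include <algorithm>
    #include <cassert>
    #include <vector>

    thread_local std::vector<int> held_levels;   // latch levels held by this thread

    void latch_enter(int level)
    {
        // Acquiring at a level >= one already held indicates a potential
        // deadlock, which is what --innodb-sync-debug reports.
        for (int h : held_levels)
            assert(level < h);
        held_levels.push_back(level);
    }

    void latch_exit(int level)
    {
        held_levels.erase(std::find(held_levels.begin(), held_levels.end(), level));
    }
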
MDEV-20377: Make WITH_MSAN more usable

MemorySanitizer (clang -fsanitize=memory) requires that all code be compiled with instrumentation enabled. The only exception is the C runtime library. Failure to use instrumented libraries will cause bogus messages about memory being uninitialized.

In WITH_MSAN builds, we must avoid calling getservbyname(), because even though it is a standard library function, it is not instrumented, not even in clang 10.

Note: Before MariaDB Server 10.5, ./mtr will typically fail due to the old PCRE library, which was updated in MDEV-14024.

The following cmake options were tested on 10.5 in commit 94d0bb4dbeb28a94d1f87fdd55f4297ff3df0157:

    cmake \
    -DCMAKE_C_FLAGS='-march=native -O2' \
    -DCMAKE_CXX_FLAGS='-stdlib=libc++ -march=native -O2' \
    -DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
    -DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF \
    -DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO \
    -DWITH_SAFEMALLOC=OFF \
    -DWITH_{ZLIB,SSL,PCRE}=bundled \
    -DHAVE_LIBAIO_H=0 \
    -DWITH_MSAN=ON

MEM_MAKE_DEFINED(): An alias for VALGRIND_MAKE_MEM_DEFINED() and __msan_unpoison().

MEM_GET_VBITS(), MEM_SET_VBITS(): Aliases for VALGRIND_GET_VBITS(), VALGRIND_SET_VBITS(), __msan_copy_shadow().

InnoDB: Replace the UNIV_MEM_ macros with corresponding MEM_ macros.

ut_crc32_8_hw(), ut_crc32_64_low_hw(): Use the compiler built-in functions instead of inline assembler when building WITH_MSAN. This will require at least -msse4.2 when building for IA-32 or AMD64. The inline assembler would not be instrumented, and would thus cause bogus failures.

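The aliasing described above can be sketched as a macro dispatch; the guard names here (HAVE_MSAN, HAVE_VALGRIND_MEMCHECK_H) are assumptions for illustration, while __msan_unpoison() and VALGRIND_MAKE_MEM_DEFINED() are the real primitives:

    #ifdef HAVE_MSAN
    # include <sanitizer/msan_interface.h>
    # define MEM_MAKE_DEFINED(addr, len) __msan_unpoison(addr, len)
    #elif defined(HAVE_VALGRIND_MEMCHECK_H)
    # include <valgrind/memcheck.h>
    # define MEM_MAKE_DEFINED(addr, len) VALGRIND_MAKE_MEM_DEFINED(addr, len)
    #else
    # define MEM_MAKE_DEFINED(addr, len) ((void) 0)   /* no-op in plain builds */
    #endif
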
MDEV-23557 Galera heap-buffer-overflow in wsrep_rec_get_foreign_key

This commit contains a fix and an extended test case for an ASAN failure reported during galera.fk mtr testing. The reported heap buffer overflow happens in a test case where a cascading foreign key constraint is defined for a column of varchar type, and galera.fk.test has such a vulnerable test scenario.

Troubleshooting revealed that the earlier fix for MDEV-19660 made cascading delete handling append wsrep keys from pcur->old_rec, in row_ins_foreign_check_on_constraint(), and the ASAN failure comes from later scanning of this old_rec reference.

The fix in this commit moves the call to wsrep_append_foreign_key() somewhat earlier, inside the ongoing mtr, and uses clust_rec, which is set earlier in the same mtr for both update and delete cascade operations. For wsrep key populating, it does not matter when the keys are populated; all keys just have to be appended before the wsrep transaction replicates. Note that I also tried a similar fix with the earlier wsrep key append but the old implementation using pcur->old_rec (instead of clust_rec), and the same ASAN failure was reported. So it appears that pcur->old_rec is not properly set to be used for wsrep key appending.

The galera.galera_fk_cascade_delete test has been extended by two new test scenarios:

* FK cascade on varchar column. This test case reproduces the same scenario as galera.fk, and will also trigger the ASAN failure with unfixed MariaDB versions.

* Multi-master conflict with FK cascading. This scenario causes a conflict between a replicated FK cascading transaction and a local transaction trying to modify the cascaded child table row. The local transaction should be aborted and get a deadlock error. This test scenario passes both with old MariaDB versions and with this commit.

MDEV-13564 Mariabackup does not work with TRUNCATE

Implement undo tablespace truncation via normal redo logging. Implement TRUNCATE TABLE as a combination of RENAME to a #sql-ib name, CREATE, and DROP.

Note: Orphan #sql-ib*.ibd files may be left behind if MariaDB Server 10.2 is killed before the DROP operation is committed. If MariaDB Server 10.2 is killed during TRUNCATE, it is also possible that the old table was renamed to #sql-ib*.ibd but the data dictionary will refer to the table using the original name. In MariaDB Server 10.3, RENAME inside InnoDB is transactional, and #sql-* tables will be dropped on startup. So, this new TRUNCATE will be fully crash-safe in 10.3.

ha_mroonga::wrapper_truncate(): Pass table options to the underlying storage engine, now that ha_innobase::truncate() will need them.

rpl_slave_state::truncate_state_table(): Before truncating mysql.gtid_slave_pos, evict any cached table handles from the table definition cache, so that there will be no stale references to the old table after truncating.

== TRUNCATE TABLE ==

WL#6501 in MySQL 5.7 introduced separate log files for implementing atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB undo and redo log. Some convoluted logic was added to the InnoDB crash recovery, and some extra synchronization (including a redo log checkpoint) was introduced to make this work. This synchronization has caused performance problems and race conditions, and the extra log files cannot be copied or applied by external backup programs. In order to support crash-upgrade from MariaDB 10.2, we will keep the logic for parsing and applying the extra log files, but we will no longer generate those files in TRUNCATE TABLE.

A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE (with full redo and undo logging and proper rollback). This will be implemented in MDEV-14717.

ha_innobase::truncate(): Invoke RENAME, create(), delete_table(). Because RENAME cannot be fully rolled back before MariaDB 10.3 due to missing undo logging, add some explicit rename-back in case the operation fails.

ha_innobase::delete(): Introduce a variant that takes sqlcom as a parameter. In TRUNCATE TABLE, we do not want to touch any FOREIGN KEY constraints.

ha_innobase::create(): Add the parameters file_per_table, trx. In TRUNCATE, the new table must be created in the same transaction that renames the old table.

create_table_info_t::create_table_info_t(): Add the parameters file_per_table, trx.

row_drop_table_for_mysql(): Replace a bool parameter with sqlcom.

row_drop_table_after_create_fail(): New function, wrapping row_drop_table_for_mysql().

dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(), fil_prepare_for_truncate(), fil_reinit_space_header_for_table(), row_truncate_table_for_mysql(), TruncateLogger, row_truncate_prepare(), row_truncate_rollback(), row_truncate_complete(), row_truncate_fts(), row_truncate_update_system_tables(), row_truncate_foreign_key_checks(), row_truncate_sanity_checks(): Remove.

row_upd_check_references_constraints(): Remove a check for TRUNCATE, now that the table is no longer truncated in place.

The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some race-condition-like scenarios. The test innodb.innodb-truncate does not use any synchronization.

We add a redo log subformat to indicate backup-friendly format. MariaDB 10.4 will remove support for the old TRUNCATE logging, so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve limitations.

== Undo tablespace truncation ==

MySQL 5.7 implements undo tablespace truncation. It is only possible when innodb_undo_tablespaces is set to at least 2. The logging is implemented similarly to the WL#6501 TRUNCATE, that is, using separate log files and a redo log checkpoint.

We can simply implement undo tablespace truncation within a single mini-transaction that reinitializes the undo log tablespace file. Unfortunately, due to the redo log format of some operations, currently, the total redo log written by undo tablespace truncation will be more than the combined size of the truncated undo tablespace. It should be acceptable to have a little more than 1 megabyte of log in a single mini-transaction. This will be fixed in MDEV-17138 in MariaDB Server 10.4.

recv_sys_t: Add truncated_undo_spaces[] to remember for which undo tablespaces a MLOG_FILE_CREATE2 record was seen.

namespace undo: Remove some unnecessary declarations.

fil_space_t::is_being_truncated: Document that this flag now only applies to undo tablespaces. Remove some references.

fil_space_t::is_stopping(): Do not refer to is_being_truncated. This check is for tablespaces of tables. Potentially used tablespaces are never truncated any more.

buf_dblwr_process(): Suppress the out-of-bounds warning for undo tablespaces.

fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero page number (the new size of the tablespace in pages) to inform crash recovery that the undo tablespace size has been reduced.

fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2 can be written for undo tablespaces (without .ibd file suffix) for a nonzero page number.

os_file_truncate(): Add the parameter allow_shrink=false so that undo tablespaces can actually be shrunk using this function.

fil_name_parse(): For undo tablespace truncation, buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[].

recv_read_in_area(): Avoid reading pages for which no redo log records remain buffered, after recv_addr_trim() removed them.

trx_rseg_header_create(): Add a FIXME comment that we could write much less redo log.

trx_undo_truncate_tablespace(): Reinitialize the undo tablespace in a single mini-transaction, which will be flushed to the redo log before the file size is trimmed.

recv_addr_trim(): Discard any redo logs for pages that were logged after the new end of a file, before the truncation LSN. If the rec_list becomes empty, reduce n_addrs. After removing any affected records, actually truncate the file.

recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before applying any log records. The undo tablespace files must be open at this point.

buf_flush_or_remove_pages(), buf_flush_dirty_pages(), buf_LRU_flush_or_remove_pages(): Add a parameter for specifying the number of the first page to flush or remove (default 0).

trx_purge_initiate_truncate(): Remove the log checkpoints, the extra logging, and some unnecessary crash points. Merge the code from trx_undo_truncate_tablespace(). First, flush all to-be-discarded pages (beyond the new end of the file), then trim the space->size to make the page allocation deterministic. At the only remaining crash injection point, flush the redo log, so that the recovery can be tested.

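The new TRUNCATE flow can be summarized by a control-flow sketch; the helper names below are hypothetical stand-ins for the RENAME, create() and delete_table() steps described above, with error handling heavily reduced:

    #include <string>

    enum dberr_t { DB_SUCCESS, DB_ERROR };

    // Hypothetical stubs for the three steps; the real work happens inside
    // InnoDB's rename/create/drop code paths.
    dberr_t rename_table(const std::string&, const std::string&) { return DB_SUCCESS; }
    dberr_t create_empty_table(const std::string&) { return DB_SUCCESS; }
    dberr_t drop_table(const std::string&) { return DB_SUCCESS; }

    dberr_t truncate_table(const std::string& name)
    {
        const std::string tmp = "#sql-ib-" + name;   // illustrative temporary name
        dberr_t err = rename_table(name, tmp);
        if (err != DB_SUCCESS)
            return err;
        if ((err = create_empty_table(name)) != DB_SUCCESS) {
            // RENAME is not yet transactional in 10.2: rename back explicitly.
            rename_table(tmp, name);
            return err;
        }
        // Drop the old data; FOREIGN KEY constraints are not touched here.
        return drop_table(tmp);
    }
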
MDEV-20812 Unexpected ER_ROW_IS_REFERENCED_2 upon DELETE from versioned table with FK

The original MDEV-16210 case wrongly allowed versioned DELETE from a referenced table where the reference is by a non-primary key. InnoDB UPDATE has an optimization for new rows that do not change their clustered index position: in this case InnoDB does not update all secondary indexes, and misses the one holding the referenced key. The fix was to disable this optimization for versioned DELETE. For versioned DELETE we forcibly update all secondary indexes and therefore check them for constraints.

But the above fix raised another problem with versioned DELETE on the foreign table side. When there was no corresponding record in the referenced table (an illegal foreign reference can be created with "set foreign_key_checks=off"), there was a spurious constraint check (because versioned DELETE is actually UPDATE), and hence the operation failed with a constraint error.

MDEV-16210 tried to fix the above problem by checking the foreign table instead of the referenced table, and that was wrong. Constraint checking is done by row_ins_check_foreign_constraint() no matter what kind of table is checked, referenced or foreign (controlled by the check_ref argument). The referenced table is checked by row_upd_check_references_constraints(). The foreign table is checked by row_ins_check_foreign_constraints().

The current fix rolls back the wrong fix for the above problem and disables the referenced table check for DELETE on the foreign side by introducing a `check_foreign` argument which, when set to *false*, skips the row_ins_check_foreign_constraints() call.

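Schematically (stub functions, not the real call sites), the gating after this fix looks like the following: a versioned DELETE on the foreign (child) side passes check_foreign=false, so only the referenced-side check can run:

    enum dberr_t { DB_SUCCESS, DB_ROW_IS_REFERENCED };

    // Stubs standing in for the real constraint-check entry points.
    dberr_t row_upd_check_references_constraints() { return DB_SUCCESS; }
    dberr_t row_ins_check_foreign_constraints() { return DB_SUCCESS; }

    dberr_t update_constraint_checks(bool is_referenced, bool has_foreign,
                                     bool check_foreign)
    {
        dberr_t err = DB_SUCCESS;
        if (is_referenced)                   // referenced (parent) side
            err = row_upd_check_references_constraints();
        if (err == DB_SUCCESS && has_foreign && check_foreign)
            err = row_ins_check_foreign_constraints();  // skipped for versioned DELETE
        return err;
    }
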
MDEV-20950 Reduce size of record offsets

offset_t: a type which represents one record offset. It is an unsigned short int.

Many functions: replace ulint with offset_t.

btr_pcur_restore_position_func(), page_validate(), row_ins_scan_sec_index_for_duplicate(), row_upd_clust_rec_by_insert_inherit_func(), row_vers_impl_x_locked_low(), trx_undo_prev_version_build(): allocate record offsets on the stack instead of waiting for rec_get_offsets() to allocate them from a mem_heap_t, thus reducing memory allocations.

RECORD_OFFSET, INDEX_OFFSET: it is now less convenient to store pointers in an offset_t* array, since one pointer now occupies several offset_t elements. These constants are the start indexes into the array for the places where the pointer values are stored.

REC_OFFS_HEADER_SIZE: adjusted for the new reality.

REC_OFFS_NORMAL_SIZE: increased from 100 to 300, which means fewer heap allocations. sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) is now 600 bytes, which is smaller than the previous 800 bytes.

REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality.

rem0rec.h, rem0rec.ic, rem0rec.cc: various argument, return value and local variable types were changed to fix numerous integer conversion issues.

enum field_type_t: an offset type concept was introduced, replacing the old offset flags. As in the earlier version, the 2 upper bits are used to store the offset type, and this enum represents those types.

REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed.

get_type(), set_type(), get_value(), combine(): convenience functions for working with offsets and their types.

rec_offs_base()[0]: still uses the old scheme with the flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL.

rec_offs_base()[i]: these now have type offset_t. The two upper bits contain the type.

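A standalone model of the scheme (the bit layout follows the description: the 2 upper bits of a 16-bit offset_t carry the type; the enumerator names are illustrative, not the real enum):

    #include <cassert>
    #include <cstdint>

    typedef uint16_t offset_t;

    enum field_type_t { STORED_IN_RECORD, STORED_OFFPAGE, SQL_NULL, DEFAULT };

    const unsigned TYPE_SHIFT = 14;   // 16 bits - 2 type bits
    const offset_t VALUE_MASK = (offset_t(1) << TYPE_SHIFT) - 1;

    field_type_t get_type(offset_t o) { return field_type_t(o >> TYPE_SHIFT); }
    offset_t get_value(offset_t o) { return offset_t(o & VALUE_MASK); }
    offset_t combine(offset_t value, field_type_t type)
    { return offset_t(unsigned(type) << TYPE_SHIFT | (value & VALUE_MASK)); }

    int main()
    {
        offset_t o = combine(1234, SQL_NULL);
        assert(get_type(o) == SQL_NULL && get_value(o) == 1234);
    }
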
MDEV-11369 Instant ADD COLUMN for InnoDB

For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables.

This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation.

ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.

Some usability limitations will be addressed in subsequent work:

MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE

The format of the clustered index (PRIMARY KEY) is changed as follows:

(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX.

(2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked.

(3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert.

(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.)

We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC.

This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively.

When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index.

UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record.

len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.

dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.

dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value().

dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column.

dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name().

dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields.

dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).

dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page.

dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value().

dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN.

dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back.

dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant().

dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN.

dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN.

prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache.

dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted.

dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point).

dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns().

pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns().

row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns().

create_index_dict(): Replaces row_merge_create_index_graph().

innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs.

btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary.

dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables.

dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE.

dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant().

row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE.

commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.)

PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.

page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.

page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION.

page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.

page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION.

rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields.

rec_offs_make_valid(): Add the parameter 'leaf'.

rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present.

dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple().

rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields.

cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records.

trx_t::in_rollback: Make the field available in non-debug builds.

trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written.

btr_index_rec_validate(): Account for instant ADD COLUMN.

row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index.

row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case.

dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted.

btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed.

row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low().

rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them.

rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields.

btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'.

row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t.

rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'.

REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record.

rec_comp_status_t: An enum of the status bit values.

rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().

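The PAGE_INSTANT packing can be illustrated with standalone accessors over the repurposed 16-bit word (function names are suffixed _bits to avoid suggesting the real API; the low-3-bits/high-13-bits split follows the description above):

    #include <cassert>
    #include <cstdint>

    uint16_t direction_word;   // the 2 bytes that used to hold only PAGE_DIRECTION

    unsigned page_direction_bits() { return direction_word & 7; }   // 3 bits
    unsigned page_instant_bits() { return direction_word >> 3; }    // 13 bits

    void set_page_instant_bits(unsigned n_core_fields)
    {
        assert(n_core_fields < (1u << 13));
        direction_word = uint16_t(n_core_fields << 3 | page_direction_bits());
    }
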
MDEV-12358 Work around what looks like a bug in GCC 7.1.0

The parameter thr of the function btr_cur_optimistic_insert() is not declared as nonnull, but GCC 7.1.0 with -O3 is wrongly optimizing away the first part of the condition UNIV_UNLIKELY(thr && thr_get_trx(thr)->fake_changes) when the function is being called by row_merge_insert_index_tuples() with thr==NULL.

The fake_changes is an XtraDB addition. This GCC bug only appears to have an impact on XtraDB, not InnoDB. We work around the problem by not attempting to dereference thr when both BTR_NO_LOCKING_FLAG and BTR_NO_UNDO_LOG_FLAG are set in the flags. Probably BTR_NO_LOCKING_FLAG alone should suffice.

btr_cur_optimistic_insert(), btr_cur_pessimistic_insert(), btr_cur_pessimistic_update(): Correct comments that disagree with usage and with nonnull attributes. No other parameter than thr can actually be NULL.

row_ins_duplicate_error_in_clust(): Remove an unused parameter.

innobase_is_fake_change(): Unused function; remove.

ibuf_insert_low(), row_log_table_apply(), row_log_apply(), row_undo_mod_clust_low(): Because we will be passing BTR_NO_LOCKING_FLAG | BTR_NO_UNDO_LOG_FLAG in the flags, the trx->fake_changes flag will be treated as false, which is the right thing to do at these low-level operations (change buffer merge, ALTER TABLE…LOCK=NONE, or ROLLBACK). This might be fixing actual XtraDB bugs. Other callers that pass these two flags are also passing thr=NULL, implying fake_changes=false. (Some callers in ROLLBACK are passing BTR_NO_LOCKING_FLAG and a nonnull thr. In these callers, fake_changes better be false, to avoid corruption.)

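The shape of the workaround is simply to test the flags before the pointer; in this standalone sketch the flag bit values and types are illustrative:

    const unsigned BTR_NO_UNDO_LOG_FLAG = 1, BTR_NO_LOCKING_FLAG = 2;  // illustrative bits

    struct trx_t { bool fake_changes; };
    struct que_thr_t { trx_t* trx; };
    trx_t* thr_get_trx(que_thr_t* thr) { return thr->trx; }

    bool fake_changes_for_insert(unsigned flags, que_thr_t* thr)
    {
        // When both flags are set, the callers pass thr=NULL, so do not
        // even attempt the (miscompiled) thr dereference.
        if ((flags & (BTR_NO_UNDO_LOG_FLAG | BTR_NO_LOCKING_FLAG))
            == (BTR_NO_UNDO_LOG_FLAG | BTR_NO_LOCKING_FLAG))
            return false;
        return thr && thr_get_trx(thr)->fake_changes;
    }
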
MDEV-6076 Persistent AUTO_INCREMENT for InnoDB

This should be functionally equivalent to WL#6204 in MySQL 8.0.0, with the notable difference that the file format changes are limited to repurposing a previously unused data field in B-tree pages.

For persistent InnoDB tables, write the last used AUTO_INCREMENT value to the root page of the clustered index, in the previously unused (0) PAGE_MAX_TRX_ID field, now aliased as PAGE_ROOT_AUTO_INC. Unlike some other previously unused InnoDB data fields, this one was actually always zero-initialized, at least since MySQL 3.23.49.

The writes to PAGE_ROOT_AUTO_INC are protected by SX or X latch on the root page. The SX latch will allow concurrent read access to the root page. (The field PAGE_ROOT_AUTO_INC will only be read on the first-time call to ha_innobase::open() from the SQL layer. The PAGE_ROOT_AUTO_INC can only be updated when executing SQL, so read/write races are not possible.)

During INSERT, the PAGE_ROOT_AUTO_INC is updated by the low-level function btr_cur_search_to_nth_level(), adding no extra page access. [Adaptive hash index lookup will be disabled during INSERT.]

If some rare UPDATE modifies an AUTO_INCREMENT column, the PAGE_ROOT_AUTO_INC will be adjusted in a separate mini-transaction in ha_innobase::update_row().

When a page is reorganized, we have to preserve the PAGE_ROOT_AUTO_INC field.

During ALTER TABLE, the initial AUTO_INCREMENT value will be copied from the table. ALGORITHM=COPY and online log apply in LOCK=NONE will update PAGE_ROOT_AUTO_INC in real time.

innodb_col_no(): Determine the dict_table_t::cols[] element index corresponding to a Field of a non-virtual column. (The MySQL 5.7 implementation of virtual columns breaks the 1:1 relationship between Field::field_index and dict_table_t::cols[]. Virtual columns are omitted from dict_table_t::cols[]. Therefore, we must translate the field_index of AUTO_INCREMENT columns into an index of dict_table_t::cols[].)

Upgrade from old data files: By default, the AUTO_INCREMENT sequence in old data files would appear to be reset, because PAGE_MAX_TRX_ID or PAGE_ROOT_AUTO_INC would contain the value 0 in each clustered index page. In new data files, PAGE_ROOT_AUTO_INC can only be 0 if the table is empty or does not contain any AUTO_INCREMENT column. For backward compatibility, we use the old method of SELECT MAX(auto_increment_column) for initializing the sequence.

btr_read_autoinc(): Read the AUTO_INCREMENT sequence from a new-format data file.

btr_read_autoinc_with_fallback(): A variant of btr_read_autoinc() that will resort to reading MAX(auto_increment_column) for data files that did not use AUTO_INCREMENT yet. It was manually tested that during the execution of innodb.autoinc_persist the compatibility logic is not activated (for new files, PAGE_ROOT_AUTO_INC is never 0 in nonempty clustered index root pages).

initialize_auto_increment(): Replaces ha_innobase::innobase_initialize_autoinc(). This initializes the AUTO_INCREMENT metadata. Only called from ha_innobase::open().

ha_innobase::info_low(): Do not try to lazily initialize dict_table_t::autoinc. It must already have been initialized by ha_innobase::open() or ha_innobase::create().

Note: The adjustments to class ha_innopart were not tested, because the source code (native InnoDB partitioning) is not being compiled.

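Reading the repurposed field amounts to decoding 8 big-endian bytes at the PAGE_MAX_TRX_ID position of the clustered index root page; this standalone sketch treats the offset constants as assumptions of the illustration:

    #include <cstdint>

    const unsigned PAGE_HEADER = 38, PAGE_ROOT_AUTO_INC = 18;  // assumed page offsets

    uint64_t read_page_root_autoinc(const unsigned char* root_page)
    {
        const unsigned char* p = root_page + PAGE_HEADER + PAGE_ROOT_AUTO_INC;
        uint64_t v = 0;
        for (int i = 0; i < 8; i++)   // InnoDB stores integers big-endian
            v = v << 8 | p[i];
        return v;   // 0 in old files: fall back to SELECT MAX(auto_increment_column)
    }
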
9 years ago
MDEV-6076 Persistent AUTO_INCREMENT for InnoDB This should be functionally equivalent to WL#6204 in MySQL 8.0.0, with the notable difference that the file format changes are limited to repurposing a previously unused data field in B-tree pages. For persistent InnoDB tables, write the last used AUTO_INCREMENT value to the root page of the clustered index, in the previously unused (0) PAGE_MAX_TRX_ID field, now aliased as PAGE_ROOT_AUTO_INC. Unlike some other previously unused InnoDB data fields, this one was actually always zero-initialized, at least since MySQL 3.23.49. The writes to PAGE_ROOT_AUTO_INC are protected by SX or X latch on the root page. The SX latch will allow concurrent read access to the root page. (The field PAGE_ROOT_AUTO_INC will only be read on the first-time call to ha_innobase::open() from the SQL layer. The PAGE_ROOT_AUTO_INC can only be updated when executing SQL, so read/write races are not possible.) During INSERT, the PAGE_ROOT_AUTO_INC is updated by the low-level function btr_cur_search_to_nth_level(), adding no extra page access. [Adaptive hash index lookup will be disabled during INSERT.] If some rare UPDATE modifies an AUTO_INCREMENT column, the PAGE_ROOT_AUTO_INC will be adjusted in a separate mini-transaction in ha_innobase::update_row(). When a page is reorganized, we have to preserve the PAGE_ROOT_AUTO_INC field. During ALTER TABLE, the initial AUTO_INCREMENT value will be copied from the table. ALGORITHM=COPY and online log apply in LOCK=NONE will update PAGE_ROOT_AUTO_INC in real time. innodb_col_no(): Determine the dict_table_t::cols[] element index corresponding to a Field of a non-virtual column. (The MySQL 5.7 implementation of virtual columns breaks the 1:1 relationship between Field::field_index and dict_table_t::cols[]. Virtual columns are omitted from dict_table_t::cols[]. Therefore, we must translate the field_index of AUTO_INCREMENT columns into an index of dict_table_t::cols[].) Upgrade from old data files: By default, the AUTO_INCREMENT sequence in old data files would appear to be reset, because PAGE_MAX_TRX_ID or PAGE_ROOT_AUTO_INC would contain the value 0 in each clustered index page. In new data files, PAGE_ROOT_AUTO_INC can only be 0 if the table is empty or does not contain any AUTO_INCREMENT column. For backward compatibility, we use the old method of SELECT MAX(auto_increment_column) for initializing the sequence. btr_read_autoinc(): Read the AUTO_INCREMENT sequence from a new-format data file. btr_read_autoinc_with_fallback(): A variant of btr_read_autoinc() that will resort to reading MAX(auto_increment_column) for data files that did not use AUTO_INCREMENT yet. It was manually tested that during the execution of innodb.autoinc_persist the compatibility logic is not activated (for new files, PAGE_ROOT_AUTO_INC is never 0 in nonempty clustered index root pages). initialize_auto_increment(): Replaces ha_innobase::innobase_initialize_autoinc(). This initializes the AUTO_INCREMENT metadata. Only called from ha_innobase::open(). ha_innobase::info_low(): Do not try to lazily initialize dict_table_t::autoinc. It must already have been initialized by ha_innobase::open() or ha_innobase::create(). Note: The adjustments to class ha_innopart were not tested, because the source code (native InnoDB partitioning) is not being compiled.
9 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. 
This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. 
Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache.

dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted.

dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point).

dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns().

pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns().

row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns().

create_index_dict(): Replaces row_merge_create_index_graph().

innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs.

btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary.

dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables.

dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE.

dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant().

row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE.

commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.)

PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. (A bit-layout sketch appears below.)

page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.

page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION.

page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.

page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION.

rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields.

rec_offs_make_valid(): Add the parameter 'leaf'.

rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present.

dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple().

rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields.

cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records.

trx_t::in_rollback: Make the field available in non-debug builds.
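A sketch of the PAGE_INSTANT bit layout described above: 13 bits for n_core_fields packed above the 3 direction bits of the former 16-bit PAGE_DIRECTION field. The pack/unpack helpers are illustrative stand-ins for the real page accessors page_get_instant() and friends:

  #include <cassert>
  #include <cstdint>

  // Pack the 13-bit PAGE_INSTANT value and the 3-bit page direction into
  // the former 16-bit PAGE_DIRECTION field.
  static uint16_t pack_instant(unsigned n_core_fields, unsigned direction)
  {
    assert(n_core_fields < (1u << 13)); // 13 bits for PAGE_INSTANT
    assert(direction < (1u << 3));      // only 3 bits were ever used
    return uint16_t(n_core_fields << 3 | direction);
  }

  static unsigned unpack_instant(uint16_t field)   { return field >> 3; }
  static unsigned unpack_direction(uint16_t field) { return field & 7; }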
trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written.

btr_index_rec_validate(): Account for instant ADD COLUMN.

row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index.

row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case.

dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted.

btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed.

row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low().

rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them.

rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields.

btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'.

row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t.

rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'.

REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. (Sketched after this entry.)

rec_comp_status_t: An enum of the status bit values.

rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
8 years ago
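A sketch of how the REC_INFO_DEFAULT_ROW combination from the entry above could recognize the hidden 'default row' record, assuming the status occupies the low 3 bits of the combined info-and-status byte and REC_INFO_MIN_REC_FLAG is 0x10; the concrete numeric values are assumptions, not quoted from the source tree:

  #include <cstdint>

  // Status bit values, mirroring the rec_comp_status_t enum named above.
  enum rec_comp_status_t : uint8_t {
    REC_STATUS_ORDINARY      = 0,
    REC_STATUS_NODE_PTR      = 1,
    REC_STATUS_INFIMUM       = 2,
    REC_STATUS_SUPREMUM      = 3,
    REC_STATUS_COLUMNS_ADDED = 4
  };

  static const uint8_t REC_INFO_MIN_REC_FLAG = 0x10;
  static const uint8_t REC_INFO_DEFAULT_ROW  =
    REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED;

  // The 'default row' carries both the MIN_REC flag and the
  // COLUMNS_ADDED status.
  static bool is_default_row(uint8_t info_and_status_bits)
  {
    return (info_and_status_bits & REC_INFO_MIN_REC_FLAG)
        && (info_and_status_bits & 7) == REC_STATUS_COLUMNS_ADDED;
  }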
MDEV-6076 Persistent AUTO_INCREMENT for InnoDB

This should be functionally equivalent to WL#6204 in MySQL 8.0.0, with the notable difference that the file format changes are limited to repurposing a previously unused data field in B-tree pages.

For persistent InnoDB tables, write the last used AUTO_INCREMENT value to the root page of the clustered index, in the previously unused (0) PAGE_MAX_TRX_ID field, now aliased as PAGE_ROOT_AUTO_INC. Unlike some other previously unused InnoDB data fields, this one was actually always zero-initialized, at least since MySQL 3.23.49.

The writes to PAGE_ROOT_AUTO_INC are protected by SX or X latch on the root page. The SX latch will allow concurrent read access to the root page. (The field PAGE_ROOT_AUTO_INC will only be read on the first-time call to ha_innobase::open() from the SQL layer. The PAGE_ROOT_AUTO_INC can only be updated when executing SQL, so read/write races are not possible.)

During INSERT, the PAGE_ROOT_AUTO_INC is updated by the low-level function btr_cur_search_to_nth_level(), adding no extra page access. [Adaptive hash index lookup will be disabled during INSERT.]

If some rare UPDATE modifies an AUTO_INCREMENT column, the PAGE_ROOT_AUTO_INC will be adjusted in a separate mini-transaction in ha_innobase::update_row().

When a page is reorganized, we have to preserve the PAGE_ROOT_AUTO_INC field.

During ALTER TABLE, the initial AUTO_INCREMENT value will be copied from the table. ALGORITHM=COPY and online log apply in LOCK=NONE will update PAGE_ROOT_AUTO_INC in real time.

innodb_col_no(): Determine the dict_table_t::cols[] element index corresponding to a Field of a non-virtual column. (The MySQL 5.7 implementation of virtual columns breaks the 1:1 relationship between Field::field_index and dict_table_t::cols[]. Virtual columns are omitted from dict_table_t::cols[]. Therefore, we must translate the field_index of AUTO_INCREMENT columns into an index of dict_table_t::cols[].)

Upgrade from old data files: By default, the AUTO_INCREMENT sequence in old data files would appear to be reset, because PAGE_MAX_TRX_ID or PAGE_ROOT_AUTO_INC would contain the value 0 in each clustered index page. In new data files, PAGE_ROOT_AUTO_INC can only be 0 if the table is empty or does not contain any AUTO_INCREMENT column. For backward compatibility, we use the old method of SELECT MAX(auto_increment_column) for initializing the sequence. (See the sketch after this entry.)

btr_read_autoinc(): Read the AUTO_INCREMENT sequence from a new-format data file.

btr_read_autoinc_with_fallback(): A variant of btr_read_autoinc() that will resort to reading MAX(auto_increment_column) for data files that did not use AUTO_INCREMENT yet. It was manually tested that during the execution of innodb.autoinc_persist the compatibility logic is not activated (for new files, PAGE_ROOT_AUTO_INC is never 0 in nonempty clustered index root pages).

initialize_auto_increment(): Replaces ha_innobase::innobase_initialize_autoinc(). This initializes the AUTO_INCREMENT metadata. Only called from ha_innobase::open().

ha_innobase::info_low(): Do not try to lazily initialize dict_table_t::autoinc. It must already have been initialized by ha_innobase::open() or ha_innobase::create().

Note: The adjustments to class ha_innopart were not tested, because the source code (native InnoDB partitioning) is not being compiled.
9 years ago
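A sketch of the upgrade decision described above, under the assumption that it reduces to "use PAGE_ROOT_AUTO_INC unless it is 0, else fall back to MAX(auto_increment_column)"; initial_autoinc() is a hypothetical stand-in for btr_read_autoinc_with_fallback():

  #include <cstdint>

  // page_root_autoinc: value of PAGE_ROOT_AUTO_INC from the clustered index
  //   root page (always 0 in old-format files, because the field, formerly
  //   PAGE_MAX_TRX_ID, was zero-initialized).
  // max_column_value: result of SELECT MAX(auto_increment_column), the
  //   legacy initialization method.
  uint64_t initial_autoinc(uint64_t page_root_autoinc,
                           uint64_t max_column_value)
  {
    // In new-format files, PAGE_ROOT_AUTO_INC is 0 only if the table is
    // empty or has no AUTO_INCREMENT column; the fallback is then harmless.
    return page_root_autoinc ? page_root_autoinc : max_column_value;
  }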
Merge 10.1 into 10.2

This only merges MDEV-12253, adapting it to MDEV-12602, which is already present in 10.2 but not yet in the 10.1 revision that is being merged.

TODO: Error handling in crash recovery needs to be improved. If a page cannot be decrypted (or read), we should cleanly abort the startup. If innodb_force_recovery is specified, we should ignore the problematic page and apply redo log to other pages. Currently, the test encryption.innodb-redo-badkey randomly fails like this (the last messages are from cmake -DWITH_ASAN):

2017-05-05 10:19:40 140037071685504 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1635994
2017-05-05 10:19:40 140037071685504 [ERROR] InnoDB: Missing MLOG_FILE_NAME or MLOG_FILE_DELETE before MLOG_CHECKPOINT for tablespace 1
2017-05-05 10:19:40 140037071685504 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[2201] with error Data structure corruption
2017-05-05 10:19:41 140037071685504 [Note] InnoDB: Starting shutdown...
=================================================================
==5226==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x612000018588 in thread T0
    #0 0x736750 in operator delete(void*) (/mariadb/server/build/sql/mysqld+0x736750)
    #1 0x1e4833f in LatchCounter::~LatchCounter() /mariadb/server/storage/innobase/include/sync0types.h:599:4
    #2 0x1e480b8 in LatchMeta<LatchCounter>::~LatchMeta() /mariadb/server/storage/innobase/include/sync0types.h:786:17
    #3 0x1e35509 in sync_latch_meta_destroy() /mariadb/server/storage/innobase/sync/sync0debug.cc:1622:3
    #4 0x1e35314 in sync_check_close() /mariadb/server/storage/innobase/sync/sync0debug.cc:1839:2
    #5 0x1dfdc18 in innodb_shutdown() /mariadb/server/storage/innobase/srv/srv0start.cc:2888:2
    #6 0x197e5e6 in innobase_init(void*) /mariadb/server/storage/innobase/handler/ha_innodb.cc:4475:3
9 years ago
MDEV-12253: Buffer pool blocks are accessed after they have been freed

The problem was that bpage was referenced after it was already freed from the LRU. Fixed by adding a new variable encrypted that is passed down to buf_page_check_corrupt() and used in buf_page_get_gen() to stop processing the page read.

This patch should also address the following test failures and bugs:

MDEV-12419: IMPORT should not look up tablespace in PageConverter::validate(). This is now removed.
MDEV-10099: encryption.innodb_onlinealter_encryption fails sporadically in buildbot
MDEV-11420: encryption.innodb_encryption-page-compression failed in buildbot
MDEV-11222: encryption.encrypt_and_grep failed in buildbot on P8

Removed dict_table_t::is_encrypted and dict_table_t::ibd_file_missing and replaced these with dict_table_t::file_unreadable. The table's ibd file is missing if fil_get_space(space_id) returns NULL, and the file is treated as encrypted if not. Removed the dict_table_t::is_corrupted field.

Ported the FilSpace class from 10.2 and used it in buf_page_check_corrupt(), buf_page_decrypt_after_read(), buf_page_encrypt_before_write(), buf_dblwr_process(), buf_read_page(), dict_stats_save_defrag_stats().

Added test cases where an encrypted page could be read while doing redo log crash recovery. Also added a test case for row compressed blobs.

btr_cur_open_at_index_side_func(), btr_cur_open_at_rnd_pos_func(): Avoid referencing a block that is NULL.

buf_page_get_zip(): Issue an error if the page read fails.

buf_page_get_gen(): Use dberr_t for error detection and do not reference bpage after we have freed it.

buf_mark_space_corrupt(): Remove bpage from the LRU also when it is encrypted.

buf_page_check_corrupt(): @return DB_SUCCESS if the page has been read and is not corrupted; DB_PAGE_CORRUPTED if the page is corrupted based on the checksum check; DB_DECRYPTION_FAILED if the post-encryption checksum matches but the normal page checksum does not match after decryption. In the read case only DB_SUCCESS is possible. (Sketched after this entry.)

buf_page_io_complete(): Use dberr_t for error handling.

buf_flush_write_block_low(), buf_read_ahead_random(), buf_read_page_async(), buf_read_ahead_linear(), buf_read_ibuf_merge_pages(), buf_read_recv_pages(), fil_aio_wait(): Issue an error if the page read fails.

btr_pcur_move_to_next_page(): Do not reference the page if it is NULL.

Introduced dict_table_t::is_readable() and dict_index_t::is_readable() that will return true if the tablespace exists and pages read from the tablespace are not corrupted and page decryption has not failed.

Removed buf_page_t::key_version. After page decryption, the key version is not removed from the page frame. For unencrypted pages, the old key_version is removed at buf_page_encrypt_before_write().

dict_stats_update_transient_for_index(), dict_stats_update_transient(): Do not continue if table decryption failed or the table is corrupted.

dict0stats.cc: Introduced a dict_stats_report_error function to avoid code duplication.

fil_parse_write_crypt_data(): Check that the key read from the redo log entry is found in the encryption plugin and, if it is not, refuse to start.

PageConverter::validate(): Removed access to fil_space_t, as the tablespace is not available during import.

Fixed the error code on the innodb.innodb test.

Merged the test cases innodb-bad-key-change5 and innodb-bad-key-shutdown into innodb-bad-key-change2. Removed the innodb-bad-key-change5 test.

Decreased unnecessary complexity in some long-lasting tests.

Removed the fil_inc_pending_ops(), fil_decr_pending_ops(), fil_get_first_space(), fil_get_next_space(), fil_get_first_space_safe(), fil_get_next_space_safe() functions.
fil_space_verify_crypt_checksum(): Fixed a bug found using ASAN where the FIL_PAGE_END_LSN_OLD_CHECKSUM field was incorrectly accessed for row compressed tables. Fixed an out-of-page-frame bug for row compressed tables in fil_space_verify_crypt_checksum() found using ASAN. An incorrect function was called for compressed tables.

Added new tests for discard, rename table and drop (we should allow them even when page decryption fails). ALTER TABLE ... RENAME is not allowed.

Added a test for restart with innodb-force-recovery=1 when a page read during redo recovery cannot be decrypted. Added a test for a corrupted table where both the page data and the FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION field are corrupted.

Adjusted the test case innodb_bug14147491 so that it no longer expects a crash; instead, the table is just mostly unusable.

fil0fil.h: fil_space_acquire_low is not a visible function; fil_space_acquire and fil_space_acquire_silent are inline functions. The FilSpace class uses fil_space_acquire_low directly.

recv_apply_hashed_log_recs() does not return anything.
9 years ago
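A sketch of the tri-state result of buf_page_check_corrupt() as documented above, modeled with hypothetical boolean inputs instead of the actual page-frame checks:

  enum dberr_t { DB_SUCCESS, DB_PAGE_CORRUPTED, DB_DECRYPTION_FAILED };

  dberr_t check_corrupt(bool was_encrypted,
                        bool post_encryption_checksum_ok,
                        bool plain_checksum_ok)
  {
    if (plain_checksum_ok)
      return DB_SUCCESS;           // page is fine (possibly after decryption)
    if (was_encrypted && post_encryption_checksum_ok)
      return DB_DECRYPTION_FAILED; // ciphertext intact, decryption failed
    return DB_PAGE_CORRUPTED;      // checksum mismatch: genuine corruption
  }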
MDEV-15662 Instant DROP COLUMN or changing the order of columns

Allow ADD COLUMN anywhere in a table, not only adding as the last column.

Allow instant DROP COLUMN and instant changing of the order of columns.

The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible.

Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of the mblob is indicated by setting the delete-mark flag in the metadata record.

The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position. (A layout sketch follows after this entry.)

Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields.

For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob.

If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from the wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags.

Limitations (future work):

MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
MDEV-17468 Avoid table rebuild on operations on generated columns
MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large

btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure.

dict_col_t::DROPPED: Magic value for dict_col_t::ind.

dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN".

dict_col_t::is_added(): Rename from dict_col_t::is_instant().
dtype_t::metadata_blob_init(): Initialize the mblob data type.

dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record.

dict_table_t::instant: Metadata about dropped or reordered columns.

dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called.

dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well.

dict_table_t::find(): Find an old column based on a new column number.

dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob.

dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns().

dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns().

dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position.

ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.)

instant_alter_column_possible(): Add a parameter for the InnoDB table, to check for additional conditions, such as the maximum number of index fields.

ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns.

innobase_build_col_map(): Skip added virtual columns.

prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later.

commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try().

innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table.

innodb_update_cols(): Renamed from innodb_update_n_cols().

innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error.

innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col().

innobase_instant_add_col(): Replace the parameter dfield with type.

innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL.

innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols().

innobase_rename_column_try(): Skip dropped columns.

commit_cache_norebuild(): Update table->fts->doc_col.

dict_mem_table_col_rename_low(): Skip dropped columns.

trx_undo_rec_get_partial_row(): Skip dropped columns.

trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly.
trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as a delete-mark and an insert.

row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata.

row_undo_mod_clust_low(): On metadata rollback, roll back the root page too.

row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records.

row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback).

row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column.

row_upd_clust_rec_by_insert_inherit_func(): On the update of the PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value.

btr_index_rec_validate(): Recognize the metadata record.

btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata.

btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to the generic ALTER TABLE format.

btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it.

btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains.

btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways.

btr_cur_trim(): Handle the mblob record.

row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record.

btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records.

btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size.

btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE.

REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only.

REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns.
rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record.

rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold.

rec_is_add_metadata(): Check if a record is the MDEV-11369 ADD COLUMN metadata record (not the more generic instant ALTER TABLE).

rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet.

rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple.

rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. With mblob=true, process a record with a metadata BLOB.

rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field.

rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record.

row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple.

row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted.

row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column.

Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this. Thanks to Matthias Leich for testing.
7 years ago
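A sketch of one possible layout for the metadata BLOB described in the entry above (field count first, then per-field info); the struct and byte packing are assumptions for illustration, not the format actually produced by dict_table_t::serialise_columns():

  #include <cstdint>
  #include <vector>

  struct field_info {
    bool     dropped;    // was the column instantly dropped?
    bool     not_null;   // dropped columns: NOT NULL flag
    bool     len_gt_255; // dropped variable-length columns: max length > 255
    uint16_t fixed_len;  // dropped columns: fixed length (0 = variable)
    uint16_t position;   // surviving columns: position in the table
  };

  std::vector<uint8_t> serialise(const std::vector<field_info>& fields)
  {
    std::vector<uint8_t> blob;
    // Number of clustered index fields first (2 bytes, big-endian here).
    blob.push_back(uint8_t(fields.size() >> 8));
    blob.push_back(uint8_t(fields.size()));
    for (const field_info& f : fields) {
      if (f.dropped) {
        // Dropped column: tag byte with flags, then the fixed length.
        blob.push_back(uint8_t(0x80 | (f.not_null ? 1 : 0)
                                    | (f.len_gt_255 ? 2 : 0)));
        blob.push_back(uint8_t(f.fixed_len >> 8));
        blob.push_back(uint8_t(f.fixed_len));
      } else {
        // Surviving column: zero tag byte, then its position.
        blob.push_back(0);
        blob.push_back(uint8_t(f.position >> 8));
        blob.push_back(uint8_t(f.position));
      }
    }
    return blob;
  }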
MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. 
This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. 
Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. 
This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written.

btr_index_rec_validate(): Account for instant ADD COLUMN.

row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index.

row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case.

dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them (see the sketch below). While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted.

btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed.

row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low().

rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them.

rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields.

btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'.

row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t.

rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'.

REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record.

rec_comp_status_t: An enum of the status bit values.

rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
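The trimming rule of dtuple_t::trim() and btr_cur_trim() can be sketched as follows. The types are hypothetical simplifications of dtuple_t/dfield_t, and the sketch assumes field values can be compared directly against the 'default row' values (the real code consults dict_col_t::def_val):

#include <cstddef>
#include <string>
#include <vector>

struct Field { std::string value; };      // hypothetical stand-in for dfield_t

// Drop a trailing run of fields whose values match the 'default row',
// but never trim below the number of 'core' fields. Assumes default_row
// has at least as many fields as entry.
void trim_entry(std::vector<Field>& entry,
                const std::vector<Field>& default_row,
                std::size_t n_core_fields)
{
  std::size_t n = entry.size();
  while (n > n_core_fields && entry[n - 1].value == default_row[n - 1].value)
    --n;                                  // the field is redundant
  entry.resize(n);                        // cf. dtuple_t::trim(index)
}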
8 years ago
MDEV-15662 Instant DROP COLUMN or changing the order of columns

Allow ADD COLUMN anywhere in a table, not only adding as the last column. Allow instant DROP COLUMN and instant changing the order of columns.

The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible.

Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of mblob is indicated by setting the delete-mark flag in the metadata record.

The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position. (This layout is sketched below.)

Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields.

For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob.

If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from the wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags.

Limitations (future work):

MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
MDEV-17468 Avoid table rebuild on operations on generated columns
MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large

btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure.

dict_col_t::DROPPED: Magic value for dict_col_t::ind.

dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN".

dict_col_t::is_added(): Rename from dict_col_t::is_instant().
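A hedged sketch of the information content of the metadata BLOB described above. The real on-disk encoding differs in its details; the field widths and flag bits below are assumptions chosen only to make the structure concrete:

#include <cstdint>
#include <vector>

// One entry per clustered index field, as described in the message above.
struct ColumnInfo {
  bool     dropped;
  // for dropped columns:
  bool     not_null;
  uint16_t fixed_len;        // 0 for variable-length columns
  bool     max_len_gt_255;   // variable-length and longer than 255 bytes?
  // for surviving columns:
  uint16_t position;         // column position in the current definition
};

// cf. dict_table_t::serialise_columns(); big-endian integers as is
// customary in InnoDB. The inverse, deserialise_columns(), would walk
// the same layout to reconstruct the index fields.
std::vector<uint8_t> serialise_columns(const std::vector<ColumnInfo>& cols)
{
  std::vector<uint8_t> blob;
  blob.push_back(uint8_t(cols.size() >> 8)); // number of index fields
  blob.push_back(uint8_t(cols.size()));
  for (const ColumnInfo& c : cols) {
    if (c.dropped) {
      blob.push_back(0x80 | (c.not_null ? 1 : 0) | (c.max_len_gt_255 ? 2 : 0));
      blob.push_back(uint8_t(c.fixed_len >> 8));
      blob.push_back(uint8_t(c.fixed_len));
    } else {
      blob.push_back(0);                     // marker: surviving column
      blob.push_back(uint8_t(c.position >> 8));
      blob.push_back(uint8_t(c.position));
    }
  }
  return blob;
}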
dtype_t::metadata_blob_init(): Initialize the mblob data type.

dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record.

dict_table_t::instant: Metadata about dropped or reordered columns.

dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called.

dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well.

dict_table_t::find(): Find an old column based on a new column number.

dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob.

dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns().

dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns().

dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position (sketched below).

ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.)

instant_alter_column_possible(): Add a parameter for InnoDB table, to check for additional conditions, such as the maximum number of index fields.

ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns.

innobase_build_col_map(): Skip added virtual columns.

prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later.

commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try().

innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table.

innodb_update_cols(): Renamed from innodb_update_n_cols().

innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error.

innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col().

innobase_instant_add_col(): Replace the parameter dfield with type.

innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL.

innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols().

innobase_rename_column_try(): Skip dropped columns.

commit_cache_norebuild(): Update table->fts->doc_col.

dict_mem_table_col_rename_low(): Skip dropped columns.

trx_undo_rec_get_partial_row(): Skip dropped columns.

trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly.
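The field reordering performed by dict_index_t::clear_instant_alter() amounts to a stable sort over the index fields. A minimal sketch with hypothetical types:

#include <algorithm>
#include <vector>

struct IndexField { unsigned col_pos; bool dropped; };

// Surviving fields are ordered by ascending column position; fields of
// dropped columns are moved to the end, keeping their relative order.
void clear_instant_alter(std::vector<IndexField>& fields)
{
  std::stable_sort(fields.begin(), fields.end(),
                   [](const IndexField& a, const IndexField& b) {
                     if (a.dropped != b.dropped)
                       return !a.dropped;           // dropped fields last
                     return !a.dropped && a.col_pos < b.col_pos;
                   });
}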
trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as delete-mark and an insert.

row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata.

row_undo_mod_clust_low(): On metadata rollback, roll back the root page too.

row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records.

row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback).

row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column (sketched below).

row_upd_clust_rec_by_insert_inherit_func(): On the update of PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value.

btr_index_rec_validate(): Recognize the metadata record.

btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata.

btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to generic ALTER TABLE format.

btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it.

btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains.

btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways.

btr_cur_trim(): Handle the mblob record.

row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record.

btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records.

btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size.

btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE.

REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only.

REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns.
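The dummy-value rule for instantly dropped columns in row_build_index_entry_low() can be sketched like this, with a hypothetical minimal column type (the real code inspects the dtype of the dropped column):

#include <string>

struct DroppedCol { bool nullable; unsigned fixed_len; }; // fixed_len==0: variable

// Returns true when the dummy value is SQL NULL; otherwise fills 'out'
// with the empty string (variable-length) or fixed_len NUL bytes.
bool dummy_value(const DroppedCol& col, std::string& out)
{
  if (col.nullable)
    return true;                     // store NULL
  out.assign(col.fixed_len, '\0');   // "" if variable-length, else NUL bytes
  return false;
}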
rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record.

rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold.

rec_is_add_metadata(): Check if a record is an MDEV-11369 ADD COLUMN metadata record (not the more generic instant ALTER TABLE). Both checks are sketched below.

rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet.

rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple.

rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. With mblob=true, process a record with a metadata BLOB.

rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field.

rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record.

row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple.

row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted.

row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column.

Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this. Thanks to Matthias Leich for testing.
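How the two metadata record flavours can be told apart by info bits alone is sketched below, using the conventional InnoDB values of REC_INFO_MIN_REC_FLAG and the delete-mark flag. Note that the real rec_is_add_metadata() and rec_is_alter_metadata() also check the record status bits, since an ordinary node pointer record may carry REC_INFO_MIN_REC_FLAG as well:

#include <cstdint>

constexpr uint8_t MIN_REC_FLAG = 0x10;   // cf. REC_INFO_MIN_REC_FLAG
constexpr uint8_t DELETED_FLAG = 0x20;   // cf. REC_INFO_DELETED_FLAG

// MDEV-11369 ADD COLUMN metadata: min-rec flag set, not delete-marked.
constexpr bool is_add_metadata(uint8_t info_bits)
{
  return (info_bits & (MIN_REC_FLAG | DELETED_FLAG)) == MIN_REC_FLAG;
}

// Generic ALTER TABLE metadata: the delete-mark flag signals the mblob.
constexpr bool is_alter_metadata(uint8_t info_bits)
{
  return (info_bits & (MIN_REC_FLAG | DELETED_FLAG))
         == (MIN_REC_FLAG | DELETED_FLAG);
}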
7 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. 
This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. 
Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. 
This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written. btr_index_rec_validate(): Account for instant ADD COLUMN row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them. rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields. btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'. row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t. rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'. REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. rec_comp_status_t: An enum of the status bit values. rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
8 years ago
MDEV-15662 Instant DROP COLUMN or changing the order of columns Allow ADD COLUMN anywhere in a table, not only adding as the last column. Allow instant DROP COLUMN and instant changing the order of columns. The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible. Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of mblob is indicated by setting the delete-mark flag in the metadata record. The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position. Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields. For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob. If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags. Limitations (future work): MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists MDEV-17468 Avoid table rebuild on operations on generated columns MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure. dict_col_t::DROPPED: Magic value for dict_col_t::ind. dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN". dict_col_t::is_added(): Rename from dict_col_t::is_instant(). 
dtype_t::metadata_blob_init(): Initialize the mblob data type. dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record. dict_table_t::instant: Metadata about dropped or reordered columns. dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called. dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well. dict_table_t::find(): Find an old column based on a new column number. dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob. dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns(). dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns(). dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position. ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.) instant_alter_column_possible(): Add a parameter for InnoDB table, to check for additional conditions, such as the maximum number of index fields. ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns. innobase_build_col_map(): Skip added virtual columns. prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later. commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try(). innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table. innodb_update_cols(): Renamed from innodb_update_n_cols(). innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error. innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col(). innobase_instant_add_col(): Replace the parameter dfield with type. innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL. innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols(). innobase_rename_column_try(): Skip dropped columns. commit_cache_norebuild(): Update table->fts->doc_col. dict_mem_table_col_rename_low(): Skip dropped columns. trx_undo_rec_get_partial_row(): Skip dropped columns. trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly. 
trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as delete-mark and an insert. row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata. row_undo_mod_clust_low(): On metadata rollback, roll back the root page too. row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records. row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback). row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column. row_upd_clust_rec_by_insert_inherit_func(): On the update of PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value. btr_index_rec_validate(): Recognize the metadata record. btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata. btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to generic ALTER TABLE format. btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it. btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains. btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways. btr_cur_trim(): Handle the mblob record. row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record. btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records. btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size. btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE. REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only. REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns. 
rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record. rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold. rec_is_add_metadata(): Check if a record is MDEV-11369 ADD COLUMN metadata record (not more generic instant ALTER TABLE). rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet. rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple. rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. With mblob=true, process a record with a metadata BLOB. rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field. rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record. row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple. row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted. row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column. Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this. Thanks to Matthias Leich for testing.
7 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. 
This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. 
Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. 
This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written. btr_index_rec_validate(): Account for instant ADD COLUMN row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them. rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields. btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'. row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t. rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'. REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. rec_comp_status_t: An enum of the status bit values. rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
8 years ago
MDEV-15662 Instant DROP COLUMN or changing the order of columns

Allow ADD COLUMN anywhere in a table, not only adding as the last column. Allow instant DROP COLUMN and instant changing of the order of columns.

The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible.

Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of the mblob is indicated by setting the delete-mark flag in the metadata record.

The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position. (A sketch of such an encoding follows this message.)

Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields.

For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob.

If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from the wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags.

Limitations (future work):
MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
MDEV-17468 Avoid table rebuild on operations on generated columns
MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large

btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure.

dict_col_t::DROPPED: Magic value for dict_col_t::ind.

dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN".

dict_col_t::is_added(): Rename from dict_col_t::is_instant().

dtype_t::metadata_blob_init(): Initialize the mblob data type.

dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record.

dict_table_t::instant: Metadata about dropped or reordered columns.

dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called.

dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well.

dict_table_t::find(): Find an old column based on a new column number.

dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob.

dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns().

dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns().

dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position.

ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.)

instant_alter_column_possible(): Add a parameter for the InnoDB table, to check for additional conditions, such as the maximum number of index fields.

ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns.

innobase_build_col_map(): Skip added virtual columns.

prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later.

commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try().

innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table.

innodb_update_cols(): Renamed from innodb_update_n_cols().

innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error.

innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col().

innobase_instant_add_col(): Replace the parameter dfield with type.

innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL.

innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols().

innobase_rename_column_try(): Skip dropped columns.

commit_cache_norebuild(): Update table->fts->doc_col.

dict_mem_table_col_rename_low(): Skip dropped columns.

trx_undo_rec_get_partial_row(): Skip dropped columns.

trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly.

trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as a delete-mark and an insert.

row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata.

row_undo_mod_clust_low(): On metadata rollback, roll back the root page too.

row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records.

row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback).

row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column.

row_upd_clust_rec_by_insert_inherit_func(): On the update of the PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value.

btr_index_rec_validate(): Recognize the metadata record.

btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata.

btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to the generic ALTER TABLE format.

btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it.

btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains.

btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways.

btr_cur_trim(): Handle the mblob record.

row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record.

btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records.

btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size.

btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE.

REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only.

REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns.

rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record.

rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold.

rec_is_add_metadata(): Check if a record is the MDEV-11369 ADD COLUMN metadata record (not the more generic instant ALTER TABLE).

rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet.

rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple.

rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. With mblob=true, process a record with a metadata BLOB.

rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field.

rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record.

row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple.

row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted.

row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column.

Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this. Thanks to Matthias Leich for testing.
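To make the metadata BLOB layout described above concrete, here is a minimal C++ sketch of serialising the per-field column information. The byte order, field widths and bit assignments are assumptions chosen purely for illustration; the commit only specifies what is stored, and the actual encoding lives in dict_table_t::serialise_columns().

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct col_info {
      bool     dropped;        // true if the column was instantly dropped
      bool     not_null;       // dropped columns: NOT NULL flag
      uint16_t fixed_len;      // dropped columns: fixed length, 0 if variable
      bool     max_len_gt_255; // dropped variable-length: max length > 255 bytes
      uint16_t col_pos;        // surviving columns: column position
    };

    std::vector<uint8_t> serialise_columns(const std::vector<col_info>& fields) {
      std::vector<uint8_t> blob;
      auto put16 = [&](uint16_t v) {     // big-endian, as InnoDB favours
        blob.push_back(uint8_t(v >> 8));
        blob.push_back(uint8_t(v));
      };
      put16(uint16_t(fields.size()));    // number of clustered index fields
      for (const col_info& c : fields) {
        if (!c.dropped) {
          put16(c.col_pos);              // non-dropped: just the column position
          continue;
        }
        // Dropped column: pack the flags and the fixed length into one word.
        // The bit positions here are hypothetical.
        uint16_t v = uint16_t(1u << 15); // "dropped" marker
        if (c.not_null)       v |= uint16_t(1u << 14);
        if (c.max_len_gt_255) v |= uint16_t(1u << 13);
        v |= uint16_t(c.fixed_len & 0x1fffu);
        put16(v);
      }
      return blob;
    }

    int main() {
      // One surviving column at position 2; one dropped NOT NULL column
      // of fixed length 4.
      std::vector<col_info> f = {{false, false, 0, false, 2},
                                 {true,  true,  4, false, 0}};
      std::printf("mblob size: %zu bytes\n", serialise_columns(f).size());
    }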
MDEV-11369 Instant ADD COLUMN for InnoDB

For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables.

This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.

Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE

The format of the clustered index (PRIMARY KEY) is changed as follows:

(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX.

(2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked.

(3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert.

(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) A sketch of this variable-length encoding follows this message.

We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC.

This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively.

When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index.

UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record.

len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.

dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.

dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value().

dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column.

dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name().

dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields.

dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).

dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page.

dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value().

dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN.

dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back.

dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant().

dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN.

dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN.

prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. Dictionary objects will only be added to the cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to the cache.

dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted.

dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point).

dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns().

pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns().

row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns().

create_index_dict(): Replaces row_merge_create_index_graph().

innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs.

btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary.

dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables.

dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE.

dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant().

row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE.

commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.)

PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.

page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.

page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION.

page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.

page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION.

rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields.

rec_offs_make_valid(): Add the parameter 'leaf'.

rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present.

dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple().

rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields.

cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records.

trx_t::in_rollback: Make the field available in non-debug builds.

trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written.

btr_index_rec_validate(): Account for instant ADD COLUMN.

row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index.

row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case.

dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted.

btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed.

row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low().

rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them.

rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields.

btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'.

row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t.

rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'.

REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record.

rec_comp_status_t: An enum of the status bit values.

rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
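As a concrete illustration of item (4) above, here is a minimal sketch of encoding n_fields - n_core_fields - 1 in one or two bytes. The high-bit scheme shown is an assumption chosen for illustration; the actual header layout is defined by InnoDB's REC_STATUS_COLUMNS_ADDED record header code.

    #include <cassert>
    #include <cstdint>
    #include <cstdio>
    #include <initializer_list>

    // Encode v = n_fields - n_core_fields - 1 into buf.
    // Values 0..127 take one byte; larger values take two bytes, with the
    // high bit of the first byte flagging the two-byte form (assumed layout).
    // Returns the number of bytes written.
    unsigned encode_n_add_field(uint8_t* buf, unsigned v) {
      if (v < 0x80) {
        buf[0] = uint8_t(v);
        return 1;
      }
      assert(v < 0x8000);            // 15 bits available in this scheme
      buf[0] = uint8_t(0x80 | (v >> 8));
      buf[1] = uint8_t(v);
      return 2;
    }

    // Decode; returns the number of bytes consumed, stores the value in *v.
    unsigned decode_n_add_field(const uint8_t* buf, unsigned* v) {
      if (!(buf[0] & 0x80)) {
        *v = buf[0];
        return 1;
      }
      *v = (unsigned(buf[0] & 0x7f) << 8) | buf[1];
      return 2;
    }

    int main() {
      uint8_t buf[2];
      for (unsigned v : {0u, 5u, 127u, 128u, 1000u}) {
        unsigned out, n = encode_n_add_field(buf, v);
        assert(decode_n_add_field(buf, &out) == n && out == v);
        std::printf("%u -> %u byte(s)\n", v, n);
      }
    }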
MDEV-12253: Buffer pool blocks are accessed after they have been freed

The problem was that bpage was referenced after it was already freed from the LRU. Fixed by adding a new variable "encrypted" that is passed down to buf_page_check_corrupt() and used in buf_page_get_gen() to stop processing the page read.

This patch should also address the following test failures and bugs:
MDEV-12419: IMPORT should not look up the tablespace in PageConverter::validate(). This is now removed.
MDEV-10099: encryption.innodb_onlinealter_encryption fails sporadically in buildbot
MDEV-11420: encryption.innodb_encryption-page-compression failed in buildbot
MDEV-11222: encryption.encrypt_and_grep failed in buildbot on P8

Removed dict_table_t::is_encrypted and dict_table_t::ibd_file_missing and replaced these with dict_table_t::file_unreadable. The table's ibd file is missing if fil_get_space(space_id) returns NULL, and encrypted if not. Removed the dict_table_t::is_corrupted field.

Ported the FilSpace class from 10.2 and used it in buf_page_check_corrupt(), buf_page_decrypt_after_read(), buf_page_encrypt_before_write(), buf_dblwr_process(), buf_read_page(), dict_stats_save_defrag_stats().

Added test cases where an encrypted page could be read while doing redo log crash recovery. Also added a test case for row-compressed BLOBs.

btr_cur_open_at_index_side_func(), btr_cur_open_at_rnd_pos_func(): Avoid referencing a block that is NULL.

buf_page_get_zip(): Issue an error if the page read fails.

buf_page_get_gen(): Use dberr_t for error detection, and do not reference bpage after we have freed it.

buf_mark_space_corrupt(): Remove bpage from the LRU also when it is encrypted.

buf_page_check_corrupt(): @return DB_SUCCESS if the page has been read and is not corrupted; DB_PAGE_CORRUPTED if the page is corrupted based on the checksum check; DB_DECRYPTION_FAILED if the post-encryption checksum matches but the normal page checksum does not match after decryption. In the read case only DB_SUCCESS is possible. (A caller-side dispatch sketch follows this message.)

buf_page_io_complete(): Use dberr_t for error handling.

buf_flush_write_block_low(), buf_read_ahead_random(), buf_read_page_async(), buf_read_ahead_linear(), buf_read_ibuf_merge_pages(), buf_read_recv_pages(), fil_aio_wait(): Issue an error if the page read fails.

btr_pcur_move_to_next_page(): Do not reference the page if it is NULL.

Introduced dict_table_t::is_readable() and dict_index_t::is_readable(), which will return true if the tablespace exists and the pages read from the tablespace are not corrupted and page decryption did not fail.

Removed buf_page_t::key_version. After page decryption, the key version is not removed from the page frame. For unencrypted pages, the old key_version is removed at buf_page_encrypt_before_write().

dict_stats_update_transient_for_index(), dict_stats_update_transient(): Do not continue if table decryption failed or the table is corrupted.

dict0stats.cc: Introduced a dict_stats_report_error() function to avoid code duplication.

fil_parse_write_crypt_data(): Check that the key read from the redo log entry is found in the encryption plugin, and if it is not, refuse to start.

PageConverter::validate(): Removed access to fil_space_t, as the tablespace is not available during import.

Fixed the error code in the innodb.innodb test. Merged the test cases innodb-bad-key-change5 and innodb-bad-key-shutdown into innodb-bad-key-change2; removed the innodb-bad-key-change5 test. Reduced unnecessary complexity in some long-running tests.

Removed the functions fil_inc_pending_ops(), fil_decr_pending_ops(), fil_get_first_space(), fil_get_next_space(), fil_get_first_space_safe(), fil_get_next_space_safe().

fil_space_verify_crypt_checksum(): Fixed a bug, found using ASAN, where the FIL_PAGE_END_LSN_OLD_CHECKSUM field was incorrectly accessed in row-compressed tables. Fixed an out-of-page-frame bug for row-compressed tables in fil_space_verify_crypt_checksum(), also found using ASAN; an incorrect function was called for compressed tables.

Added new tests for discard, rename table and drop (we should allow them even when page decryption fails). ALTER TABLE ... RENAME is not allowed. Added a test for restart with innodb-force-recovery=1 when a page read during redo recovery cannot be decrypted. Added a test for a corrupted table where both the page data and FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION are corrupted.

Adjusted the test case innodb_bug14147491 so that it no longer expects a crash; instead, the table is just mostly unusable.

fil0fil.h: fil_space_acquire_low() is not a visible function, and fil_space_acquire() and fil_space_acquire_silent() are inline functions. The FilSpace class uses fil_space_acquire_low() directly.

recv_apply_hashed_log_recs() does not return anything.
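To make the buf_page_check_corrupt() contract above concrete, here is a minimal caller-side dispatch sketch in the spirit of buf_page_get_gen(). The enum values mirror the commit message; handle_page_read() and its policy are illustrative assumptions, not the actual InnoDB code.

    #include <cstdio>

    // Status codes as described above (simplified stand-in for dberr_t).
    enum dberr_t { DB_SUCCESS, DB_PAGE_CORRUPTED, DB_DECRYPTION_FAILED };

    // Hypothetical caller-side handling: once the check reports a failure,
    // stop processing the page read and do not touch the block again.
    bool handle_page_read(dberr_t err) {
      switch (err) {
      case DB_SUCCESS:
        return true;   // page was read and its checksum verified
      case DB_PAGE_CORRUPTED:
        std::fprintf(stderr, "page checksum mismatch: page is corrupted\n");
        return false;  // stop; the block may already be freed from the LRU
      case DB_DECRYPTION_FAILED:
        std::fprintf(stderr, "post-encryption checksum matched but decryption"
                             " failed (wrong or missing key?)\n");
        return false;
      }
      return false;
    }

    int main() {
      handle_page_read(DB_SUCCESS);
      handle_page_read(DB_DECRYPTION_FAILED);
    }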
MDEV-11415 Remove excessive undo logging during ALTER TABLE…ALGORITHM=COPY

If a crash occurs during ALTER TABLE…ALGORITHM=COPY, InnoDB would spend a lot of time rolling back writes to the intermediate copy of the table. To reduce the amount of busy work done, a work-around was introduced in commit fd069e2bb36a3c1c1f26d65dd298b07e6d83ac8b in MySQL 4.1.8 and 5.0.2, to commit the transaction after every 10,000 inserted rows.

A proper fix would have been to disable the undo logging altogether and to simply drop the intermediate copy of the table on subsequent server startup. This is what happens in MariaDB 10.3 with MDEV-14717, MDEV-14585. In MariaDB 10.2, the intermediate copy of the table would be left behind with a name starting with the string #sql. (A sketch of this behavioural change follows this message.)

This is a backport of a bug fix from MySQL 8.0.0 to MariaDB, contributed by jixianliang <271365745@qq.com>.

Unlike recent MySQL, MariaDB supports ALTER IGNORE. For that operation, InnoDB must for now keep the undo logging enabled, so that the latest row can be rolled back in case of an error. In a Galera cluster, the LOAD DATA statement will retain the existing behaviour and commit the transaction after every 10,000 rows if the parameter wsrep_load_data_splitting=ON is set. The logic to do so (the wsrep_load_data_split() function and the call handler::extra(HA_EXTRA_FAKE_START_STMT)) is joint work by Ji Xianliang and Marko Mäkelä.

The original fix:

Author: Thirunarayanan Balathandayuthapani <thirunarayanan.balathandayuth@oracle.com>
Date: Wed Dec 2 16:09:15 2015 +0530

Bug#17479594 AVOID INTERMEDIATE COMMIT WHILE DOING ALTER TABLE ALGORITHM=COPY

Problem: During ALTER TABLE, we commit and restart the transaction for every 10,000 rows, so that the rollback after recovery would not take so long.

Fix: Suppress the undo logging during the copy alter operation. If an fts_index is present, then insert directly into the fts auxiliary table rather than doing it at commit time.

ha_innobase::num_write_row: Remove the variable.

ha_innobase::write_row(): Remove the hack for committing every 10,000 rows.

row_lock_table_for_mysql(): Remove the two extra parameters.

lock_get_src_table(), lock_is_table_exclusive(): Remove.

Reviewed-by: Marko Mäkelä <marko.makela@oracle.com>
Reviewed-by: Shaohua Wang <shaohua.wang@oracle.com>
Reviewed-by: Jon Olav Hauglid <jon.hauglid@oracle.com>
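A minimal sketch contrasting the removed per-10,000-row intermediate commit with suppressed undo logging. The names copy_ctx, commit_and_restart() and the skip_undo_logging flag are hypothetical illustrations, not the actual ha_innobase members.

    #include <cstdio>

    // Hypothetical state for the ALTER TABLE copy phase.
    struct copy_ctx {
      bool     skip_undo_logging;  // the fix: no undo records for the copy
      unsigned rows_inserted;
    };

    void commit_and_restart() { /* placeholder for an intermediate commit */ }

    void write_row(copy_ctx& ctx) {
      ++ctx.rows_inserted;
      if (!ctx.skip_undo_logging && ctx.rows_inserted % 10000 == 0) {
        // Old work-around (now removed): commit every 10,000 rows so that
        // a crash would not trigger a long rollback of the copied table.
        commit_and_restart();
      }
      // With undo logging suppressed, a crash instead leaves behind an
      // intermediate #sql table that is dropped on the next server startup.
    }

    int main() {
      copy_ctx ctx{true, 0};       // ALGORITHM=COPY without ALTER IGNORE
      for (int i = 0; i < 25000; i++) write_row(ctx);
      std::printf("inserted %u rows without intermediate commits\n",
                  ctx.rows_inserted);
    }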
8 years ago
MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. (If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. 
This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field are changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively.

When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index.

UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record.

len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.

dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.

dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value().

dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column.

dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name().

dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields.

dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).

dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page.

dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value().

dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN.

dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back.

dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant().

dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN.

dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN.

prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation.
Dictionary objects will only be added to the cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to the cache.

dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and, if needed, set can_be_evicted.

dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point).

dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns().

pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns().

row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns().

create_index_dict(): Replaces row_merge_create_index_graph().

innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs.

btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary.

dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables.

dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE.

dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant().

row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE.

commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.)

PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used (see the bit-layout sketch below). The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.

page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.

page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION.

page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.

page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION.

rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields.

rec_offs_make_valid(): Add the parameter 'leaf'.

rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present.

dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple().

rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields.

cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records.

trx_t::in_rollback: Make the field available in non-debug builds.

trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records.
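The repurposed word can be pictured as follows: the least significant 3 bits keep the direction, and the upper 13 bits store the number of 'core' fields. A minimal C++ sketch of such accessors, assuming this bit placement (the names and exact layout are illustrative, not the verbatim InnoDB code):

    #include <cassert>
    #include <cstdint>
    #include <cstdio>

    // The 16-bit word that used to hold only a 3-bit PAGE_DIRECTION value.
    // Low 3 bits: direction; upper 13 bits: PAGE_INSTANT (n_core_fields).
    static uint16_t page_set_instant(uint16_t word, uint16_t n_core_fields) {
        assert(n_core_fields < (1u << 13));   // PAGE_INSTANT is 13 bits wide
        return uint16_t((word & 7) | (n_core_fields << 3));
    }

    static uint16_t page_get_instant(uint16_t word)   { return word >> 3; }
    static uint16_t page_get_direction(uint16_t word) { return word & 7; }

    int main() {
        uint16_t word = 5;                    // some direction value, no instant info
        word = page_set_instant(word, 4);     // the index originally had 4 'core' fields
        assert(page_get_instant(word) == 4 && page_get_direction(word) == 5);
        printf("instant=%u direction=%u\n",
               unsigned(page_get_instant(word)), unsigned(page_get_direction(word)));
    }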
This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written.

btr_index_rec_validate(): Account for instant ADD COLUMN.

row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index.

row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case.

dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them (see the sketch below). While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted.

btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed.

row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low().

rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum, supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them.

rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields.

btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'.

row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t.

rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'.

REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record.

rec_comp_status_t: An enum of the status bit values.

rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().
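The trimming rule can be illustrated with a toy C++ model of dtuple trimming: trailing fields whose values match the 'default row' are dropped before insert, but the field count never shrinks below n_core_fields. The types are simplified stand-ins, not InnoDB's dtuple_t.

    #include <cassert>
    #include <string>
    #include <vector>

    // Return the number of fields to store: strip the longest suffix whose
    // values equal the 'default row' values, but keep at least n_core fields.
    static size_t trimmed_field_count(const std::vector<std::string>& tuple,
                                      const std::vector<std::string>& default_row,
                                      size_t n_core) {
        assert(tuple.size() == default_row.size() && n_core <= tuple.size());
        size_t n = tuple.size();
        while (n > n_core && tuple[n - 1] == default_row[n - 1]) --n;
        return n;
    }

    int main() {
        // PRIMARY KEY + one original column are 'core'; two columns added instantly.
        std::vector<std::string> defaults = {"", "", "d1", "d2"};
        std::vector<std::string> row      = {"1", "x", "v1", "d2"};
        assert(trimmed_field_count(row, defaults, 2) == 3);   // trailing "d2" is trimmed

        std::vector<std::string> all_default = {"2", "y", "d1", "d2"};
        assert(trimmed_field_count(all_default, defaults, 2) == 2);  // shrinks to 'core'
    }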
8 years ago
MDEV-20812 Unexpected ER_ROW_IS_REFERENCED_2 upon DELETE from versioned table with FK

The original MDEV-16210 case wrongly allowed a versioned DELETE from a referenced table where the reference is by a non-primary key. InnoDB UPDATE has an optimization for new rows that do not change their clustered index position: in this case InnoDB does not update all secondary indexes, and so it misses the one holding the referenced key. The fix was to disable this optimization for versioned DELETE. In the case of versioned DELETE we forcibly update all secondary indexes and therefore check them for constraints.

But the above fix raised another problem, with versioned DELETE on the foreign table side. When there was no corresponding record in the referenced table (an illegal foreign reference can be created with "set foreign_key_checks=off"), there was a spurious constraint check (because a versioned DELETE is actually an UPDATE), and hence the operation failed with a constraint error. MDEV-16210 tried to fix this by checking the foreign table instead of the referenced table, and that at least was wrong. The constraint check is done by row_ins_check_foreign_constraint() no matter which kind of table is checked, referenced or foreign (controlled by the check_ref argument). The referenced table is checked by row_upd_check_references_constraints(). The foreign table is checked by row_ins_check_foreign_constraints().

The current fix rolls back the wrong fix for the above problem and disables the referenced table check for DELETE on the foreign side by introducing a `check_foreign` argument which, when set to *false*, skips the row_ins_check_foreign_constraints() call.
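A toy C++ model of the decision this fix implements may help. A versioned DELETE is executed as an UPDATE that only delete-marks the row by setting its row_end timestamp, so the referencing key value cannot change and the constraint check can be skipped. All names below are hypothetical simplifications, not the actual InnoDB signatures.

    #include <cassert>

    // Hypothetical simplification of the update paths involved.
    enum class RowOp { Update, VersionedDelete };

    // Whether the foreign-key constraint check should run for this
    // operation. A plain UPDATE may change the referencing key and must be
    // checked; a versioned DELETE only sets row_end, so check_foreign is
    // false and the check is skipped.
    static bool check_foreign(RowOp op) {
        return op == RowOp::Update;
    }

    int main() {
        assert(check_foreign(RowOp::Update));
        assert(!check_foreign(RowOp::VersionedDelete));
    }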
6 years ago
MDEV-11415 Remove excessive undo logging during ALTER TABLE…ALGORITHM=COPY

If a crash occurs during ALTER TABLE…ALGORITHM=COPY, InnoDB would spend a lot of time rolling back writes to the intermediate copy of the table. To reduce the amount of busy work done, a work-around was introduced in commit fd069e2bb36a3c1c1f26d65dd298b07e6d83ac8b in MySQL 4.1.8 and 5.0.2, to commit the transaction after every 10,000 inserted rows.

A proper fix would have been to disable the undo logging altogether and to simply drop the intermediate copy of the table on subsequent server startup. This is what happens in MariaDB 10.3 with MDEV-14717, MDEV-14585. In MariaDB 10.2, the intermediate copy of the table would be left behind with a name starting with the string #sql.

This is a backport of a bug fix from MySQL 8.0.0 to MariaDB, contributed by jixianliang <271365745@qq.com>.

Unlike recent MySQL, MariaDB supports ALTER IGNORE. For that operation InnoDB must for now keep the undo logging enabled, so that the latest row can be rolled back in case of an error. In Galera cluster, the LOAD DATA statement will retain the existing behaviour and commit the transaction after every 10,000 rows if the parameter wsrep_load_data_splitting=ON is set. The logic to do so (the wsrep_load_data_split() function and the call handler::extra(HA_EXTRA_FAKE_START_STMT)) is joint work by Ji Xianliang and Marko Mäkelä.

The original fix:

Author: Thirunarayanan Balathandayuthapani <thirunarayanan.balathandayuth@oracle.com>
Date: Wed Dec 2 16:09:15 2015 +0530

Bug#17479594 AVOID INTERMEDIATE COMMIT WHILE DOING ALTER TABLE ALGORITHM=COPY

Problem: During ALTER TABLE, we commit and restart the transaction for every 10,000 rows, so that the rollback after recovery would not take so long.

Fix: Suppress the undo logging during the copy alter operation. If fts_index is present, then insert directly into the fts auxiliary table rather than doing it at commit time.

ha_innobase::num_write_row: Remove the variable.

ha_innobase::write_row(): Remove the hack for committing every 10000 rows.

row_lock_table_for_mysql(): Remove the extra 2 parameters.

lock_get_src_table(), lock_is_table_exclusive(): Remove.

Reviewed-by: Marko Mäkelä <marko.makela@oracle.com>
Reviewed-by: Shaohua Wang <shaohua.wang@oracle.com>
Reviewed-by: Jon Olav Hauglid <jon.hauglid@oracle.com>
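For reference, the removed workaround amounted to the following shape of row-counting logic. This is a sketch with stub types and hypothetical structure, not the actual ha_innobase code: commit and restart the transaction after every 10,000 inserted rows so that crash recovery would have less to roll back.

    #include <cstdio>

    // Stub transaction type standing in for the InnoDB transaction handle.
    struct Trx {
        void commit()  { puts("intermediate commit"); }
        void restart() { puts("restart transaction"); }
    };

    // Sketch of the old per-10,000-rows hack that MDEV-11415 removes in
    // favour of suppressing undo logging for the copied table altogether.
    static void copy_table(Trx& trx, unsigned n_rows) {
        unsigned num_write_row = 0;
        for (unsigned i = 0; i < n_rows; i++) {
            // ... write one row to the intermediate copy of the table ...
            if (++num_write_row == 10000) {
                num_write_row = 0;
                trx.commit();      // limits rollback work after a crash,
                trx.restart();     // but makes the ALTER non-atomic
            }
        }
    }

    int main() {
        Trx trx;
        copy_table(trx, 25000);    // would produce two intermediate commits
    }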
6 years ago
MDEV-15662 Instant DROP COLUMN or changing the order of columns

Allow ADD COLUMN anywhere in a table, not only adding as the last column. Allow instant DROP COLUMN and instant changing of the order of columns.

The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible.

Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of the mblob is indicated by setting the delete-mark flag in the metadata record. The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position. (A sketch of this serialisation follows below.)

Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields.

For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob.

If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from the wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags.

Limitations (future work):

MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
MDEV-17468 Avoid table rebuild on operations on generated columns
MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large

btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure.

dict_col_t::DROPPED: Magic value for dict_col_t::ind.

dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN".

dict_col_t::is_added(): Rename from dict_col_t::is_instant().
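Here is a compact C++ sketch of serialising the column information described above: a field count, then per-field data (for a dropped column the NOT NULL flag, the fixed length, and whether the maximum length exceeded 255 bytes; for a surviving column its position). The byte layout chosen here is an assumption for illustration; the real mblob encoding in InnoDB differs in its details.

    #include <cstdint>
    #include <vector>

    struct FieldInfo {
        bool     dropped;
        // Valid when dropped:
        bool     not_null;
        uint16_t fixed_len;      // 0 for variable-length columns
        bool     len_gt_255;     // variable-length and maximum length > 255 bytes
        // Valid when not dropped:
        uint16_t position;       // column position in the table
    };

    // Serialise: 2-byte field count, then 3 bytes per field
    // (1 flag byte + a 2-byte length or position).
    static std::vector<uint8_t> serialise_columns(const std::vector<FieldInfo>& fields) {
        std::vector<uint8_t> blob;
        auto put16 = [&](uint16_t v) {
            blob.push_back(uint8_t(v >> 8));
            blob.push_back(uint8_t(v));
        };
        put16(uint16_t(fields.size()));      // number of clustered index fields
        for (const FieldInfo& f : fields) {
            uint8_t flags = 0;
            if (f.dropped)    flags |= 1;
            if (f.not_null)   flags |= 2;
            if (f.len_gt_255) flags |= 4;
            blob.push_back(flags);
            put16(f.dropped ? f.fixed_len : f.position);
        }
        return blob;
    }

    int main() {
        std::vector<FieldInfo> fields = {
            {false, false, 0, false, 0},     // surviving column at position 0
            {true,  true,  4, false, 0},     // dropped NOT NULL INT (fixed 4 bytes)
            {false, false, 0, false, 3},     // surviving column, reordered to position 3
        };
        return serialise_columns(fields).size() == 2 + 3 * 3 ? 0 : 1;
    }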
dtype_t::metadata_blob_init(): Initialize the mblob data type.

dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record.

dict_table_t::instant: Metadata about dropped or reordered columns.

dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called.

dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well.

dict_table_t::find(): Find an old column based on a new column number.

dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob.

dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns().

dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns().

dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position.

ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.)

instant_alter_column_possible(): Add a parameter for the InnoDB table, to check for additional conditions, such as the maximum number of index fields.

ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns.

innobase_build_col_map(): Skip added virtual columns.

prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later.

commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try().

innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table.

innodb_update_cols(): Renamed from innodb_update_n_cols().

innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error.

innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col().

innobase_instant_add_col(): Replace the parameter dfield with type.

innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL.

innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols().

innobase_rename_column_try(): Skip dropped columns.

commit_cache_norebuild(): Update table->fts->doc_col.

dict_mem_table_col_rename_low(): Skip dropped columns.

trx_undo_rec_get_partial_row(): Skip dropped columns.

trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly.
trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as a delete-mark and an insert.

row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata.

row_undo_mod_clust_low(): On metadata rollback, roll back the root page too.

row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records.

row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback).

row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column (see the sketch below).

row_upd_clust_rec_by_insert_inherit_func(): On the update of the PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value.

btr_index_rec_validate(): Recognize the metadata record.

btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata.

btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to the generic ALTER TABLE format.

btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it.

btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains.

btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways.

btr_cur_trim(): Handle the mblob record.

row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record.

btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored.

btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records.

btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size.

btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE.

REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only.

REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns.
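The dummy-value rule for dropped columns can be illustrated with a small C++ helper, using simplified types rather than InnoDB's dfield_t: NULL for a nullable dropped column, the empty string for a variable-length one, and a run of NUL bytes for a fixed-length one.

    #include <cassert>
    #include <optional>
    #include <string>

    // Dummy value stored in place of an instantly dropped column.
    static std::optional<std::string> dropped_column_dummy(bool nullable,
                                                           bool fixed_length,
                                                           size_t fixed_len) {
        if (nullable)      return std::nullopt;          // SQL NULL
        if (!fixed_length) return std::string();         // empty string
        return std::string(fixed_len, '\0');             // fixed run of NUL bytes
    }

    int main() {
        assert(!dropped_column_dummy(true, false, 0).has_value());   // nullable -> NULL
        assert(dropped_column_dummy(false, false, 0)->empty());      // VARCHAR -> ""
        assert(dropped_column_dummy(false, true, 4)->size() == 4);   // INT -> 4 NUL bytes
    }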
rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record.

rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN); see the sketch below. NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold.

rec_is_add_metadata(): Check if a record is the MDEV-11369 ADD COLUMN metadata record (not the more generic instant ALTER TABLE).

rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet.

rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple.

rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. With mblob=true, process a record with a metadata BLOB.

rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field.

rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record.

row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple.

row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted.

row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column.

Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this. Thanks to Matthias Leich for testing.
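Distinguishing the two metadata record flavours comes down to two info bits, as described above: the MIN_REC flag marks a metadata record, and an additionally set delete-mark flag indicates the generic (mblob-carrying) variant. A minimal sketch, assuming the bit values from InnoDB's compact record header (treat the exact constants as an assumption):

    #include <cassert>
    #include <cstdint>

    // Info bit values as in InnoDB's record header (assumed here).
    constexpr uint8_t REC_INFO_MIN_REC_FLAG = 0x10;
    constexpr uint8_t REC_INFO_DELETED_FLAG = 0x20;

    // Any metadata record carries the MIN_REC flag.
    static bool is_metadata(uint8_t info_bits) {
        return info_bits & REC_INFO_MIN_REC_FLAG;
    }

    // The generic instant ALTER TABLE metadata record (with the mblob)
    // additionally carries the delete-mark flag.
    static bool is_alter_metadata(uint8_t info_bits) {
        return is_metadata(info_bits) && (info_bits & REC_INFO_DELETED_FLAG);
    }

    int main() {
        assert(is_metadata(REC_INFO_MIN_REC_FLAG));             // ADD COLUMN variant
        assert(is_alter_metadata(REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG));
        assert(!is_alter_metadata(REC_INFO_MIN_REC_FLAG));      // no mblob present
    }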
7 years ago
MDEV-15662 Instant DROP COLUMN or changing the order of columns Allow ADD COLUMN anywhere in a table, not only adding as the last column. Allow instant DROP COLUMN and instant changing the order of columns. The added columns will always be added last in clustered index records. In new records, instantly dropped columns will be stored as NULL or empty when possible. Information about dropped and reordered columns will be written in a metadata BLOB (mblob), which is stored before the first 'user' field in the hidden metadata record at the start of the clustered index. The presence of mblob is indicated by setting the delete-mark flag in the metadata record. The metadata BLOB stores the number of clustered index fields, followed by an array of column information for each field. For dropped columns, we store the NOT NULL flag, the fixed length, and for variable-length columns, whether the maximum length exceeded 255 bytes. For non-dropped columns, we store the column position. Unlike with MDEV-11369, when a table becomes empty, it cannot be converted back to the canonical format. The reason for this is that other threads may hold cached objects such as row_prebuilt_t::ins_node that could refer to dropped or reordered index fields. For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, we must store the n_core_null_bytes in the root page, so that the chain of node pointer records can be followed in order to reach the leftmost leaf page where the metadata record is located. If the mblob is present, we will zero-initialize the strings "infimum" and "supremum" in the root page, and use the last byte of "supremum" for storing the number of null bytes (which are allocated but useless on node pointer pages). This is necessary for btr_cur_instant_init_metadata() to be able to navigate to the mblob. If the PRIMARY KEY contains any variable-length column and some nullable columns were instantly dropped, the dict_index_t::n_nullable in the data dictionary could be smaller than it actually is in the non-leaf pages. Because of this, the non-leaf pages could use more bytes for the null flags than the data dictionary expects, and we could be reading the lengths of the variable-length columns from the wrong offset, and thus reading the child page number from wrong place. This is the result of two design mistakes that involve unnecessary storage of data: First, it is nonsense to store any data fields for the leftmost node pointer records, because the comparisons would be resolved by the MIN_REC_FLAG alone. Second, there cannot be any null fields in the clustered index node pointer fields, but we nevertheless reserve space for all the null flags. Limitations (future work): MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists MDEV-17468 Avoid table rebuild on operations on generated columns MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large btr_page_reorganize_low(): Preserve any metadata in the root page. Call lock_move_reorganize_page() only after restoring the "infimum" and "supremum" records, to avoid a memcmp() assertion failure. dict_col_t::DROPPED: Magic value for dict_col_t::ind. dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant(). Do not assert that the column was instantly added, because we sometimes call this unconditionally for all columns. Convert an instantly added column to a "core column". The old name remove_instant() could be mistaken to refer to "instant DROP COLUMN". dict_col_t::is_added(): Rename from dict_col_t::is_instant(). 
dtype_t::metadata_blob_init(): Initialize the mblob data type. dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(), upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits refer to a metadata record. dict_table_t::instant: Metadata about dropped or reordered columns. dict_table_t::prepare_instant(): Prepare ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE. innobase_instant_try() will pass this to dict_table_t::instant_column(). On rollback, dict_table_t::rollback_instant() will be called. dict_table_t::instant_column(): Renamed from instant_add_column(). Add the parameter col_map so that columns can be reordered. Copy and adjust v_cols[] as well. dict_table_t::find(): Find an old column based on a new column number. dict_table_t::serialise_columns(), dict_table_t::deserialise_columns(): Convert the mblob. dict_index_t::instant_metadata(): Create the metadata record for instant ALTER TABLE. Invoke dict_table_t::serialise_columns(). dict_index_t::reconstruct_fields(): Invoked by dict_table_t::deserialise_columns(). dict_index_t::clear_instant_alter(): Move the fields for the dropped columns to the end, and sort the surviving index fields in ascending order of column position. ha_innobase::check_if_supported_inplace_alter(): Do not allow adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.) instant_alter_column_possible(): Add a parameter for InnoDB table, to check for additional conditions, such as the maximum number of index fields. ha_innobase_inplace_ctx::first_alter_pos: The first column whose position is affected by instant ADD, DROP, or changing the order of columns. innobase_build_col_map(): Skip added virtual columns. prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol. Remove some unnecessary code. Note that the call to innodb_base_col_setup() should be executed later. commit_try_norebuild(): If ctx->is_instant(), let the virtual columns be added or dropped by innobase_instant_try(). innobase_instant_try(): Fill in a zero default value for the hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459). If any columns were dropped or reordered (or added not last), delete any SYS_COLUMNS records for the following columns, and insert SYS_COLUMNS records for all subsequent stored columns as well as for all virtual columns. If any virtual column is dropped, rewrite all virtual column metadata. Use a shortcut only for adding virtual columns. This is because innobase_drop_virtual_try() assumes that the dropped virtual columns still exist in ctx->old_table. innodb_update_cols(): Renamed from innodb_update_n_cols(). innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change the return type to bool, and invoke my_error() when detecting an error. innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS. Refactored from innobase_add_one_virtual() and innobase_instant_add_col(). innobase_instant_add_col(): Replace the parameter dfield with type. innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS and all columns from SYS_VIRTUAL. innobase_add_virtual_try(), innobase_drop_virtual_try(): Let the caller invoke innodb_update_cols(). innobase_rename_column_try(): Skip dropped columns. commit_cache_norebuild(): Update table->fts->doc_col. dict_mem_table_col_rename_low(): Skip dropped columns. trx_undo_rec_get_partial_row(): Skip dropped columns. trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly. 
trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields. Log metadata records consistently. Apparently, the first fields of a clustered index may be updated in an update_undo vector when the index is ID_IND of SYS_FOREIGN, as part of renaming the table during ALTER TABLE. Normally, updates of the PRIMARY KEY should be logged as delete-mark and an insert. row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec(): Use trx_undo_metadata. row_undo_mod_clust_low(): On metadata rollback, roll back the root page too. row_undo_mod_clust(): Relax an assertion. The delete-mark flag was repurposed for ALTER TABLE metadata records. row_rec_to_index_entry_impl(): Add the template parameter mblob and the optional parameter info_bits for specifying the desired new info bits. For the metadata tuple, allow conversion between the original format (ADD COLUMN only) and the generic format (with hidden BLOB). Add the optional parameter "pad" to determine whether the tuple should be padded to the index fields (on ALTER TABLE it should), or whether it should remain at its original size (on rollback). row_build_index_entry_low(): Clean up the code, removing redundant variables and conditions. For instantly dropped columns, generate a dummy value that is NULL, the empty string, or a fixed length of NUL bytes, depending on the type of the dropped column. row_upd_clust_rec_by_insert_inherit_func(): On the update of PRIMARY KEY of a record that contained a dropped column whose value was stored externally, we will be inserting a dummy NULL or empty string value to the field of the dropped column. The externally stored column would eventually be dropped when purge removes the delete-marked record for the old PRIMARY KEY value. btr_index_rec_validate(): Recognize the metadata record. btr_discard_only_page_on_level(): Preserve the generic instant ALTER TABLE metadata. btr_set_instant(): Replaces page_set_instant(). This sets a clustered index root page to the appropriate format, or upgrades from the MDEV-11369 instant ADD COLUMN to generic ALTER TABLE format. btr_cur_instant_init_low(): Read and validate the metadata BLOB page before reconstructing the dictionary information based on it. btr_cur_instant_init_metadata(): Do not read any lengths from the metadata record header before reading the BLOB. At this point, we would not actually know how many nullable fields the metadata record contains. btr_cur_instant_root_init(): Initialize n_core_null_bytes in one of two possible ways. btr_cur_trim(): Handle the mblob record. row_metadata_to_tuple(): Convert a metadata record to a data tuple, based on the new info_bits of the metadata record. btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed. Invoke dtuple_convert_big_rec() for metadata records if the record is too large, or if the mblob is not yet marked as externally stored. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): When the last user record is deleted, do not delete the generic instant ALTER TABLE metadata record. Only delete MDEV-11369 instant ADD COLUMN metadata records. btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size. btr_pcur_store_position(): Allow a logically empty page to contain a metadata record for generic ALTER TABLE. REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW. This is for the old instant ADD COLUMN (MDEV-11369) only. REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record, with additional information for dropped or reordered columns. 
rec_info_bits_valid(): Remove. The only case when this would fail is when the record is the generic ALTER TABLE metadata record.

rec_is_alter_metadata(): Check if a record is the metadata record for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function must not be invoked on node pointer records, because the delete-mark flag in those records may be set (it is garbage), and then a debug assertion could fail because index->is_instant() does not necessarily hold.

rec_is_add_metadata(): Check if a record is an MDEV-11369 ADD COLUMN metadata record (not the more generic instant ALTER TABLE one).

rec_get_converted_size_comp_prefix_low(): Assume that the metadata field will be stored externally. In dtuple_convert_big_rec() during the rec_get_converted_size() call, it would not be there yet.

rec_get_converted_size_comp(): Replace status, fields, n_fields with tuple.

rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(), rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>. With mblob=true, process a record with a metadata BLOB.

rec_copy_prefix_to_buf(): Assert that no fields beyond the key and system columns are being copied. Exclude the metadata BLOB field.

rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple into a record.

row_upd_index_replace_metadata(): Apply an update vector to an alter_metadata tuple.

row_log_allocate(): Replace dict_index_t::is_instant() with a more appropriate condition that ignores dict_table_t::instant. Only a table on which the MDEV-11369 ADD COLUMN was performed can "lose its instantness" when it becomes empty. After instant DROP COLUMN or reordering columns, we cannot simply convert the table to the canonical format, because the data dictionary cache and all possibly existing references to it from other client connection threads would have to be adjusted.

row_quiesce_write_index_fields(): Do not crash when the table contains an instantly dropped column.

Thanks to Thirunarayanan Balathandayuthapani for discussing the design and implementing an initial prototype of this. Thanks to Matthias Leich for testing.
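As a rough illustration of the metadata BLOB that dict_table_t::serialise_columns() writes and dict_table_t::deserialise_columns() reads, here is a minimal self-contained C++ sketch of such an encoding: a field count, followed by per-field information (for dropped columns the NOT NULL flag, the fixed length, and whether a variable-length maximum exceeded 255 bytes; for surviving columns the column position). The MetaCol struct, the tag byte, and the big-endian widths are illustrative assumptions, not InnoDB's actual on-disk format.

    #include <cstdint>
    #include <vector>
    #include <cassert>

    struct MetaCol {
        bool     dropped;     // instantly dropped column?
        bool     not_null;    // dropped columns: NOT NULL flag
        uint16_t fixed_len;   // dropped columns: fixed length, 0 if variable-length
        bool     len_gt_255;  // dropped columns: variable-length maximum > 255 bytes?
        uint16_t position;    // surviving columns: column position
    };

    static std::vector<uint8_t> serialise(const std::vector<MetaCol>& cols)
    {
        std::vector<uint8_t> blob;
        blob.push_back(uint8_t(cols.size() >> 8));  // number of index fields,
        blob.push_back(uint8_t(cols.size()));       // big-endian 16 bits
        for (const MetaCol& c : cols)
            if (c.dropped) {
                // high bit of the tag marks a dropped column; low bits carry flags
                blob.push_back(uint8_t(0x80 | c.not_null | (c.len_gt_255 ? 2 : 0)));
                blob.push_back(uint8_t(c.fixed_len >> 8));
                blob.push_back(uint8_t(c.fixed_len));
            } else {
                blob.push_back(0);
                blob.push_back(uint8_t(c.position >> 8));
                blob.push_back(uint8_t(c.position));
            }
        return blob;
    }

    static std::vector<MetaCol> deserialise(const std::vector<uint8_t>& blob)
    {
        auto u16 = [&blob](size_t at) { return uint16_t(blob[at] << 8 | blob[at + 1]); };
        std::vector<MetaCol> cols(u16(0));
        size_t i = 2;
        for (MetaCol& c : cols) {
            const uint8_t tag = blob[i++];
            c.dropped = (tag & 0x80) != 0;
            if (c.dropped) {
                c.not_null   = tag & 1;
                c.len_gt_255 = (tag & 2) != 0;
                c.fixed_len  = u16(i);
            } else {
                c.position = u16(i);
            }
            i += 2;
        }
        assert(i == blob.size());
        return cols;
    }

    int main()
    {
        const std::vector<MetaCol> cols = {
            {false, false, 0, false, 0},  // surviving column at position 0
            {true,  true,  4, false, 0},  // dropped NOT NULL column of 4 fixed bytes
            {true,  false, 0, true,  0},  // dropped nullable long variable-length column
        };
        const std::vector<MetaCol> back = deserialise(serialise(cols));
        assert(back.size() == 3);
        assert(!back[0].dropped && back[0].position == 0);
        assert(back[1].dropped && back[1].not_null && back[1].fixed_len == 4);
        assert(back[2].dropped && back[2].len_gt_255);
        return 0;
    }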
7 years ago
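The rule stated above for row_build_index_entry_low(), that a dummy value for an instantly dropped column is NULL, the empty string, or a run of NUL bytes depending on the column's type, can be sketched as follows. DroppedCol and dummy_value() are hypothetical names for illustration, not InnoDB code.

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <cassert>

    struct DroppedCol {
        bool     not_null;   // was the dropped column NOT NULL?
        uint16_t fixed_len;  // fixed length in bytes; 0 for variable-length
    };

    // Returns {is_null, bytes}: NULL for nullable columns, an empty string for
    // variable-length NOT NULL columns, or fixed_len NUL bytes otherwise.
    static std::pair<bool, std::string> dummy_value(const DroppedCol& c)
    {
        if (!c.not_null)      return {true, std::string()};
        if (c.fixed_len == 0) return {false, std::string()};
        return {false, std::string(c.fixed_len, '\0')};
    }

    int main()
    {
        assert(dummy_value({false, 4}).first);              // nullable: NULL
        assert(dummy_value({true, 0}).second.empty());      // VARCHAR NOT NULL: ''
        assert(dummy_value({true, 4}).second.size() == 4);  // INT NOT NULL: 4 NUL bytes
        return 0;
    }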
MDEV-10139 Support for InnoDB SEQUENCE objects

We introduce a NO_ROLLBACK flag for InnoDB tables. This flag only works for tables that have a single index. Apart from undo logging, this flag will also prevent locking and the assignment of DB_ROW_ID or DB_TRX_ID, and imply READ UNCOMMITTED isolation. It is assumed that the SQL layer is guaranteeing mutual exclusion.

After the initial insert of the single record during CREATE SEQUENCE, InnoDB will be updating the single record in-place. This is crash-safe thanks to the redo log. (That is, after a crash after CREATE SEQUENCE was committed, the effect of sequence operations will be observable fully or not at all.)

When it comes to the durability of the updates of SEQUENCE in InnoDB, there is a clear analogy to MDEV-6076 Persistent AUTO_INCREMENT. The updates would be made persistent by the InnoDB redo log flush at transaction commit or rollback (or XA PREPARE), provided that innodb_flush_log_at_trx_commit=1. Similar to AUTO_INCREMENT, it is possible that the update of a SEQUENCE in the middle of a transaction becomes durable before the COMMIT/ROLLBACK of the transaction, in case the InnoDB redo log is being flushed as a result of the commit or rollback of some other transaction, or as a result of a redo log checkpoint that can be initiated at any time by operations that are writing redo log.

dict_table_t::no_rollback(): Check if the table does not support rollback.

BTR_NO_ROLLBACK: Logging and locking flags for no_rollback() tables.

DICT_TF_BITS: Add the NO_ROLLBACK flag.

row_ins_step(): Assign 0 to DB_ROW_ID and DB_TRX_ID, and skip any locking for no-rollback tables. There will be only a single row in no-rollback tables (or there must be a proper PRIMARY KEY).

row_search_mvcc(): Execute the READ UNCOMMITTED code path for no-rollback tables.

ha_innobase::external_lock(), ha_innobase::store_lock(): Block CREATE/DROP SEQUENCE in innodb_read_only mode. This probably has no effect for CREATE SEQUENCE, because ha_innobase::create() should already have been called (and have refused) before external_lock() or store_lock() is called.

ha_innobase::store_lock(): For CREATE SEQUENCE, do not acquire any InnoDB locks, even though TL_WRITE is being requested. (This is just a performance optimization.)

innobase_copy_frm_flags_from_create_info(), row_drop_table_for_mysql(): Disable persistent statistics for no_rollback tables.
9 years ago
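As a minimal sketch of a table-flags bit such as NO_ROLLBACK with an accessor in the spirit of dict_table_t::no_rollback(), assuming a hypothetical bit layout (InnoDB's real DICT_TF_BITS layout differs):

    #include <cstdint>
    #include <cassert>

    // Hypothetical flag bits in the spirit of DICT_TF_*; positions are made up.
    enum : uint32_t {
        TF_COMPACT     = 1U << 0,
        TF_NO_ROLLBACK = 1U << 4  // no undo logging, no locking, no DB_TRX_ID
    };

    struct Table {
        uint32_t flags;
        // Check whether the table does not support rollback,
        // cf. dict_table_t::no_rollback().
        bool no_rollback() const { return (flags & TF_NO_ROLLBACK) != 0; }
    };

    int main()
    {
        Table sequence_backing{TF_COMPACT | TF_NO_ROLLBACK};  // e.g. a SEQUENCE table
        Table ordinary{TF_COMPACT};
        assert(sequence_backing.no_rollback());  // skips undo logging and locking
        assert(!ordinary.no_rollback());
        return 0;
    }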
MDEV-10139 Support for InnoDB SEQUENCE objects We introduce a NO_ROLLBACK flag for InnoDB tables. This flag only works for tables that have a single index. Apart from undo logging, this flag will also prevent locking and the assignment of DB_ROW_ID or DB_TRX_ID, and imply READ UNCOMMITTED isolation. It is assumed that the SQL layer is guaranteeing mutual exclusion. After the initial insert of the single record during CREATE SEQUENCE, InnoDB will be updating the single record in-place. This is crash-safe thanks to the redo log. (That is, after a crash after CREATE SEQUENCE was committed, the effect of sequence operations will be observable fully or not at all.) When it comes to the durability of the updates of SEQUENCE in InnoDB, there is a clear analogy to MDEV-6076 Persistent AUTO_INCREMENT. The updates would be made persistent by the InnoDB redo log flush at transaction commit or rollback (or XA PREPARE), provided that innodb_log_flush_at_trx_commit=1. Similar to AUTO_INCREMENT, it is possible that the update of a SEQUENCE in a middle of transaction becomes durable before the COMMIT/ROLLBACK of the transaction, in case the InnoDB redo log is being flushed as a result of the a commit or rollback of some other transaction, or as a result of a redo log checkpoint that can be initiated at any time by operations that are writing redo log. dict_table_t::no_rollback(): Check if the table does not support rollback. BTR_NO_ROLLBACK: Logging and locking flags for no_rollback() tables. DICT_TF_BITS: Add the NO_ROLLBACK flag. row_ins_step(): Assign 0 to DB_ROW_ID and DB_TRX_ID, and skip any locking for no-rollback tables. There will be only a single row in no-rollback tables (or there must be a proper PRIMARY KEY). row_search_mvcc(): Execute the READ UNCOMMITTED code path for no-rollback tables. ha_innobase::external_lock(), ha_innobase::store_lock(): Block CREATE/DROP SEQUENCE in innodb_read_only mode. This probably has no effect for CREATE SEQUENCE, because already ha_innobase::create() should have been called (and refused) before external_lock() or store_lock() is called. ha_innobase::store_lock(): For CREATE SEQUENCE, do not acquire any InnoDB locks, even though TL_WRITE is being requested. (This is just a performance optimization.) innobase_copy_frm_flags_from_create_info(), row_drop_table_for_mysql(): Disable persistent statistics for no_rollback tables.
9 years ago
MDEV-10139 Support for InnoDB SEQUENCE objects We introduce a NO_ROLLBACK flag for InnoDB tables. This flag only works for tables that have a single index. Apart from undo logging, this flag will also prevent locking and the assignment of DB_ROW_ID or DB_TRX_ID, and imply READ UNCOMMITTED isolation. It is assumed that the SQL layer is guaranteeing mutual exclusion. After the initial insert of the single record during CREATE SEQUENCE, InnoDB will be updating the single record in-place. This is crash-safe thanks to the redo log. (That is, after a crash after CREATE SEQUENCE was committed, the effect of sequence operations will be observable fully or not at all.) When it comes to the durability of the updates of SEQUENCE in InnoDB, there is a clear analogy to MDEV-6076 Persistent AUTO_INCREMENT. The updates would be made persistent by the InnoDB redo log flush at transaction commit or rollback (or XA PREPARE), provided that innodb_log_flush_at_trx_commit=1. Similar to AUTO_INCREMENT, it is possible that the update of a SEQUENCE in a middle of transaction becomes durable before the COMMIT/ROLLBACK of the transaction, in case the InnoDB redo log is being flushed as a result of the a commit or rollback of some other transaction, or as a result of a redo log checkpoint that can be initiated at any time by operations that are writing redo log. dict_table_t::no_rollback(): Check if the table does not support rollback. BTR_NO_ROLLBACK: Logging and locking flags for no_rollback() tables. DICT_TF_BITS: Add the NO_ROLLBACK flag. row_ins_step(): Assign 0 to DB_ROW_ID and DB_TRX_ID, and skip any locking for no-rollback tables. There will be only a single row in no-rollback tables (or there must be a proper PRIMARY KEY). row_search_mvcc(): Execute the READ UNCOMMITTED code path for no-rollback tables. ha_innobase::external_lock(), ha_innobase::store_lock(): Block CREATE/DROP SEQUENCE in innodb_read_only mode. This probably has no effect for CREATE SEQUENCE, because already ha_innobase::create() should have been called (and refused) before external_lock() or store_lock() is called. ha_innobase::store_lock(): For CREATE SEQUENCE, do not acquire any InnoDB locks, even though TL_WRITE is being requested. (This is just a performance optimization.) innobase_copy_frm_flags_from_create_info(), row_drop_table_for_mysql(): Disable persistent statistics for no_rollback tables.
9 years ago
MDEV-10139 Support for InnoDB SEQUENCE objects We introduce a NO_ROLLBACK flag for InnoDB tables. This flag only works for tables that have a single index. Apart from undo logging, this flag will also prevent locking and the assignment of DB_ROW_ID or DB_TRX_ID, and imply READ UNCOMMITTED isolation. It is assumed that the SQL layer is guaranteeing mutual exclusion. After the initial insert of the single record during CREATE SEQUENCE, InnoDB will be updating the single record in-place. This is crash-safe thanks to the redo log. (That is, after a crash after CREATE SEQUENCE was committed, the effect of sequence operations will be observable fully or not at all.) When it comes to the durability of the updates of SEQUENCE in InnoDB, there is a clear analogy to MDEV-6076 Persistent AUTO_INCREMENT. The updates would be made persistent by the InnoDB redo log flush at transaction commit or rollback (or XA PREPARE), provided that innodb_log_flush_at_trx_commit=1. Similar to AUTO_INCREMENT, it is possible that the update of a SEQUENCE in a middle of transaction becomes durable before the COMMIT/ROLLBACK of the transaction, in case the InnoDB redo log is being flushed as a result of the a commit or rollback of some other transaction, or as a result of a redo log checkpoint that can be initiated at any time by operations that are writing redo log. dict_table_t::no_rollback(): Check if the table does not support rollback. BTR_NO_ROLLBACK: Logging and locking flags for no_rollback() tables. DICT_TF_BITS: Add the NO_ROLLBACK flag. row_ins_step(): Assign 0 to DB_ROW_ID and DB_TRX_ID, and skip any locking for no-rollback tables. There will be only a single row in no-rollback tables (or there must be a proper PRIMARY KEY). row_search_mvcc(): Execute the READ UNCOMMITTED code path for no-rollback tables. ha_innobase::external_lock(), ha_innobase::store_lock(): Block CREATE/DROP SEQUENCE in innodb_read_only mode. This probably has no effect for CREATE SEQUENCE, because already ha_innobase::create() should have been called (and refused) before external_lock() or store_lock() is called. ha_innobase::store_lock(): For CREATE SEQUENCE, do not acquire any InnoDB locks, even though TL_WRITE is being requested. (This is just a performance optimization.) innobase_copy_frm_flags_from_create_info(), row_drop_table_for_mysql(): Disable persistent statistics for no_rollback tables.
9 years ago
  1. /*****************************************************************************
  2. Copyright (c) 1996, 2016, Oracle and/or its affiliates. All Rights Reserved.
  3. Copyright (c) 2016, 2021, MariaDB Corporation.
  4. This program is free software; you can redistribute it and/or modify it under
  5. the terms of the GNU General Public License as published by the Free Software
  6. Foundation; version 2 of the License.
  7. This program is distributed in the hope that it will be useful, but WITHOUT
  8. ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
  9. FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
  10. You should have received a copy of the GNU General Public License along with
  11. this program; if not, write to the Free Software Foundation, Inc.,
  12. 51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA
  13. *****************************************************************************/
  14. /**************************************************//**
  15. @file row/row0ins.cc
  16. Insert into a table
  17. Created 4/20/1996 Heikki Tuuri
  18. *******************************************************/
  19. #include "row0ins.h"
  20. #include "dict0dict.h"
  21. #include "trx0rec.h"
  22. #include "trx0undo.h"
  23. #include "btr0btr.h"
  24. #include "btr0cur.h"
  25. #include "mach0data.h"
  26. #include "ibuf0ibuf.h"
  27. #include "que0que.h"
  28. #include "row0upd.h"
  29. #include "row0sel.h"
  30. #include "row0log.h"
  31. #include "rem0cmp.h"
  32. #include "lock0lock.h"
  33. #include "log0log.h"
  34. #include "eval0eval.h"
  35. #include "data0data.h"
  36. #include "buf0lru.h"
  37. #include "fts0fts.h"
  38. #include "fts0types.h"
  39. #ifdef WITH_WSREP
  40. #include "wsrep_mysqld.h"
  41. #endif /* WITH_WSREP */
  42. /*************************************************************************
  43. IMPORTANT NOTE: Any operation that generates redo MUST check that there
  44. is enough space in the redo log for that operation. This is
  45. done by calling log_free_check(). The reason for checking the
  46. availability of the redo log space before the start of the operation is
  47. that we MUST not hold any synchronization objects when performing the
  48. check.
  49. If you make a change in this module make sure that no codepath is
  50. introduced where a call to log_free_check() is bypassed. */
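As a minimal illustration of the rule above, the following self-contained sketch (stub types and functions only, not InnoDB's real declarations) shows the required ordering: reserve redo log space while holding no latches, and only then start the mini-transaction that will generate redo.

#include <cassert>

struct mtr_t { bool active = false; };  // stand-in for a mini-transaction

static bool holding_any_latch = false;  // assumption: tracked by the caller

static void log_free_check_stub()
{
  // May wait for a redo log flush or checkpoint, so it must never be
  // called while holding any synchronization objects.
  assert(!holding_any_latch);
}

static void mtr_start_stub(mtr_t *mtr) { mtr->active = true; }

static void row_operation()
{
  log_free_check_stub();  /* 1. ensure redo log space, latch-free */
  mtr_t mtr;
  mtr_start_stub(&mtr);   /* 2. only now begin generating redo */
  /* ... redo-generating work, then the mtr would be committed ... */
}

int main() { row_operation(); return 0; }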
  51. /** Create a row template for each index of a table. */
  52. static void ins_node_create_entry_list(ins_node_t *node)
  53. {
  54. node->entry_list.reserve(UT_LIST_GET_LEN(node->table->indexes));
  55. for (dict_index_t *index= dict_table_get_first_index(node->table); index;
  56. index= dict_table_get_next_index(index))
  57. {
  58. /* Corrupted or incomplete secondary indexes will be filtered out in
  59. row_ins(). */
  60. dtuple_t *entry= index->online_status >= ONLINE_INDEX_ABORTED
  61. ? dtuple_create(node->entry_sys_heap, 0)
  62. : row_build_index_entry_low(node->row, NULL, index, node->entry_sys_heap,
  63. ROW_BUILD_FOR_INSERT);
  64. node->entry_list.push_back(entry);
  65. }
  66. }
  67. /*****************************************************************//**
  68. Adds system field buffers to a row. */
  69. static
  70. void
  71. row_ins_alloc_sys_fields(
  72. /*=====================*/
  73. ins_node_t* node) /*!< in: insert node */
  74. {
  75. dtuple_t* row;
  76. dict_table_t* table;
  77. const dict_col_t* col;
  78. dfield_t* dfield;
  79. row = node->row;
  80. table = node->table;
  81. ut_ad(dtuple_get_n_fields(row) == dict_table_get_n_cols(table));
  82. /* allocate buffer to hold the needed system created hidden columns. */
  83. compile_time_assert(DATA_ROW_ID_LEN
  84. + DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN
  85. == sizeof node->sys_buf);
  86. memset(node->sys_buf, 0, sizeof node->sys_buf);
  87. /* Assign DB_ROLL_PTR to 1 << ROLL_PTR_INSERT_FLAG_POS */
  88. node->sys_buf[DATA_ROW_ID_LEN + DATA_TRX_ID_LEN] = 0x80;
  89. ut_ad(!memcmp(node->sys_buf + DATA_ROW_ID_LEN, reset_trx_id,
  90. sizeof reset_trx_id));
  91. /* 1. Populate row-id */
  92. col = dict_table_get_sys_col(table, DATA_ROW_ID);
  93. dfield = dtuple_get_nth_field(row, dict_col_get_no(col));
  94. dfield_set_data(dfield, node->sys_buf, DATA_ROW_ID_LEN);
  95. /* 2. Populate trx id */
  96. col = dict_table_get_sys_col(table, DATA_TRX_ID);
  97. dfield = dtuple_get_nth_field(row, dict_col_get_no(col));
  98. dfield_set_data(dfield, &node->sys_buf[DATA_ROW_ID_LEN],
  99. DATA_TRX_ID_LEN);
  100. col = dict_table_get_sys_col(table, DATA_ROLL_PTR);
  101. dfield = dtuple_get_nth_field(row, dict_col_get_no(col));
  102. dfield_set_data(dfield, &node->sys_buf[DATA_ROW_ID_LEN
  103. + DATA_TRX_ID_LEN],
  104. DATA_ROLL_PTR_LEN);
  105. }
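The 0x80 byte stored above encodes roll_ptr_t(1) << ROLL_PTR_INSERT_FLAG_POS in the most significant byte of the big-endian 7-byte DB_ROLL_PTR, i.e. the canonical "fresh insert" roll pointer. A stand-alone sketch (the constants mirror the field lengths used by sys_buf above) that verifies the encoding:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Widths of the hidden system columns, as in the function above.
constexpr size_t DATA_ROW_ID_LEN   = 6;
constexpr size_t DATA_TRX_ID_LEN   = 6;
constexpr size_t DATA_ROLL_PTR_LEN = 7;
constexpr unsigned ROLL_PTR_INSERT_FLAG_POS = 55;

int main()
{
  uint8_t sys_buf[DATA_ROW_ID_LEN + DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN];
  memset(sys_buf, 0, sizeof sys_buf);
  // First (most significant) byte of the big-endian DB_ROLL_PTR:
  sys_buf[DATA_ROW_ID_LEN + DATA_TRX_ID_LEN] = 0x80;
  uint64_t roll_ptr = 0;
  for (size_t i = 0; i < DATA_ROLL_PTR_LEN; i++)
    roll_ptr = roll_ptr << 8
      | sys_buf[DATA_ROW_ID_LEN + DATA_TRX_ID_LEN + i];
  // 0x80 << 48 == 1 << 55: prints 1
  printf("%d\n", roll_ptr == (uint64_t(1) << ROLL_PTR_INSERT_FLAG_POS));
  return 0;
}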
  106. /*********************************************************************//**
  107. Sets a new row to insert for an INS_DIRECT node. This function is only used
  108. if we have constructed the row separately, which is a rare case; this
  109. function is quite slow. */
  110. void
  111. ins_node_set_new_row(
  112. /*=================*/
  113. ins_node_t* node, /*!< in: insert node */
  114. dtuple_t* row) /*!< in: new row (or first row) for the node */
  115. {
  116. node->state = INS_NODE_SET_IX_LOCK;
  117. node->index = NULL;
  118. node->entry_list.clear();
  119. node->entry = node->entry_list.end();
  120. node->row = row;
  121. mem_heap_empty(node->entry_sys_heap);
  122. /* Create templates for index entries */
  123. ins_node_create_entry_list(node);
  124. /* Allocate from entry_sys_heap buffers for sys fields */
  125. row_ins_alloc_sys_fields(node);
  126. /* As we allocated a new trx id buf, the trx id should be written
  127. there again: */
  128. node->trx_id = 0;
  129. }
  130. /*******************************************************************//**
  131. Does an insert operation by updating a delete-marked existing record
  132. in the index. This situation can occur if the delete-marked record is
  133. kept in the index for consistent reads.
  134. @return DB_SUCCESS or error code */
  135. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  136. dberr_t
  137. row_ins_sec_index_entry_by_modify(
  138. /*==============================*/
  139. ulint flags, /*!< in: undo logging and locking flags */
  140. ulint mode, /*!< in: BTR_MODIFY_LEAF or BTR_MODIFY_TREE,
  141. depending on whether mtr holds just a leaf
  142. latch or also a tree latch */
  143. btr_cur_t* cursor, /*!< in: B-tree cursor */
  144. rec_offs** offsets,/*!< in/out: offsets on cursor->page_cur.rec */
  145. mem_heap_t* offsets_heap,
  146. /*!< in/out: memory heap that can be emptied */
  147. mem_heap_t* heap, /*!< in/out: memory heap */
  148. const dtuple_t* entry, /*!< in: index entry to insert */
  149. que_thr_t* thr, /*!< in: query thread */
  150. mtr_t* mtr) /*!< in: mtr; must be committed before
  151. latching any further pages */
  152. {
  153. big_rec_t* dummy_big_rec;
  154. upd_t* update;
  155. rec_t* rec;
  156. dberr_t err;
  157. rec = btr_cur_get_rec(cursor);
  158. ut_ad(!dict_index_is_clust(cursor->index));
  159. ut_ad(rec_offs_validate(rec, cursor->index, *offsets));
  160. ut_ad(!entry->info_bits);
  161. /* We know that in the alphabetical ordering, entry and rec are
  162. identical. But in their binary form there may be differences if
  163. there are char fields in them. Therefore we have to calculate the
  164. difference. */
  165. update = row_upd_build_sec_rec_difference_binary(
  166. rec, cursor->index, *offsets, entry, heap);
  167. if (!rec_get_deleted_flag(rec, rec_offs_comp(*offsets))) {
  168. /* We should never insert in place of a record that
  169. has not been delete-marked. The only exception is when
  170. online CREATE INDEX copied the changes that we already
  171. made to the clustered index, and completed the
  172. secondary index creation before we got here. In this
  173. case, the change would already be there. The CREATE
  174. INDEX should be waiting for a MySQL meta-data lock
  175. upgrade at least until this INSERT or UPDATE
  176. returns. After that point, set_committed(true)
  177. would be invoked in commit_inplace_alter_table(). */
  178. ut_a(update->n_fields == 0);
  179. ut_a(!cursor->index->is_committed());
  180. ut_ad(!dict_index_is_online_ddl(cursor->index));
  181. return(DB_SUCCESS);
  182. }
  183. if (mode == BTR_MODIFY_LEAF) {
  184. /* Try an optimistic updating of the record, keeping changes
  185. within the page */
  186. /* TODO: pass only *offsets */
  187. err = btr_cur_optimistic_update(
  188. flags | BTR_KEEP_SYS_FLAG, cursor,
  189. offsets, &offsets_heap, update, 0, thr,
  190. thr_get_trx(thr)->id, mtr);
  191. switch (err) {
  192. case DB_OVERFLOW:
  193. case DB_UNDERFLOW:
  194. case DB_ZIP_OVERFLOW:
  195. err = DB_FAIL;
  196. default:
  197. break;
  198. }
  199. } else {
  200. ut_a(mode == BTR_MODIFY_TREE);
  201. if (buf_LRU_buf_pool_running_out()) {
  202. return(DB_LOCK_TABLE_FULL);
  203. }
  204. err = btr_cur_pessimistic_update(
  205. flags | BTR_KEEP_SYS_FLAG, cursor,
  206. offsets, &offsets_heap,
  207. heap, &dummy_big_rec, update, 0,
  208. thr, thr_get_trx(thr)->id, mtr);
  209. ut_ad(!dummy_big_rec);
  210. }
  211. return(err);
  212. }
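The switch above folds the three "does not fit in the page" outcomes of the optimistic path into a single DB_FAIL, which the caller treats as a cue to retry with BTR_MODIFY_TREE. A minimal sketch of that fallback contract (the two update functions are hypothetical stand-ins, and the enum values are illustrative, not InnoDB's):

#include <cstdio>

enum dberr_t { DB_SUCCESS, DB_FAIL, DB_OVERFLOW, DB_UNDERFLOW, DB_ZIP_OVERFLOW };

static dberr_t optimistic_update()  { return DB_OVERFLOW; } // needs a page split
static dberr_t pessimistic_update() { return DB_SUCCESS; }  // may split/merge pages

int main()
{
  dberr_t err = optimistic_update();
  switch (err) {
  case DB_OVERFLOW:
  case DB_UNDERFLOW:
  case DB_ZIP_OVERFLOW:
    err = DB_FAIL;  // normalized: "retry while holding a tree latch"
    break;
  default:
    break;
  }
  if (err == DB_FAIL)
    err = pessimistic_update();
  printf("%d\n", err == DB_SUCCESS);  // prints 1
  return 0;
}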
  213. /*******************************************************************//**
  214. Does an insert operation by delete-unmarking and updating a delete-marked
  215. existing record in the index. This situation can occur if the delete-marked
  216. record is kept in the index for consistent reads.
  217. @return DB_SUCCESS, DB_FAIL, or error code */
  218. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  219. dberr_t
  220. row_ins_clust_index_entry_by_modify(
  221. /*================================*/
  222. btr_pcur_t* pcur, /*!< in/out: a persistent cursor pointing
  223. to the clust_rec that is being modified. */
  224. ulint flags, /*!< in: undo logging and locking flags */
  225. ulint mode, /*!< in: BTR_MODIFY_LEAF or BTR_MODIFY_TREE,
  226. depending on whether mtr holds just a leaf
  227. latch or also a tree latch */
  228. rec_offs** offsets,/*!< out: offsets on cursor->page_cur.rec */
  229. mem_heap_t** offsets_heap,
  230. /*!< in/out: pointer to memory heap that can
  231. be emptied, or NULL */
  232. mem_heap_t* heap, /*!< in/out: memory heap */
  233. const dtuple_t* entry, /*!< in: index entry to insert */
  234. que_thr_t* thr, /*!< in: query thread */
  235. mtr_t* mtr) /*!< in: mtr; must be committed before
  236. latching any further pages */
  237. {
  238. const rec_t* rec;
  239. upd_t* update;
  240. dberr_t err = DB_SUCCESS;
  241. btr_cur_t* cursor = btr_pcur_get_btr_cur(pcur);
  242. TABLE* mysql_table = NULL;
  243. ut_ad(dict_index_is_clust(cursor->index));
  244. rec = btr_cur_get_rec(cursor);
  245. ut_ad(rec_get_deleted_flag(rec,
  246. dict_table_is_comp(cursor->index->table)));
  247. /* In delete-marked records, DB_TRX_ID must
  248. always refer to an existing undo log record. */
  249. ut_ad(rec_get_trx_id(rec, cursor->index));
  250. /* Build an update vector containing all the fields to be modified;
  251. NOTE that this vector may NOT contain system columns trx_id or
  252. roll_ptr */
  253. if (thr->prebuilt != NULL) {
  254. mysql_table = thr->prebuilt->m_mysql_table;
  255. ut_ad(thr->prebuilt->trx == thr_get_trx(thr));
  256. }
  257. update = row_upd_build_difference_binary(
  258. cursor->index, entry, rec, NULL, true,
  259. thr_get_trx(thr), heap, mysql_table, &err);
  260. if (err != DB_SUCCESS) {
  261. return(err);
  262. }
  263. if (mode != BTR_MODIFY_TREE) {
  264. ut_ad((mode & ulint(~BTR_ALREADY_S_LATCHED))
  265. == BTR_MODIFY_LEAF);
  266. /* Try optimistic updating of the record, keeping changes
  267. within the page */
  268. err = btr_cur_optimistic_update(
  269. flags, cursor, offsets, offsets_heap, update, 0, thr,
  270. thr_get_trx(thr)->id, mtr);
  271. switch (err) {
  272. case DB_OVERFLOW:
  273. case DB_UNDERFLOW:
  274. case DB_ZIP_OVERFLOW:
  275. err = DB_FAIL;
  276. default:
  277. break;
  278. }
  279. } else {
  280. if (buf_LRU_buf_pool_running_out()) {
  281. return(DB_LOCK_TABLE_FULL);
  282. }
  283. big_rec_t* big_rec = NULL;
  284. err = btr_cur_pessimistic_update(
  285. flags | BTR_KEEP_POS_FLAG,
  286. cursor, offsets, offsets_heap, heap,
  287. &big_rec, update, 0, thr, thr_get_trx(thr)->id, mtr);
  288. if (big_rec) {
  289. ut_a(err == DB_SUCCESS);
  290. DEBUG_SYNC_C("before_row_ins_upd_extern");
  291. err = btr_store_big_rec_extern_fields(
  292. pcur, *offsets, big_rec, mtr,
  293. BTR_STORE_INSERT_UPDATE);
  294. DEBUG_SYNC_C("after_row_ins_upd_extern");
  295. dtuple_big_rec_free(big_rec);
  296. }
  297. }
  298. return(err);
  299. }
  300. /*********************************************************************//**
  301. Returns TRUE if, in a cascaded update/delete, an ancestor node of the given
  302. node updates (not DELETE, but UPDATE) the given table.
  303. @return TRUE if an ancestor updates the table */
  304. static
  305. ibool
  306. row_ins_cascade_ancestor_updates_table(
  307. /*===================================*/
  308. que_node_t* node, /*!< in: node in a query graph */
  309. dict_table_t* table) /*!< in: table */
  310. {
  311. que_node_t* parent;
  312. for (parent = que_node_get_parent(node);
  313. que_node_get_type(parent) == QUE_NODE_UPDATE;
  314. parent = que_node_get_parent(parent)) {
  315. upd_node_t* upd_node;
  316. upd_node = static_cast<upd_node_t*>(parent);
  317. if (upd_node->table == table && !upd_node->is_delete) {
  318. return(TRUE);
  319. }
  320. }
  321. return(FALSE);
  322. }
  323. /*********************************************************************//**
  324. Returns the number of ancestor UPDATE or DELETE nodes of a
  325. cascaded update/delete node.
  326. @return number of ancestors */
  327. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  328. ulint
  329. row_ins_cascade_n_ancestors(
  330. /*========================*/
  331. que_node_t* node) /*!< in: node in a query graph */
  332. {
  333. que_node_t* parent;
  334. ulint n_ancestors = 0;
  335. for (parent = que_node_get_parent(node);
  336. que_node_get_type(parent) == QUE_NODE_UPDATE;
  337. parent = que_node_get_parent(parent)) {
  338. n_ancestors++;
  339. }
  340. return(n_ancestors);
  341. }
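Both helpers above walk the query graph upwards through que_node_get_parent() for as long as the parent is an UPDATE node. A self-contained sketch of that walk, with a simplified node type in place of the real que_node_t:

#include <cstdio>

struct node_t {
  node_t* parent;
  bool    is_update;  // plays the role of QUE_NODE_UPDATE
};

static unsigned n_update_ancestors(const node_t* node)
{
  unsigned n = 0;
  for (const node_t* p = node->parent; p && p->is_update; p = p->parent)
    n++;
  return n;
}

int main()
{
  node_t grand  = {nullptr, true};
  node_t parent = {&grand,  true};
  node_t child  = {&parent, true};
  // row_ins_foreign_check_on_constraint() compares this count against
  // FK_MAX_CASCADE_DEL to bound the cascade depth.
  printf("%u\n", n_update_ancestors(&child));  // prints 2
  return 0;
}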
  342. /******************************************************************//**
  343. Calculates the update vector node->cascade->update for a child table in
  344. a cascaded update.
  345. @return whether any FULLTEXT INDEX is affected */
  346. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  347. bool
  348. row_ins_cascade_calc_update_vec(
  349. /*============================*/
  350. upd_node_t* node, /*!< in: update node of the parent
  351. table */
  352. dict_foreign_t* foreign, /*!< in: foreign key constraint whose
  353. type is != 0 */
  354. mem_heap_t* heap, /*!< in: memory heap to use as
  355. temporary storage */
  356. trx_t* trx) /*!< in: update transaction */
  357. {
  358. upd_node_t* cascade = node->cascade_node;
  359. dict_table_t* table = foreign->foreign_table;
  360. dict_index_t* index = foreign->foreign_index;
  361. upd_t* update;
  362. dict_table_t* parent_table;
  363. dict_index_t* parent_index;
  364. upd_t* parent_update;
  365. ulint n_fields_updated;
  366. ulint parent_field_no;
  367. ulint i;
  368. ulint j;
  369. bool doc_id_updated = false;
  370. ulint doc_id_pos = 0;
  371. doc_id_t new_doc_id = FTS_NULL_DOC_ID;
  372. ulint prefix_col;
  373. ut_a(cascade);
  374. ut_a(table);
  375. ut_a(index);
  376. /* Calculate the appropriate update vector which will set the fields
  377. in the child index record to the same value (possibly padded with
  378. spaces if the column is a fixed length CHAR or FIXBINARY column) as
  379. the referenced index record will get in the update. */
  380. parent_table = node->table;
  381. ut_a(parent_table == foreign->referenced_table);
  382. parent_index = foreign->referenced_index;
  383. parent_update = node->update;
  384. update = cascade->update;
  385. update->info_bits = 0;
  386. n_fields_updated = 0;
  387. bool affects_fulltext = foreign->affects_fulltext();
  388. if (table->fts) {
  389. doc_id_pos = dict_table_get_nth_col_pos(
  390. table, table->fts->doc_col, &prefix_col);
  391. }
  392. for (i = 0; i < foreign->n_fields; i++) {
  393. parent_field_no = dict_table_get_nth_col_pos(
  394. parent_table,
  395. dict_index_get_nth_col_no(parent_index, i),
  396. &prefix_col);
  397. for (j = 0; j < parent_update->n_fields; j++) {
  398. const upd_field_t* parent_ufield
  399. = &parent_update->fields[j];
  400. if (parent_ufield->field_no == parent_field_no) {
  401. ulint min_size;
  402. const dict_col_t* col;
  403. ulint ufield_len;
  404. upd_field_t* ufield;
  405. col = dict_index_get_nth_col(index, i);
  406. /* A field in the parent index record is
  407. updated. Let us make the update vector
  408. field for the child table. */
  409. ufield = update->fields + n_fields_updated;
  410. ufield->field_no
  411. = dict_table_get_nth_col_pos(
  412. table, dict_col_get_no(col),
  413. &prefix_col);
  414. ufield->orig_len = 0;
  415. ufield->exp = NULL;
  416. ufield->new_val = parent_ufield->new_val;
  417. dfield_get_type(&ufield->new_val)->prtype |=
  418. col->prtype & DATA_VERSIONED;
  419. ufield_len = dfield_get_len(&ufield->new_val);
  420. /* Clear the "external storage" flag */
  421. dfield_set_len(&ufield->new_val, ufield_len);
  422. /* Do not allow a NOT NULL column to be
  423. updated as NULL */
  424. if (dfield_is_null(&ufield->new_val)
  425. && (col->prtype & DATA_NOT_NULL)) {
  426. goto err_exit;
  427. }
  428. /* If the new value would not fit in the
  429. column, do not allow the update */
  430. if (!dfield_is_null(&ufield->new_val)
  431. && dtype_get_at_most_n_mbchars(
  432. col->prtype,
  433. col->mbminlen, col->mbmaxlen,
  434. col->len,
  435. ufield_len,
  436. static_cast<char*>(
  437. dfield_get_data(
  438. &ufield->new_val)))
  439. < ufield_len) {
  440. goto err_exit;
  441. }
  442. /* If the parent column type has a different
  443. length than the child column type, we may
  444. need to pad with spaces the new value of the
  445. child column */
  446. min_size = dict_col_get_min_size(col);
  447. /* Because UNIV_SQL_NULL (the marker
  448. of SQL NULL values) exceeds all possible
  449. values of min_size, the test below will
  450. not hold for SQL NULL columns. */
  451. if (min_size > ufield_len) {
  452. byte* pad;
  453. ulint pad_len;
  454. byte* padded_data;
  455. ulint mbminlen;
  456. padded_data = static_cast<byte*>(
  457. mem_heap_alloc(
  458. heap, min_size));
  459. pad = padded_data + ufield_len;
  460. pad_len = min_size - ufield_len;
  461. memcpy(padded_data,
  462. dfield_get_data(&ufield
  463. ->new_val),
  464. ufield_len);
  465. mbminlen = dict_col_get_mbminlen(col);
  466. ut_ad(!(ufield_len % mbminlen));
  467. ut_ad(!(min_size % mbminlen));
  468. if (mbminlen == 1
  469. && dtype_get_charset_coll(
  470. col->prtype)
  471. == DATA_MYSQL_BINARY_CHARSET_COLL) {
  472. /* Do not pad BINARY columns */
  473. goto err_exit;
  474. }
  475. row_mysql_pad_col(mbminlen,
  476. pad, pad_len);
  477. dfield_set_data(&ufield->new_val,
  478. padded_data, min_size);
  479. }
  480. /* If Doc ID is updated, check whether the
  481. Doc ID is valid */
  482. if (table->fts
  483. && ufield->field_no == doc_id_pos) {
  484. doc_id_t n_doc_id;
  485. n_doc_id =
  486. table->fts->cache->next_doc_id;
  487. new_doc_id = fts_read_doc_id(
  488. static_cast<const byte*>(
  489. dfield_get_data(
  490. &ufield->new_val)));
  491. affects_fulltext = true;
  492. doc_id_updated = true;
  493. if (new_doc_id <= 0) {
  494. ib::error() << "FTS Doc ID"
  495. " must be larger than"
  496. " 0";
  497. goto err_exit;
  498. }
  499. if (new_doc_id < n_doc_id) {
  500. ib::error() << "FTS Doc ID"
  501. " must be larger than "
  502. << n_doc_id - 1
  503. << " for table "
  504. << table->name;
  505. goto err_exit;
  506. }
  507. }
  508. n_fields_updated++;
  509. }
  510. }
  511. }
  512. if (affects_fulltext) {
  513. ut_ad(table->fts);
  514. if (DICT_TF2_FLAG_IS_SET(table, DICT_TF2_FTS_HAS_DOC_ID)) {
  515. doc_id_t doc_id;
  516. doc_id_t* next_doc_id;
  517. upd_field_t* ufield;
  518. next_doc_id = static_cast<doc_id_t*>(mem_heap_alloc(
  519. heap, sizeof(doc_id_t)));
  520. ut_ad(!doc_id_updated);
  521. ufield = update->fields + n_fields_updated;
  522. fts_get_next_doc_id(table, next_doc_id);
  523. doc_id = fts_update_doc_id(table, ufield, next_doc_id);
  524. n_fields_updated++;
  525. fts_trx_add_op(trx, table, doc_id, FTS_INSERT, NULL);
  526. } else {
  527. if (doc_id_updated) {
  528. ut_ad(new_doc_id);
  529. fts_trx_add_op(trx, table, new_doc_id,
  530. FTS_INSERT, NULL);
  531. } else {
  532. ib::error() << "FTS Doc ID must be updated"
  533. " along with FTS indexed column for"
  534. " table " << table->name;
  535. err_exit:
  536. n_fields_updated = ULINT_UNDEFINED;
  537. }
  538. }
  539. }
  540. update->n_fields = n_fields_updated;
  541. return affects_fulltext;
  542. }
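The padding branch above widens the cascaded value to the child column's minimum size with row_mysql_pad_col() when the child is a fixed-length column. A stand-alone sketch of the single-byte-charset case (values are made up for illustration; the real code also refuses to pad BINARY columns, and multi-byte charsets pad differently):

#include <cstdio>
#include <cstring>

int main()
{
  const char   parent_val[] = "abc";  // new value from the parent update
  const size_t ufield_len   = 3;
  const size_t min_size     = 8;      // child is CHAR(8), mbminlen == 1
  char padded[8];
  memcpy(padded, parent_val, ufield_len);
  // row_mysql_pad_col() with mbminlen == 1 writes 0x20 (space) bytes:
  memset(padded + ufield_len, ' ', min_size - ufield_len);
  printf("[%.*s]\n", (int) min_size, padded);  // prints [abc     ]
  return 0;
}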
  543. /*********************************************************************//**
  544. Set detailed error message associated with foreign key errors for
  545. the given transaction. */
  546. static
  547. void
  548. row_ins_set_detailed(
  549. /*=================*/
  550. trx_t* trx, /*!< in: transaction */
  551. dict_foreign_t* foreign) /*!< in: foreign key constraint */
  552. {
  553. ut_ad(!srv_read_only_mode);
  554. mutex_enter(&srv_misc_tmpfile_mutex);
  555. rewind(srv_misc_tmpfile);
  556. if (os_file_set_eof(srv_misc_tmpfile)) {
  557. ut_print_name(srv_misc_tmpfile, trx,
  558. foreign->foreign_table_name);
  559. std::string fk_str = dict_print_info_on_foreign_key_in_create_format(
  560. trx, foreign, FALSE);
  561. fputs(fk_str.c_str(), srv_misc_tmpfile);
  562. trx_set_detailed_error_from_file(trx, srv_misc_tmpfile);
  563. } else {
  564. trx_set_detailed_error(trx, "temp file operation failed");
  565. }
  566. mutex_exit(&srv_misc_tmpfile_mutex);
  567. }
  568. /*********************************************************************//**
  569. Acquires dict_foreign_err_mutex, rewinds dict_foreign_err_file
  570. and displays information about the given transaction.
  571. The caller must release dict_foreign_err_mutex. */
  572. static
  573. void
  574. row_ins_foreign_trx_print(
  575. /*======================*/
  576. trx_t* trx) /*!< in: transaction */
  577. {
  578. ulint n_rec_locks;
  579. ulint n_trx_locks;
  580. ulint heap_size;
  581. ut_ad(!srv_read_only_mode);
  582. lock_mutex_enter();
  583. n_rec_locks = lock_number_of_rows_locked(&trx->lock);
  584. n_trx_locks = UT_LIST_GET_LEN(trx->lock.trx_locks);
  585. heap_size = mem_heap_get_size(trx->lock.lock_heap);
  586. lock_mutex_exit();
  587. mutex_enter(&dict_foreign_err_mutex);
  588. rewind(dict_foreign_err_file);
  589. ut_print_timestamp(dict_foreign_err_file);
  590. fputs(" Transaction:\n", dict_foreign_err_file);
  591. trx_print_low(dict_foreign_err_file, trx, 600,
  592. n_rec_locks, n_trx_locks, heap_size);
  593. ut_ad(mutex_own(&dict_foreign_err_mutex));
  594. }
  595. /*********************************************************************//**
  596. Reports a foreign key error associated with an update or a delete of a
  597. parent table index entry. */
  598. static
  599. void
  600. row_ins_foreign_report_err(
  601. /*=======================*/
  602. const char* errstr, /*!< in: error string from the viewpoint
  603. of the parent table */
  604. que_thr_t* thr, /*!< in: query thread whose run_node
  605. is an update node */
  606. dict_foreign_t* foreign, /*!< in: foreign key constraint */
  607. const rec_t* rec, /*!< in: a matching index record in the
  608. child table */
  609. const dtuple_t* entry) /*!< in: index entry in the parent
  610. table */
  611. {
  612. std::string fk_str;
  613. if (srv_read_only_mode) {
  614. return;
  615. }
  616. FILE* ef = dict_foreign_err_file;
  617. trx_t* trx = thr_get_trx(thr);
  618. row_ins_set_detailed(trx, foreign);
  619. row_ins_foreign_trx_print(trx);
  620. fputs("Foreign key constraint fails for table ", ef);
  621. ut_print_name(ef, trx, foreign->foreign_table_name);
  622. fputs(":\n", ef);
  623. fk_str = dict_print_info_on_foreign_key_in_create_format(trx, foreign,
  624. TRUE);
  625. fputs(fk_str.c_str(), ef);
  626. putc('\n', ef);
  627. fputs(errstr, ef);
  628. fprintf(ef, " in parent table, in index %s",
  629. foreign->referenced_index->name());
  630. if (entry) {
  631. fputs(" tuple:\n", ef);
  632. dtuple_print(ef, entry);
  633. }
  634. fputs("\nBut in child table ", ef);
  635. ut_print_name(ef, trx, foreign->foreign_table_name);
  636. fprintf(ef, ", in index %s", foreign->foreign_index->name());
  637. if (rec) {
  638. fputs(", there is a record:\n", ef);
  639. rec_print(ef, rec, foreign->foreign_index);
  640. } else {
  641. fputs(", the record is not available\n", ef);
  642. }
  643. putc('\n', ef);
  644. mutex_exit(&dict_foreign_err_mutex);
  645. }
  646. /*********************************************************************//**
  647. Reports a foreign key error to dict_foreign_err_file when we are trying
  648. to add an index entry to a child table. Note that the adding may be the result
  649. of an update, too. */
  650. static
  651. void
  652. row_ins_foreign_report_add_err(
  653. /*===========================*/
  654. trx_t* trx, /*!< in: transaction */
  655. dict_foreign_t* foreign, /*!< in: foreign key constraint */
  656. const rec_t* rec, /*!< in: a record in the parent table:
  657. it does not match entry because we
  658. have an error! */
  659. const dtuple_t* entry) /*!< in: index entry to insert in the
  660. child table */
  661. {
  662. std::string fk_str;
  663. if (srv_read_only_mode) {
  664. return;
  665. }
  666. FILE* ef = dict_foreign_err_file;
  667. row_ins_set_detailed(trx, foreign);
  668. row_ins_foreign_trx_print(trx);
  669. fputs("Foreign key constraint fails for table ", ef);
  670. ut_print_name(ef, trx, foreign->foreign_table_name);
  671. fputs(":\n", ef);
  672. fk_str = dict_print_info_on_foreign_key_in_create_format(trx, foreign,
  673. TRUE);
  674. fputs(fk_str.c_str(), ef);
  675. if (foreign->foreign_index) {
  676. fprintf(ef, " in parent table, in index %s",
  677. foreign->foreign_index->name());
  678. } else {
  679. fputs(" in parent table", ef);
  680. }
  681. if (entry) {
  682. fputs(" tuple:\n", ef);
  683. /* TODO: DB_TRX_ID and DB_ROLL_PTR may be uninitialized.
  684. It would be better to only display the user columns. */
  685. dtuple_print(ef, entry);
  686. }
  687. fputs("\nBut in parent table ", ef);
  688. ut_print_name(ef, trx, foreign->referenced_table_name);
  689. fprintf(ef, ", in index %s,\n"
  690. "the closest match we can find is record:\n",
  691. foreign->referenced_index->name());
  692. if (rec && page_rec_is_supremum(rec)) {
  693. /* If the cursor ended on a supremum record, it is better
  694. to report the previous record in the error message, so that
  695. the user gets a more descriptive error message. */
  696. rec = page_rec_get_prev_const(rec);
  697. }
  698. if (rec) {
  699. rec_print(ef, rec, foreign->referenced_index);
  700. }
  701. putc('\n', ef);
  702. mutex_exit(&dict_foreign_err_mutex);
  703. }
  704. /*********************************************************************//**
  705. Invalidate the query cache for the given table. */
  706. static
  707. void
  708. row_ins_invalidate_query_cache(
  709. /*===========================*/
  710. que_thr_t* thr, /*!< in: query thread whose run_node
  711. is an update node */
  712. const char* name) /*!< in: table name prefixed with
  713. database name and a '/' character */
  714. {
  715. innobase_invalidate_query_cache(thr_get_trx(thr), name);
  716. }
  717. /** Fill virtual column information in cascade node for the child table.
  718. @param[out] cascade child update node
  719. @param[in] rec clustered rec of child table
  720. @param[in] index clustered index of child table
  721. @param[in] node parent update node
  722. @param[in] foreign foreign key information
  723. @return error code. */
  724. static
  725. dberr_t
  726. row_ins_foreign_fill_virtual(
  727. upd_node_t* cascade,
  728. const rec_t* rec,
  729. dict_index_t* index,
  730. upd_node_t* node,
  731. dict_foreign_t* foreign)
  732. {
  733. THD* thd = current_thd;
  734. row_ext_t* ext;
  735. rec_offs offsets_[REC_OFFS_NORMAL_SIZE];
  736. rec_offs_init(offsets_);
  737. const rec_offs* offsets =
  738. rec_get_offsets(rec, index, offsets_, index->n_core_fields,
  739. ULINT_UNDEFINED, &cascade->heap);
  740. TABLE* mysql_table= NULL;
  741. upd_t* update = cascade->update;
  742. ulint n_v_fld = index->table->n_v_def;
  743. ulint n_diff;
  744. upd_field_t* upd_field;
  745. dict_vcol_set* v_cols = foreign->v_cols;
  746. update->old_vrow = row_build(
  747. ROW_COPY_DATA, index, rec,
  748. offsets, index->table, NULL, NULL,
  749. &ext, update->heap);
  750. n_diff = update->n_fields;
  751. if (index->table->vc_templ == NULL) {
  752. /** This can occur when there is a cascading
  753. delete or update after restart. */
  754. innobase_init_vc_templ(index->table);
  755. }
  756. ib_vcol_row vc(NULL);
  757. uchar *record = vc.record(thd, index, &mysql_table);
  758. if (!record) {
  759. return DB_OUT_OF_MEMORY;
  760. }
  761. for (ulint i = 0; i < n_v_fld; i++) {
  762. dict_v_col_t* col = dict_table_get_nth_v_col(
  763. index->table, i);
  764. dict_vcol_set::iterator it = v_cols->find(col);
  765. if (it == v_cols->end()) {
  766. continue;
  767. }
  768. dfield_t* vfield = innobase_get_computed_value(
  769. update->old_vrow, col, index,
  770. &vc.heap, update->heap, NULL, thd, mysql_table,
  771. record, NULL, NULL, NULL);
  772. if (vfield == NULL) {
  773. return DB_COMPUTE_VALUE_FAILED;
  774. }
  775. upd_field = update->fields + n_diff;
  776. upd_field->old_v_val = static_cast<dfield_t*>(
  777. mem_heap_alloc(cascade->heap,
  778. sizeof *upd_field->old_v_val));
  779. dfield_copy(upd_field->old_v_val, vfield);
  780. upd_field_set_v_field_no(upd_field, i, index);
  781. bool set_null =
  782. node->is_delete
  783. ? (foreign->type & DICT_FOREIGN_ON_DELETE_SET_NULL)
  784. : (foreign->type & DICT_FOREIGN_ON_UPDATE_SET_NULL);
  785. dfield_t* new_vfield = innobase_get_computed_value(
  786. update->old_vrow, col, index,
  787. &vc.heap, update->heap, NULL, thd,
  788. mysql_table, record, NULL,
  789. set_null ? update : node->update, foreign);
  790. if (new_vfield == NULL) {
  791. return DB_COMPUTE_VALUE_FAILED;
  792. }
  793. dfield_copy(&upd_field->new_val, new_vfield);
  794. if (!dfield_datas_are_binary_equal(
  795. upd_field->old_v_val,
  796. &upd_field->new_val, 0))
  797. n_diff++;
  798. }
  799. update->n_fields = n_diff;
  800. return DB_SUCCESS;
  801. }
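The loop above appends a virtual column to the update vector only when the recomputed value differs binarily from the old one, so n_diff ends up as the final update->n_fields. A toy sketch of that "collect only real changes" pattern:

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

int main()
{
  // (old value, recomputed value) for three virtual columns
  const std::pair<std::string, std::string> cols[] =
      {{"a", "a"}, {"b", "B"}, {"c", "c"}};
  std::vector<unsigned> changed;       // stands in for update->fields
  unsigned field_no = 0;
  for (const auto& c : cols) {
    if (c.first != c.second)           // !dfield_datas_are_binary_equal()
      changed.push_back(field_no);     // the n_diff++ step above
    field_no++;
  }
  printf("%zu\n", changed.size());     // prints 1: only column 1 changed
  return 0;
}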
  802. #ifdef WITH_WSREP
  803. dberr_t wsrep_append_foreign_key(trx_t *trx,
  804. dict_foreign_t* foreign,
  805. const rec_t* clust_rec,
  806. dict_index_t* clust_index,
  807. ibool referenced,
  808. Wsrep_service_key_type key_type);
  809. #endif /* WITH_WSREP */
  810. /*********************************************************************//**
  811. Perform referential actions or checks when a parent row is deleted or updated
  812. and the constraint had an ON DELETE or ON UPDATE condition which was not
  813. RESTRICT.
  814. @return DB_SUCCESS, DB_LOCK_WAIT, or error code */
  815. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  816. dberr_t
  817. row_ins_foreign_check_on_constraint(
  818. /*================================*/
  819. que_thr_t* thr, /*!< in: query thread whose run_node
  820. is an update node */
  821. dict_foreign_t* foreign, /*!< in: foreign key constraint whose
  822. type is != 0 */
  823. btr_pcur_t* pcur, /*!< in: cursor placed on a matching
  824. index record in the child table */
  825. dtuple_t* entry, /*!< in: index entry in the parent
  826. table */
  827. mtr_t* mtr) /*!< in: mtr holding the latch of pcur
  828. page */
  829. {
  830. upd_node_t* node;
  831. upd_node_t* cascade;
  832. dict_table_t* table = foreign->foreign_table;
  833. dict_index_t* index;
  834. dict_index_t* clust_index;
  835. dtuple_t* ref;
  836. const rec_t* rec;
  837. const rec_t* clust_rec;
  838. const buf_block_t* clust_block;
  839. upd_t* update;
  840. dberr_t err;
  841. trx_t* trx;
  842. mem_heap_t* tmp_heap = NULL;
  843. doc_id_t doc_id = FTS_NULL_DOC_ID;
  844. DBUG_ENTER("row_ins_foreign_check_on_constraint");
  845. trx = thr_get_trx(thr);
  846. /* Since we are going to delete or update a row, we have to invalidate
  847. the MySQL query cache for table. A deadlock of threads is not possible
  848. here because the caller of this function does not hold any latches with
  849. the mutex rank above the lock_sys_t::mutex. The query cache mutex
  850. has a rank just above the lock_sys_t::mutex. */
  851. row_ins_invalidate_query_cache(thr, table->name.m_name);
  852. node = static_cast<upd_node_t*>(thr->run_node);
  853. if (node->is_delete && 0 == (foreign->type
  854. & (DICT_FOREIGN_ON_DELETE_CASCADE
  855. | DICT_FOREIGN_ON_DELETE_SET_NULL))) {
  856. row_ins_foreign_report_err("Trying to delete",
  857. thr, foreign,
  858. btr_pcur_get_rec(pcur), entry);
  859. DBUG_RETURN(DB_ROW_IS_REFERENCED);
  860. }
  861. if (!node->is_delete && 0 == (foreign->type
  862. & (DICT_FOREIGN_ON_UPDATE_CASCADE
  863. | DICT_FOREIGN_ON_UPDATE_SET_NULL))) {
  864. /* This is an UPDATE */
  865. row_ins_foreign_report_err("Trying to update",
  866. thr, foreign,
  867. btr_pcur_get_rec(pcur), entry);
  868. DBUG_RETURN(DB_ROW_IS_REFERENCED);
  869. }
  870. if (node->cascade_node == NULL) {
  871. node->cascade_heap = mem_heap_create(128);
  872. node->cascade_node = row_create_update_node_for_mysql(
  873. table, node->cascade_heap);
  874. que_node_set_parent(node->cascade_node, node);
  875. }
  876. cascade = node->cascade_node;
  877. cascade->table = table;
  878. cascade->foreign = foreign;
  879. if (node->is_delete
  880. && (foreign->type & DICT_FOREIGN_ON_DELETE_CASCADE)) {
  881. cascade->is_delete = PLAIN_DELETE;
  882. } else {
  883. cascade->is_delete = NO_DELETE;
  884. if (foreign->n_fields > cascade->update_n_fields) {
  885. /* We have to make the update vector longer */
  886. cascade->update = upd_create(foreign->n_fields,
  887. node->cascade_heap);
  888. cascade->update_n_fields = foreign->n_fields;
  889. }
  890. /* We do not allow cyclic cascaded updating (DELETE is
  891. allowed, but not UPDATE) of the same table, as this
  892. can lead to an infinite cycle. Check that we are not
  893. updating the same table which is already being
  894. modified in this cascade chain. We have to check this
  895. also because the modification of the indexes of a
  896. 'parent' table may still be incomplete, and we must
  897. avoid seeing the indexes of the parent table in an
  898. inconsistent state! */
  899. if (row_ins_cascade_ancestor_updates_table(cascade, table)) {
  900. /* We do not know if this would break foreign key
  901. constraints, but play safe and return an error */
  902. err = DB_ROW_IS_REFERENCED;
  903. row_ins_foreign_report_err(
  904. "Trying an update, possibly causing a cyclic"
  905. " cascaded update\n"
  906. "in the child table,", thr, foreign,
  907. btr_pcur_get_rec(pcur), entry);
  908. goto nonstandard_exit_func;
  909. }
  910. }
  911. if (row_ins_cascade_n_ancestors(cascade) >= FK_MAX_CASCADE_DEL) {
  912. err = DB_FOREIGN_EXCEED_MAX_CASCADE;
  913. row_ins_foreign_report_err(
  914. "Trying a too deep cascaded delete or update\n",
  915. thr, foreign, btr_pcur_get_rec(pcur), entry);
  916. goto nonstandard_exit_func;
  917. }
  918. index = btr_pcur_get_btr_cur(pcur)->index;
  919. ut_a(index == foreign->foreign_index);
  920. rec = btr_pcur_get_rec(pcur);
  921. tmp_heap = mem_heap_create(256);
  922. if (dict_index_is_clust(index)) {
  923. /* pcur is already positioned in the clustered index of
  924. the child table */
  925. clust_index = index;
  926. clust_rec = rec;
  927. clust_block = btr_pcur_get_block(pcur);
  928. } else {
  929. /* We have to look for the record in the clustered index
  930. in the child table */
  931. clust_index = dict_table_get_first_index(table);
  932. ref = row_build_row_ref(ROW_COPY_POINTERS, index, rec,
  933. tmp_heap);
  934. btr_pcur_open_with_no_init(clust_index, ref,
  935. PAGE_CUR_LE, BTR_SEARCH_LEAF,
  936. cascade->pcur, 0, mtr);
  937. clust_rec = btr_pcur_get_rec(cascade->pcur);
  938. clust_block = btr_pcur_get_block(cascade->pcur);
  939. if (!page_rec_is_user_rec(clust_rec)
  940. || btr_pcur_get_low_match(cascade->pcur)
  941. < dict_index_get_n_unique(clust_index)) {
  942. ib::error() << "In cascade of a foreign key op index "
  943. << index->name
  944. << " of table " << index->table->name;
  945. fputs("InnoDB: record ", stderr);
  946. rec_print(stderr, rec, index);
  947. fputs("\n"
  948. "InnoDB: clustered record ", stderr);
  949. rec_print(stderr, clust_rec, clust_index);
  950. fputs("\n"
  951. "InnoDB: Submit a detailed bug report to"
  952. " https://jira.mariadb.org/\n", stderr);
  953. ut_ad(0);
  954. err = DB_SUCCESS;
  955. goto nonstandard_exit_func;
  956. }
  957. }
  958. /* Set an X-lock on the row to delete or update in the child table */
  959. err = lock_table(0, table, LOCK_IX, thr);
  960. if (err == DB_SUCCESS) {
  961. /* Here it suffices to use a LOCK_REC_NOT_GAP type lock;
  962. we already have a normal shared lock on the appropriate
  963. gap if the search criterion was not unique */
  964. err = lock_clust_rec_read_check_and_lock_alt(
  965. 0, clust_block, clust_rec, clust_index,
  966. LOCK_X, LOCK_REC_NOT_GAP, thr);
  967. }
  968. if (err != DB_SUCCESS) {
  969. goto nonstandard_exit_func;
  970. }
  971. if (rec_get_deleted_flag(clust_rec, dict_table_is_comp(table))) {
  972. /* In delete-marked records, DB_TRX_ID must
  973. always refer to an existing undo log record. */
  974. ut_ad(rec_get_trx_id(clust_rec, clust_index));
  975. /* This can happen if there is a circular reference of
  976. rows such that cascading delete comes to delete a row
  977. already in the process of being delete marked */
  978. err = DB_SUCCESS;
  979. goto nonstandard_exit_func;
  980. }
  981. if (table->fts) {
  982. doc_id = fts_get_doc_id_from_rec(
  983. clust_rec, clust_index,
  984. rec_get_offsets(clust_rec, clust_index, NULL,
  985. clust_index->n_core_fields,
  986. ULINT_UNDEFINED, &tmp_heap));
  987. }
  988. if (node->is_delete
  989. ? (foreign->type & DICT_FOREIGN_ON_DELETE_SET_NULL)
  990. : (foreign->type & DICT_FOREIGN_ON_UPDATE_SET_NULL)) {
  991. /* Build the appropriate update vector which sets
  992. foreign->n_fields first fields in rec to SQL NULL */
  993. update = cascade->update;
  994. update->info_bits = 0;
  995. update->n_fields = foreign->n_fields;
  996. MEM_UNDEFINED(update->fields,
  997. update->n_fields * sizeof *update->fields);
  998. for (ulint i = 0; i < foreign->n_fields; i++) {
  999. upd_field_t* ufield = &update->fields[i];
  1000. ulint col_no = dict_index_get_nth_col_no(
  1001. index, i);
  1002. ulint prefix_col;
  1003. ufield->field_no = dict_table_get_nth_col_pos(
  1004. table, col_no, &prefix_col);
  1005. dict_col_t* col = dict_table_get_nth_col(
  1006. table, col_no);
  1007. dict_col_copy_type(col, dfield_get_type(&ufield->new_val));
  1008. ufield->orig_len = 0;
  1009. ufield->exp = NULL;
  1010. dfield_set_null(&ufield->new_val);
  1011. }
  1012. if (foreign->affects_fulltext()) {
  1013. fts_trx_add_op(trx, table, doc_id, FTS_DELETE, NULL);
  1014. }
  1015. if (foreign->v_cols != NULL
  1016. && foreign->v_cols->size() > 0) {
  1017. err = row_ins_foreign_fill_virtual(
  1018. cascade, clust_rec, clust_index,
  1019. node, foreign);
  1020. if (err != DB_SUCCESS) {
  1021. goto nonstandard_exit_func;
  1022. }
  1023. }
  1024. } else if (table->fts && cascade->is_delete == PLAIN_DELETE
  1025. && foreign->affects_fulltext()) {
  1026. /* DICT_FOREIGN_ON_DELETE_CASCADE case */
  1027. fts_trx_add_op(trx, table, doc_id, FTS_DELETE, NULL);
  1028. }
  1029. if (!node->is_delete
  1030. && (foreign->type & DICT_FOREIGN_ON_UPDATE_CASCADE)) {
  1031. /* Build the appropriate update vector which sets changing
  1032. foreign->n_fields first fields in rec to new values */
  1033. bool affects_fulltext = row_ins_cascade_calc_update_vec(
  1034. node, foreign, tmp_heap, trx);
  1035. if (foreign->v_cols && !foreign->v_cols->empty()) {
  1036. err = row_ins_foreign_fill_virtual(
  1037. cascade, clust_rec, clust_index,
  1038. node, foreign);
  1039. if (err != DB_SUCCESS) {
  1040. goto nonstandard_exit_func;
  1041. }
  1042. }
  1043. switch (cascade->update->n_fields) {
  1044. case ULINT_UNDEFINED:
  1045. err = DB_ROW_IS_REFERENCED;
  1046. row_ins_foreign_report_err(
  1047. "Trying a cascaded update where the"
  1048. " updated value in the child\n"
  1049. "table would not fit in the length"
  1050. " of the column, or the value would\n"
  1051. "be NULL and the column is"
  1052. " declared as not NULL in the child table,",
  1053. thr, foreign, btr_pcur_get_rec(pcur), entry);
  1054. goto nonstandard_exit_func;
  1055. case 0:
  1056. /* The update does not change any columns referred
  1057. to in this foreign key constraint: no need to do
  1058. anything */
  1059. err = DB_SUCCESS;
  1060. goto nonstandard_exit_func;
  1061. }
  1062. /* Mark the old Doc ID as deleted */
  1063. if (affects_fulltext) {
  1064. ut_ad(table->fts);
  1065. fts_trx_add_op(trx, table, doc_id, FTS_DELETE, NULL);
  1066. }
  1067. }
  1068. if (table->versioned() && cascade->is_delete != PLAIN_DELETE
  1069. && cascade->update->affects_versioned()) {
  1070. ut_ad(!cascade->historical_heap);
  1071. cascade->historical_heap = mem_heap_create(srv_page_size);
  1072. cascade->historical_row = row_build(
  1073. ROW_COPY_DATA, clust_index, clust_rec, NULL, table,
  1074. NULL, NULL, NULL, cascade->historical_heap);
  1075. }
  1076. /* Store pcur position and initialize or store the cascade node
  1077. pcur stored position */
  1078. btr_pcur_store_position(pcur, mtr);
  1079. if (index == clust_index) {
  1080. btr_pcur_copy_stored_position(cascade->pcur, pcur);
  1081. } else {
  1082. btr_pcur_store_position(cascade->pcur, mtr);
  1083. }
  1084. #ifdef WITH_WSREP
  1085. err = wsrep_append_foreign_key(trx, foreign, clust_rec, clust_index,
  1086. FALSE, WSREP_SERVICE_KEY_EXCLUSIVE);
  1087. if (err != DB_SUCCESS) {
  1088. ib::info() << "WSREP: foreign key append failed: " << err;
  1089. goto nonstandard_exit_func;
  1090. }
  1091. #endif /* WITH_WSREP */
  1092. mtr_commit(mtr);
  1093. ut_a(cascade->pcur->rel_pos == BTR_PCUR_ON);
  1094. cascade->state = UPD_NODE_UPDATE_CLUSTERED;
  1095. err = row_update_cascade_for_mysql(thr, cascade,
  1096. foreign->foreign_table);
  1097. /* Release the data dictionary latch for a while, so that we do not
  1098. starve other threads from doing CREATE TABLE etc. if we have a huge
  1099. cascaded operation running. */
  1100. row_mysql_unfreeze_data_dictionary(thr_get_trx(thr));
  1101. DEBUG_SYNC_C("innodb_dml_cascade_dict_unfreeze");
  1102. row_mysql_freeze_data_dictionary(thr_get_trx(thr));
  1103. mtr_start(mtr);
  1104. /* Restore pcur position */
  1105. btr_pcur_restore_position(BTR_SEARCH_LEAF, pcur, mtr);
  1106. if (tmp_heap) {
  1107. mem_heap_free(tmp_heap);
  1108. }
  1109. DBUG_RETURN(err);
  1110. nonstandard_exit_func:
  1111. if (tmp_heap) {
  1112. mem_heap_free(tmp_heap);
  1113. }
  1114. btr_pcur_store_position(pcur, mtr);
  1115. mtr_commit(mtr);
  1116. mtr_start(mtr);
  1117. btr_pcur_restore_position(BTR_SEARCH_LEAF, pcur, mtr);
  1118. DBUG_RETURN(err);
  1119. }
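Note the cursor discipline at the end of the function: the persistent cursor position is stored before mtr_commit() releases the page latches for the duration of the cascaded operation, and restored under a freshly started mini-transaction afterwards. A stub-level sketch of that save/commit/restart/restore sequence (not the real btr_pcur API):

#include <cassert>

struct mtr_t  { bool active = false; };
struct pcur_t { bool positioned = false; bool stored = false; };

static void mtr_start_stub(mtr_t* m)  { m->active = true; }
static void mtr_commit_stub(mtr_t* m) { m->active = false; }  // latches released
static void pcur_store_stub(pcur_t* p)
{ p->stored = true; p->positioned = false; }
static void pcur_restore_stub(pcur_t* p, const mtr_t* m)
{ assert(m->active && p->stored); p->positioned = true; }

int main()
{
  mtr_t mtr; pcur_t pcur;
  mtr_start_stub(&mtr);
  pcur.positioned = true;          /* cursor on the matching record */
  pcur_store_stub(&pcur);          /* 1. remember the position */
  mtr_commit_stub(&mtr);           /* 2. release latches for the cascade */
  /* ... the cascaded update/delete would run here ... */
  mtr_start_stub(&mtr);            /* 3. start a new mini-transaction */
  pcur_restore_stub(&pcur, &mtr);  /* 4. re-latch and re-position */
  assert(pcur.positioned);
  return 0;
}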
  1120. /*********************************************************************//**
  1121. Sets a shared lock on a record. Used in locking possible duplicate key
  1122. records and also in checking foreign key constraints.
  1123. @return DB_SUCCESS, DB_SUCCESS_LOCKED_REC, or error code */
  1124. static
  1125. dberr_t
  1126. row_ins_set_shared_rec_lock(
  1127. /*========================*/
  1128. ulint type, /*!< in: LOCK_ORDINARY, LOCK_GAP, or
  1129. LOCK_REC_NOT_GAP type lock */
  1130. const buf_block_t* block, /*!< in: buffer block of rec */
  1131. const rec_t* rec, /*!< in: record */
  1132. dict_index_t* index, /*!< in: index */
  1133. const rec_offs* offsets,/*!< in: rec_get_offsets(rec, index) */
  1134. que_thr_t* thr) /*!< in: query thread */
  1135. {
  1136. dberr_t err;
  1137. ut_ad(rec_offs_validate(rec, index, offsets));
  1138. if (dict_index_is_clust(index)) {
  1139. err = lock_clust_rec_read_check_and_lock(
  1140. 0, block, rec, index, offsets, LOCK_S, type, thr);
  1141. } else {
  1142. err = lock_sec_rec_read_check_and_lock(
  1143. 0, block, rec, index, offsets, LOCK_S, type, thr);
  1144. }
  1145. return(err);
  1146. }
  1147. /*********************************************************************//**
  1148. Sets an exclusive lock on a record. Used in locking possible duplicate key
  1149. records.
  1150. @return DB_SUCCESS, DB_SUCCESS_LOCKED_REC, or error code */
  1151. static
  1152. dberr_t
  1153. row_ins_set_exclusive_rec_lock(
  1154. /*===========================*/
  1155. ulint type, /*!< in: LOCK_ORDINARY, LOCK_GAP, or
  1156. LOCK_REC_NOT_GAP type lock */
  1157. const buf_block_t* block, /*!< in: buffer block of rec */
  1158. const rec_t* rec, /*!< in: record */
  1159. dict_index_t* index, /*!< in: index */
  1160. const rec_offs* offsets,/*!< in: rec_get_offsets(rec, index) */
  1161. que_thr_t* thr) /*!< in: query thread */
  1162. {
  1163. dberr_t err;
  1164. ut_ad(rec_offs_validate(rec, index, offsets));
  1165. if (dict_index_is_clust(index)) {
  1166. err = lock_clust_rec_read_check_and_lock(
  1167. 0, block, rec, index, offsets, LOCK_X, type, thr);
  1168. } else {
  1169. err = lock_sec_rec_read_check_and_lock(
  1170. 0, block, rec, index, offsets, LOCK_X, type, thr);
  1171. }
  1172. return(err);
  1173. }
  1174. /***************************************************************//**
  1175. Checks if a foreign key constraint fails for an index entry. Sets shared locks
  1176. which lock either the success or the failure of the constraint. NOTE that
  1177. the caller must have a shared latch on dict_sys.latch.
  1178. @return DB_SUCCESS, DB_NO_REFERENCED_ROW, or DB_ROW_IS_REFERENCED */
  1179. dberr_t
  1180. row_ins_check_foreign_constraint(
  1181. /*=============================*/
  1182. ibool check_ref,/*!< in: TRUE if we want to check that
  1183. the referenced table is ok, FALSE if we
  1184. want to check the foreign key table */
  1185. dict_foreign_t* foreign,/*!< in: foreign constraint; NOTE that the
  1186. tables mentioned in it must be in the
  1187. dictionary cache if they exist at all */
  1188. dict_table_t* table, /*!< in: if check_ref is TRUE, then the foreign
  1189. table, else the referenced table */
  1190. dtuple_t* entry, /*!< in: index entry for index */
  1191. que_thr_t* thr) /*!< in: query thread */
  1192. {
  1193. dberr_t err;
  1194. upd_node_t* upd_node;
  1195. dict_table_t* check_table;
  1196. dict_index_t* check_index;
  1197. ulint n_fields_cmp;
  1198. btr_pcur_t pcur;
  1199. int cmp;
  1200. mtr_t mtr;
  1201. trx_t* trx = thr_get_trx(thr);
  1202. mem_heap_t* heap = NULL;
  1203. rec_offs offsets_[REC_OFFS_NORMAL_SIZE];
  1204. rec_offs* offsets = offsets_;
  1205. bool skip_gap_lock;
  1206. skip_gap_lock = (trx->isolation_level <= TRX_ISO_READ_COMMITTED);
  1207. DBUG_ENTER("row_ins_check_foreign_constraint");
  1208. rec_offs_init(offsets_);
  1209. #ifdef WITH_WSREP
  1210. upd_node= NULL;
  1211. #endif /* WITH_WSREP */
  1212. ut_ad(rw_lock_own(&dict_sys.latch, RW_LOCK_S));
  1213. err = DB_SUCCESS;
  1214. if (trx->check_foreigns == FALSE) {
  1215. /* The user has suppressed foreign key checks currently for
  1216. this session */
  1217. goto exit_func;
  1218. }
  1219. /* If any of the foreign key fields in entry is SQL NULL, we
  1220. suppress the foreign key check: this is compatible with Oracle,
  1221. for example */
  1222. for (ulint i = 0; i < entry->n_fields; i++) {
  1223. dfield_t* field = dtuple_get_nth_field(entry, i);
  1224. if (i < foreign->n_fields && dfield_is_null(field)) {
  1225. goto exit_func;
  1226. }
  1227. /* System Versioning: if row_end != Inf, we
  1228. suppress the foreign key check */
  1229. if (field->type.vers_sys_end() && field->vers_history_row()) {
  1230. goto exit_func;
  1231. }
  1232. }
  1233. if (que_node_get_type(thr->run_node) == QUE_NODE_UPDATE) {
  1234. upd_node = static_cast<upd_node_t*>(thr->run_node);
  1235. if (upd_node->is_delete != PLAIN_DELETE
  1236. && upd_node->foreign == foreign) {
  1237. /* If a cascaded update is done as defined by a
  1238. foreign key constraint, do not check that
  1239. constraint for the child row. In ON UPDATE CASCADE
  1240. the update of the parent row is only half done when
  1241. we come here: if we would check the constraint here
  1242. for the child row it would fail.
  1243. A QUESTION remains: if in the child table there are
  1244. several constraints which refer to the same parent
  1245. table, should we merge all updates to the child into
  1246. one update? And the updates can be contradictory!
  1247. Currently we just perform the update associated
  1248. with each foreign key constraint, one after
  1249. another, and the user has problems predicting in
  1250. which order they are performed. */
  1251. goto exit_func;
  1252. }
  1253. }
  1254. if (que_node_get_type(thr->run_node) == QUE_NODE_INSERT) {
  1255. ins_node_t* insert_node =
  1256. static_cast<ins_node_t*>(thr->run_node);
  1257. dict_table_t* table = insert_node->index->table;
  1258. if (table->versioned()) {
  1259. dfield_t* row_end = dtuple_get_nth_field(
  1260. insert_node->row, table->vers_end);
  1261. if (row_end->vers_history_row()) {
  1262. goto exit_func;
  1263. }
  1264. }
  1265. }
  1266. if (check_ref) {
  1267. check_table = foreign->referenced_table;
  1268. check_index = foreign->referenced_index;
  1269. } else {
  1270. check_table = foreign->foreign_table;
  1271. check_index = foreign->foreign_index;
  1272. }
  1273. if (check_table == NULL
  1274. || !check_table->is_readable()
  1275. || check_index == NULL) {
  1276. FILE* ef = dict_foreign_err_file;
  1277. std::string fk_str;
  1278. row_ins_set_detailed(trx, foreign);
  1279. row_ins_foreign_trx_print(trx);
  1280. fputs("Foreign key constraint fails for table ", ef);
  1281. ut_print_name(ef, trx, check_ref
  1282. ? foreign->foreign_table_name
  1283. : foreign->referenced_table_name);
  1284. fputs(":\n", ef);
  1285. fk_str = dict_print_info_on_foreign_key_in_create_format(
  1286. trx, foreign, TRUE);
  1287. fputs(fk_str.c_str(), ef);
  1288. if (check_ref) {
  1289. if (foreign->foreign_index) {
  1290. fprintf(ef, "\nTrying to add to index %s"
  1291. " tuple:\n",
  1292. foreign->foreign_index->name());
  1293. } else {
  1294. fputs("\nTrying to add tuple:\n", ef);
  1295. }
  1296. dtuple_print(ef, entry);
  1297. fputs("\nBut the parent table ", ef);
  1298. ut_print_name(ef, trx, foreign->referenced_table_name);
  1299. fputs("\nor its .ibd file or the required index does"
  1300. " not currently exist!\n", ef);
  1301. err = DB_NO_REFERENCED_ROW;
  1302. } else {
  1303. if (foreign->referenced_index) {
  1304. fprintf(ef, "\nTrying to modify index %s"
  1305. " tuple:\n",
  1306. foreign->referenced_index->name());
  1307. } else {
  1308. fputs("\nTrying to modify tuple:\n", ef);
  1309. }
  1310. dtuple_print(ef, entry);
  1311. fputs("\nBut the referencing table ", ef);
  1312. ut_print_name(ef, trx, foreign->foreign_table_name);
  1313. fputs("\nor its .ibd file or the required index does"
  1314. " not currently exist!\n", ef);
  1315. err = DB_ROW_IS_REFERENCED;
  1316. }
  1317. mutex_exit(&dict_foreign_err_mutex);
  1318. goto exit_func;
  1319. }
  1320. if (check_table != table) {
  1321. /* We already have a LOCK_IX on table, but not necessarily
  1322. on check_table */
  1323. err = lock_table(0, check_table, LOCK_IS, thr);
  1324. if (err != DB_SUCCESS) {
  1325. goto do_possible_lock_wait;
  1326. }
  1327. }
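/* Sketch of the intent: LOCK_IS only announces that we will take
shared record locks inside check_table; it does not block ordinary
readers or writers, but it conflicts with LOCK_X, so e.g. a concurrent
DROP TABLE on the referenced table must wait until this check has
finished. */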
  1328. mtr_start(&mtr);
  1329. /* Store old value of n_fields_cmp */
  1330. n_fields_cmp = dtuple_get_n_fields_cmp(entry);
  1331. dtuple_set_n_fields_cmp(entry, foreign->n_fields);
  1332. btr_pcur_open(check_index, entry, PAGE_CUR_GE,
  1333. BTR_SEARCH_LEAF, &pcur, &mtr);
  1334. /* Scan index records and check if there is a matching record */
  1335. do {
  1336. const rec_t* rec = btr_pcur_get_rec(&pcur);
  1337. const buf_block_t* block = btr_pcur_get_block(&pcur);
  1338. if (page_rec_is_infimum(rec)) {
  1339. continue;
  1340. }
  1341. offsets = rec_get_offsets(rec, check_index, offsets,
  1342. check_index->n_core_fields,
  1343. ULINT_UNDEFINED, &heap);
  1344. if (page_rec_is_supremum(rec)) {
  1345. if (skip_gap_lock) {
  1346. continue;
  1347. }
  1348. err = row_ins_set_shared_rec_lock(LOCK_ORDINARY, block,
  1349. rec, check_index,
  1350. offsets, thr);
  1351. switch (err) {
  1352. case DB_SUCCESS_LOCKED_REC:
  1353. case DB_SUCCESS:
  1354. continue;
  1355. default:
  1356. goto end_scan;
  1357. }
  1358. }
  1359. cmp = cmp_dtuple_rec(entry, rec, offsets);
  1360. if (cmp == 0) {
  1361. if (check_table->versioned()) {
  1362. bool history_row = false;
  1363. if (check_index->is_primary()) {
  1364. history_row = check_index->
  1365. vers_history_row(rec, offsets);
  1366. } else if (check_index->
  1367. vers_history_row(rec, history_row))
  1368. {
  1369. break;
  1370. }
  1371. if (history_row) {
  1372. continue;
  1373. }
  1374. }
  1375. if (rec_get_deleted_flag(rec,
  1376. rec_offs_comp(offsets))) {
  1377. /* In delete-marked records, DB_TRX_ID must
  1378. always refer to an existing undo log record. */
  1379. ut_ad(!dict_index_is_clust(check_index)
  1380. || row_get_rec_trx_id(rec, check_index,
  1381. offsets));
  1382. err = row_ins_set_shared_rec_lock(
  1383. skip_gap_lock
  1384. ? LOCK_REC_NOT_GAP
  1385. : LOCK_ORDINARY, block,
  1386. rec, check_index, offsets, thr);
  1387. switch (err) {
  1388. case DB_SUCCESS_LOCKED_REC:
  1389. case DB_SUCCESS:
  1390. break;
  1391. default:
  1392. goto end_scan;
  1393. }
  1394. } else {
  1395. /* Found a matching record. Lock only
  1396. a record because we can allow inserts
  1397. into gaps */
  1398. err = row_ins_set_shared_rec_lock(
  1399. LOCK_REC_NOT_GAP, block,
  1400. rec, check_index, offsets, thr);
  1401. switch (err) {
  1402. case DB_SUCCESS_LOCKED_REC:
  1403. case DB_SUCCESS:
  1404. break;
  1405. default:
  1406. goto end_scan;
  1407. }
  1408. if (check_ref) {
  1409. err = DB_SUCCESS;
  1410. #ifdef WITH_WSREP
  1411. err = wsrep_append_foreign_key(
  1412. thr_get_trx(thr),
  1413. foreign,
  1414. rec,
  1415. check_index,
  1416. check_ref,
  1417. (upd_node != NULL
  1418. && wsrep_protocol_version < 4)
  1419. ? WSREP_SERVICE_KEY_SHARED
  1420. : WSREP_SERVICE_KEY_REFERENCE);
  1421. if (err != DB_SUCCESS) {
  1422. fprintf(stderr,
  1423. "WSREP: foreign key append failed: %d\n", err);
  1424. }
  1425. #endif /* WITH_WSREP */
  1426. goto end_scan;
  1427. } else if (foreign->type != 0) {
  1428. /* There is an ON UPDATE or ON DELETE
  1429. condition: check them in a separate
  1430. function */
  1431. err = row_ins_foreign_check_on_constraint(
  1432. thr, foreign, &pcur, entry,
  1433. &mtr);
  1434. if (err != DB_SUCCESS) {
  1435. /* Since reporting a plain
  1436. "duplicate key" error
  1437. message to the user in
  1438. cases where a long CASCADE
  1439. operation would lead to a
  1440. duplicate key in some
  1441. other table is very
  1442. confusing, map duplicate
  1443. key errors resulting from
  1444. FK constraints to a
  1445. separate error code. */
  1446. if (err == DB_DUPLICATE_KEY) {
  1447. err = DB_FOREIGN_DUPLICATE_KEY;
  1448. }
  1449. goto end_scan;
  1450. }
  1451. /* row_ins_foreign_check_on_constraint
  1452. may have repositioned pcur on a
  1453. different block */
  1454. block = btr_pcur_get_block(&pcur);
  1455. } else {
  1456. row_ins_foreign_report_err(
  1457. "Trying to delete or update",
  1458. thr, foreign, rec, entry);
  1459. err = DB_ROW_IS_REFERENCED;
  1460. goto end_scan;
  1461. }
  1462. }
  1463. } else {
  1464. ut_a(cmp < 0);
  1465. err = skip_gap_lock
  1466. ? DB_SUCCESS
  1467. : row_ins_set_shared_rec_lock(
  1468. LOCK_GAP, block,
  1469. rec, check_index, offsets, thr);
  1470. switch (err) {
  1471. case DB_SUCCESS_LOCKED_REC:
  1472. err = DB_SUCCESS;
  1473. /* fall through */
  1474. case DB_SUCCESS:
  1475. if (check_ref) {
  1476. err = DB_NO_REFERENCED_ROW;
  1477. row_ins_foreign_report_add_err(
  1478. trx, foreign, rec, entry);
  1479. }
  1480. default:
  1481. break;
  1482. }
  1483. goto end_scan;
  1484. }
  1485. } while (btr_pcur_move_to_next(&pcur, &mtr));
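/* Falling out of the scan means that no live matching record was
found in check_index. Example (hypothetical schema):
CREATE TABLE parent (a INT PRIMARY KEY) ENGINE=InnoDB;
CREATE TABLE child (a INT, FOREIGN KEY (a) REFERENCES parent (a))
ENGINE=InnoDB;
INSERT INTO child VALUES (42); -- no parent row with a = 42
For such an insert (check_ref) we report DB_NO_REFERENCED_ROW below;
when checking the parent side, finding nothing means success. */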
  1486. if (check_ref) {
  1487. row_ins_foreign_report_add_err(
  1488. trx, foreign, btr_pcur_get_rec(&pcur), entry);
  1489. err = DB_NO_REFERENCED_ROW;
  1490. } else {
  1491. err = DB_SUCCESS;
  1492. }
  1493. end_scan:
  1494. btr_pcur_close(&pcur);
  1495. mtr_commit(&mtr);
  1496. /* Restore old value */
  1497. dtuple_set_n_fields_cmp(entry, n_fields_cmp);
  1498. do_possible_lock_wait:
  1499. if (err == DB_LOCK_WAIT) {
  1500. trx->error_state = err;
  1501. que_thr_stop_for_mysql(thr);
  1502. thr->lock_state = QUE_THR_LOCK_ROW;
  1503. check_table->inc_fk_checks();
  1504. lock_wait_suspend_thread(thr);
  1505. thr->lock_state = QUE_THR_LOCK_NOLOCK;
  1506. err = trx->error_state;
  1507. if (err != DB_SUCCESS) {
  1508. } else if (check_table->to_be_dropped) {
  1509. err = DB_LOCK_WAIT_TIMEOUT;
  1510. } else {
  1511. err = DB_LOCK_WAIT;
  1512. }
  1513. check_table->dec_fk_checks();
  1514. }
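/* Note: fk_checks is incremented around the wait so that check_table
cannot be dropped while this thread sleeps on the lock. On a
successful wait we deliberately reset err to DB_LOCK_WAIT so that the
caller re-runs the whole check; if a drop became pending meanwhile
(to_be_dropped), we give up with DB_LOCK_WAIT_TIMEOUT instead. */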
  1515. exit_func:
  1516. if (heap != NULL) {
  1517. mem_heap_free(heap);
  1518. }
  1519. DBUG_RETURN(err);
  1520. }
  1521. /** Sets the values of the dtuple fields in ref_entry from the values of
  1522. foreign columns in entry.
  1523. @param[in] foreign foreign key constraint
  1524. @param[in] index clustered index
  1525. @param[in] entry tuple of clustered index
  1526. @param[out] ref_entry tuple of foreign columns
  1527. @return true if all foreign key fields are present in the clustered index */
  1528. static
  1529. bool row_ins_foreign_index_entry(dict_foreign_t *foreign,
  1530. const dict_index_t *index,
  1531. const dtuple_t *entry,
  1532. dtuple_t *ref_entry)
  1533. {
  1534. for (ulint i= 0; i < foreign->n_fields; i++)
  1535. {
  1536. for (ulint j= 0; j < index->n_fields; j++)
  1537. {
  1538. const dict_col_t *col= dict_index_get_nth_col(index, j);
  1539. /* A clustered index may contain instantly dropped columns,
  1540. which must be skipped. */
  1541. if (col->is_dropped())
  1542. continue;
  1543. const char *col_name= dict_table_get_col_name(index->table, col->ind);
  1544. if (0 == innobase_strcasecmp(col_name, foreign->foreign_col_names[i]))
  1545. {
  1546. dfield_copy(&ref_entry->fields[i], &entry->fields[j]);
  1547. goto got_match;
  1548. }
  1549. }
  1550. return false;
  1551. got_match:
  1552. continue;
  1553. }
  1554. return true;
  1555. }
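/* Sketch: with PRIMARY KEY (a, b, c) and FOREIGN KEY (c, a), the loop
above fills ref_entry in foreign-key column order from the clustered
index entry, i.e. ref_entry = (entry.c, entry.a). If some foreign key
column cannot be located in the clustered index, the function returns
false and the caller treats the row as having no valid reference. */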
  1556. /***************************************************************//**
  1557. Checks if foreign key constraints fail for an index entry. If the index
  1558. is not mentioned in any constraint, this function does nothing.
  1559. Otherwise, it searches the indexes of the referenced tables and
  1560. sets shared locks which lock either the success or the failure of
  1561. a constraint.
  1562. @return DB_SUCCESS or error code */
  1563. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1564. dberr_t
  1565. row_ins_check_foreign_constraints(
  1566. /*==============================*/
  1567. dict_table_t* table, /*!< in: table */
  1568. dict_index_t* index, /*!< in: index */
  1569. bool pk, /*!< in: index->is_primary() */
  1570. dtuple_t* entry, /*!< in: index entry for index */
  1571. que_thr_t* thr) /*!< in: query thread */
  1572. {
  1573. dict_foreign_t* foreign;
  1574. dberr_t err = DB_SUCCESS;
  1575. trx_t* trx;
  1576. ibool got_s_lock = FALSE;
  1577. mem_heap_t* heap = NULL;
  1578. DBUG_ASSERT(index->is_primary() == pk);
  1579. trx = thr_get_trx(thr);
  1580. DEBUG_SYNC_C_IF_THD(thr_get_trx(thr)->mysql_thd,
  1581. "foreign_constraint_check_for_ins");
  1582. for (dict_foreign_set::iterator it = table->foreign_set.begin();
  1583. err == DB_SUCCESS && it != table->foreign_set.end();
  1584. ++it) {
  1585. foreign = *it;
  1586. if (foreign->foreign_index == index
  1587. || (pk && !foreign->foreign_index)) {
  1588. dtuple_t* ref_tuple = entry;
  1589. if (UNIV_UNLIKELY(!foreign->foreign_index)) {
  1590. /* Change primary key entry to
  1591. foreign key index entry */
  1592. if (!heap) {
  1593. heap = mem_heap_create(1000);
  1594. } else {
  1595. mem_heap_empty(heap);
  1596. }
  1597. ref_tuple = dtuple_create(
  1598. heap, foreign->n_fields);
  1599. dtuple_set_n_fields_cmp(
  1600. ref_tuple, foreign->n_fields);
  1601. if (!row_ins_foreign_index_entry(
  1602. foreign, index, entry, ref_tuple)) {
  1603. err = DB_NO_REFERENCED_ROW;
  1604. break;
  1605. }
  1606. }
  1607. dict_table_t* ref_table = NULL;
  1608. dict_table_t* referenced_table
  1609. = foreign->referenced_table;
  1610. if (referenced_table == NULL) {
  1611. ref_table = dict_table_open_on_name(
  1612. foreign->referenced_table_name_lookup,
  1613. FALSE, FALSE, DICT_ERR_IGNORE_NONE);
  1614. }
  1615. if (0 == trx->dict_operation_lock_mode) {
  1616. got_s_lock = TRUE;
  1617. row_mysql_freeze_data_dictionary(trx);
  1618. }
  1619. if (referenced_table) {
  1620. foreign->foreign_table->inc_fk_checks();
  1621. }
  1622. /* NOTE that if the thread ends up waiting for a lock
  1623. we will release dict_sys.latch temporarily!
  1624. But the counter on the table protects the referenced
  1625. table from being dropped while the check is running. */
  1626. err = row_ins_check_foreign_constraint(
  1627. TRUE, foreign, table, ref_tuple, thr);
  1628. if (referenced_table) {
  1629. foreign->foreign_table->dec_fk_checks();
  1630. }
  1631. if (got_s_lock) {
  1632. row_mysql_unfreeze_data_dictionary(trx);
  1633. }
  1634. if (ref_table != NULL) {
  1635. dict_table_close(ref_table, FALSE, FALSE);
  1636. }
  1637. }
  1638. }
  1639. if (UNIV_LIKELY_NULL(heap)) {
  1640. mem_heap_free(heap);
  1641. }
  1642. return err;
  1643. }
  1644. /***************************************************************//**
  1645. Checks if inserting the index entry would cause a unique key
  1646. violation against rec.
  1647. @return TRUE if error */
  1648. static
  1649. ibool
  1650. row_ins_dupl_error_with_rec(
  1651. /*========================*/
  1652. const rec_t* rec, /*!< in: user record; NOTE that we assume
  1653. that the caller already has a record lock on
  1654. the record! */
  1655. const dtuple_t* entry, /*!< in: entry to insert */
  1656. dict_index_t* index, /*!< in: index */
  1657. const rec_offs* offsets)/*!< in: rec_get_offsets(rec, index) */
  1658. {
  1659. ulint matched_fields;
  1660. ulint n_unique;
  1661. ulint i;
  1662. ut_ad(rec_offs_validate(rec, index, offsets));
  1663. n_unique = dict_index_get_n_unique(index);
  1664. matched_fields = 0;
  1665. cmp_dtuple_rec_with_match(entry, rec, offsets, &matched_fields);
  1666. if (matched_fields < n_unique) {
  1667. return(FALSE);
  1668. }
  1669. /* In a unique secondary index we allow equal key values if they
  1670. contain SQL NULLs */
  1671. if (!dict_index_is_clust(index) && !index->nulls_equal) {
  1672. for (i = 0; i < n_unique; i++) {
  1673. if (dfield_is_null(dtuple_get_nth_field(entry, i))) {
  1674. return(FALSE);
  1675. }
  1676. }
  1677. }
  1678. return(!rec_get_deleted_flag(rec, rec_offs_comp(offsets)));
  1679. }
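/* Example: a UNIQUE secondary index may hold any number of rows whose
key contains an SQL NULL, because NULL is not considered equal to NULL
here (unless index->nulls_equal is set). A delete-marked exact match
is not reported as a duplicate either; the insert is typically turned
into an update of that record instead. */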
  1680. /***************************************************************//**
  1681. Scans a unique non-clustered index at a given index entry to determine
  1682. whether a uniqueness violation has occurred for the key value of the entry.
  1683. Sets shared locks on possible duplicate records.
  1684. @return DB_SUCCESS, DB_DUPLICATE_KEY, or DB_LOCK_WAIT */
  1685. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1686. dberr_t
  1687. row_ins_scan_sec_index_for_duplicate(
  1688. /*=================================*/
  1689. ulint flags, /*!< in: undo logging and locking flags */
  1690. dict_index_t* index, /*!< in: non-clustered unique index */
  1691. dtuple_t* entry, /*!< in: index entry */
  1692. que_thr_t* thr, /*!< in: query thread */
  1693. bool s_latch,/*!< in: whether index->lock is being held */
  1694. mtr_t* mtr, /*!< in/out: mini-transaction */
  1695. mem_heap_t* offsets_heap)
  1696. /*!< in/out: memory heap that can be emptied */
  1697. {
  1698. ulint n_unique;
  1699. int cmp;
  1700. ulint n_fields_cmp;
  1701. btr_pcur_t pcur;
  1702. dberr_t err = DB_SUCCESS;
  1703. ulint allow_duplicates;
  1704. rec_offs offsets_[REC_OFFS_SEC_INDEX_SIZE];
  1705. rec_offs* offsets = offsets_;
  1706. DBUG_ENTER("row_ins_scan_sec_index_for_duplicate");
  1707. rec_offs_init(offsets_);
  1708. ut_ad(s_latch == rw_lock_own_flagged(
  1709. &index->lock, RW_LOCK_FLAG_S | RW_LOCK_FLAG_SX));
  1710. n_unique = dict_index_get_n_unique(index);
  1711. /* If the secondary index is unique, but one of the fields in the
  1712. n_unique first fields is NULL, a unique key violation cannot occur,
  1713. since we define NULL != NULL in this case */
  1714. if (!index->nulls_equal) {
  1715. for (ulint i = 0; i < n_unique; i++) {
  1716. if (UNIV_SQL_NULL == dfield_get_len(
  1717. dtuple_get_nth_field(entry, i))) {
  1718. DBUG_RETURN(DB_SUCCESS);
  1719. }
  1720. }
  1721. }
  1722. /* Store old value of n_fields_cmp */
  1723. n_fields_cmp = dtuple_get_n_fields_cmp(entry);
  1724. dtuple_set_n_fields_cmp(entry, n_unique);
  1725. btr_pcur_open(index, entry, PAGE_CUR_GE,
  1726. s_latch
  1727. ? BTR_SEARCH_LEAF_ALREADY_S_LATCHED
  1728. : BTR_SEARCH_LEAF,
  1729. &pcur, mtr);
  1730. allow_duplicates = thr_get_trx(thr)->duplicates;
  1731. /* Scan index records and check if there is a duplicate */
  1732. do {
  1733. const rec_t* rec = btr_pcur_get_rec(&pcur);
  1734. const buf_block_t* block = btr_pcur_get_block(&pcur);
  1735. const ulint lock_type = LOCK_ORDINARY;
  1736. if (page_rec_is_infimum(rec)) {
  1737. continue;
  1738. }
  1739. offsets = rec_get_offsets(rec, index, offsets,
  1740. index->n_core_fields,
  1741. ULINT_UNDEFINED, &offsets_heap);
  1742. if (flags & BTR_NO_LOCKING_FLAG) {
  1743. /* Set no locks when applying log
  1744. in online table rebuild. */
  1745. } else if (allow_duplicates) {
  1746. /* If the SQL query will update or replace
  1747. the duplicate key, we take an X-lock on the
  1748. duplicates (REPLACE, LOAD DATA INFILE ... REPLACE,
  1749. INSERT ... ON DUPLICATE KEY UPDATE). */
  1750. err = row_ins_set_exclusive_rec_lock(
  1751. lock_type, block, rec, index, offsets, thr);
  1752. } else {
  1753. err = row_ins_set_shared_rec_lock(
  1754. lock_type, block, rec, index, offsets, thr);
  1755. }
  1756. switch (err) {
  1757. case DB_SUCCESS_LOCKED_REC:
  1758. err = DB_SUCCESS;
  /* fall through */
  1759. case DB_SUCCESS:
  1760. break;
  1761. default:
  1762. goto end_scan;
  1763. }
  1764. if (page_rec_is_supremum(rec)) {
  1765. continue;
  1766. }
  1767. cmp = cmp_dtuple_rec(entry, rec, offsets);
  1768. if (cmp == 0) {
  1769. if (row_ins_dupl_error_with_rec(rec, entry,
  1770. index, offsets)) {
  1771. err = DB_DUPLICATE_KEY;
  1772. thr_get_trx(thr)->error_info = index;
  1773. /* If the duplicate is on hidden FTS_DOC_ID,
  1774. state so in the error log */
  1775. if (index == index->table->fts_doc_id_index
  1776. && DICT_TF2_FLAG_IS_SET(
  1777. index->table,
  1778. DICT_TF2_FTS_HAS_DOC_ID)) {
  1779. ib::error() << "Duplicate FTS_DOC_ID"
  1780. " value on table "
  1781. << index->table->name;
  1782. }
  1783. goto end_scan;
  1784. }
  1785. } else {
  1786. ut_a(cmp < 0);
  1787. goto end_scan;
  1788. }
  1789. } while (btr_pcur_move_to_next(&pcur, mtr));
  1790. end_scan:
  1791. /* Restore old value */
  1792. dtuple_set_n_fields_cmp(entry, n_fields_cmp);
  1793. DBUG_RETURN(err);
  1794. }
  1795. /** Checks for a duplicate when the table is being rebuilt online.
  1796. @retval DB_SUCCESS when no duplicate is detected
  1797. @retval DB_SUCCESS_LOCKED_REC when rec is an exact match of entry or
  1798. a newer version of entry (the entry should not be inserted)
  1799. @retval DB_DUPLICATE_KEY when entry is a duplicate of rec */
  1800. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1801. dberr_t
  1802. row_ins_duplicate_online(
  1803. /*=====================*/
  1804. ulint n_uniq, /*!< in: offset of DB_TRX_ID */
  1805. const dtuple_t* entry, /*!< in: entry that is being inserted */
  1806. const rec_t* rec, /*!< in: clustered index record */
  1807. rec_offs* offsets)/*!< in/out: rec_get_offsets(rec) */
  1808. {
  1809. ulint fields = 0;
  1810. /* During rebuild, there should not be any delete-marked rows
  1811. in the new table. */
  1812. ut_ad(!rec_get_deleted_flag(rec, rec_offs_comp(offsets)));
  1813. ut_ad(dtuple_get_n_fields_cmp(entry) == n_uniq);
  1814. /* Compare the PRIMARY KEY fields and the
  1815. DB_TRX_ID, DB_ROLL_PTR. */
  1816. cmp_dtuple_rec_with_match_low(
  1817. entry, rec, offsets, n_uniq + 2, &fields);
  1818. if (fields < n_uniq) {
  1819. /* Not a duplicate. */
  1820. return(DB_SUCCESS);
  1821. }
  1822. ulint trx_id_len;
  1823. if (fields == n_uniq + 2
  1824. && memcmp(rec_get_nth_field(rec, offsets, n_uniq, &trx_id_len),
  1825. reset_trx_id, DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)) {
  1826. ut_ad(trx_id_len == DATA_TRX_ID_LEN);
  1827. /* rec is an exact match of entry, and DB_TRX_ID belongs
  1828. to a transaction that started after our ALTER TABLE. */
  1829. return(DB_SUCCESS_LOCKED_REC);
  1830. }
  1831. return(DB_DUPLICATE_KEY);
  1832. }
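/* Note (sketch): matching on the PRIMARY KEY fields alone is a
genuine duplicate. Matching also on DB_TRX_ID and DB_ROLL_PTR with a
value other than reset_trx_id means that the record was written to the
new table by a transaction that started after the rebuild, i.e. rec is
entry itself or a newer version of it, so the buffered insert must be
skipped (DB_SUCCESS_LOCKED_REC). */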
  1833. /** Checks for a duplicate in the clustered index when the table is being rebuilt online.
  1834. @retval DB_SUCCESS when no duplicate is detected
  1835. @retval DB_SUCCESS_LOCKED_REC when rec is an exact match of entry or
  1836. a newer version of entry (the entry should not be inserted)
  1837. @retval DB_DUPLICATE_KEY when entry is a duplicate of rec */
  1838. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1839. dberr_t
  1840. row_ins_duplicate_error_in_clust_online(
  1841. /*====================================*/
  1842. ulint n_uniq, /*!< in: offset of DB_TRX_ID */
  1843. const dtuple_t* entry, /*!< in: entry that is being inserted */
  1844. const btr_cur_t*cursor, /*!< in: cursor on insert position */
  1845. rec_offs** offsets,/*!< in/out: rec_get_offsets(rec) */
  1846. mem_heap_t** heap) /*!< in/out: heap for offsets */
  1847. {
  1848. dberr_t err = DB_SUCCESS;
  1849. const rec_t* rec = btr_cur_get_rec(cursor);
  1850. ut_ad(!cursor->index->is_instant());
  1851. if (cursor->low_match >= n_uniq && !page_rec_is_infimum(rec)) {
  1852. *offsets = rec_get_offsets(rec, cursor->index, *offsets,
  1853. cursor->index->n_fields,
  1854. ULINT_UNDEFINED, heap);
  1855. err = row_ins_duplicate_online(n_uniq, entry, rec, *offsets);
  1856. if (err != DB_SUCCESS) {
  1857. return(err);
  1858. }
  1859. }
  1860. rec = page_rec_get_next_const(btr_cur_get_rec(cursor));
  1861. if (cursor->up_match >= n_uniq && !page_rec_is_supremum(rec)) {
  1862. *offsets = rec_get_offsets(rec, cursor->index, *offsets,
  1863. cursor->index->n_fields,
  1864. ULINT_UNDEFINED, heap);
  1865. err = row_ins_duplicate_online(n_uniq, entry, rec, *offsets);
  1866. }
  1867. return(err);
  1868. }
  1869. /***************************************************************//**
  1870. Checks if a unique key violation error would occur at an index entry
  1871. insert. Sets shared locks on possible duplicate records. Works only
  1872. for a clustered index!
  1873. @retval DB_SUCCESS if no error
  1874. @retval DB_DUPLICATE_KEY if error,
  1875. @retval DB_LOCK_WAIT if we have to wait for a lock on a possible duplicate
  1876. record */
  1877. static MY_ATTRIBUTE((nonnull, warn_unused_result))
  1878. dberr_t
  1879. row_ins_duplicate_error_in_clust(
  1880. ulint flags, /*!< in: undo logging and locking flags */
  1881. btr_cur_t* cursor, /*!< in: B-tree cursor */
  1882. const dtuple_t* entry, /*!< in: entry to insert */
  1883. que_thr_t* thr) /*!< in: query thread */
  1884. {
  1885. dberr_t err;
  1886. rec_t* rec;
  1887. ulint n_unique;
  1888. trx_t* trx = thr_get_trx(thr);
  1889. mem_heap_t*heap = NULL;
  1890. rec_offs offsets_[REC_OFFS_NORMAL_SIZE];
  1891. rec_offs* offsets = offsets_;
  1892. rec_offs_init(offsets_);
  1893. ut_ad(dict_index_is_clust(cursor->index));
  1894. /* NOTE: For unique non-clustered indexes there may be any number
  1895. of delete marked records with the same value for the non-clustered
  1896. index key (remember multiversioning), and which differ only in
  1897. the row reference part of the index record, containing the
  1898. clustered index key fields. For such a secondary index record,
  1899. to avoid race condition, we must FIRST do the insertion and after
  1900. that check that the uniqueness condition is not breached! */
  1901. /* NOTE: A problem is that node pointers on an upper level of the
  1902. B-tree may match the entry more closely than the actual existing
  1903. user records on the leaf level. So, even if low_match would suggest
  1904. that a duplicate key violation may occur, this may not be the case. */
  1905. n_unique = dict_index_get_n_unique(cursor->index);
  1906. if (cursor->low_match >= n_unique) {
  1907. rec = btr_cur_get_rec(cursor);
  1908. if (!page_rec_is_infimum(rec)) {
  1909. offsets = rec_get_offsets(rec, cursor->index, offsets,
  1910. cursor->index->n_core_fields,
  1911. ULINT_UNDEFINED, &heap);
  1912. /* We set a lock on the possible duplicate: this
  1913. is needed in logical logging of MySQL to make
  1914. sure that in roll-forward we get the same duplicate
  1915. errors as in original execution */
  1916. if (flags & BTR_NO_LOCKING_FLAG) {
  1917. /* Do nothing if no-locking is set */
  1918. err = DB_SUCCESS;
  1919. } else if (trx->duplicates) {
  1920. /* If the SQL query will update or replace
  1921. the duplicate key, we take an X-lock on the
  1922. duplicates (REPLACE, LOAD DATA INFILE ... REPLACE,
  1923. INSERT ... ON DUPLICATE KEY UPDATE). */
  1924. err = row_ins_set_exclusive_rec_lock(
  1925. LOCK_REC_NOT_GAP,
  1926. btr_cur_get_block(cursor),
  1927. rec, cursor->index, offsets, thr);
  1928. } else {
  1929. err = row_ins_set_shared_rec_lock(
  1930. LOCK_REC_NOT_GAP,
  1931. btr_cur_get_block(cursor), rec,
  1932. cursor->index, offsets, thr);
  1933. }
  1934. switch (err) {
  1935. case DB_SUCCESS_LOCKED_REC:
  1936. case DB_SUCCESS:
  1937. break;
  1938. default:
  1939. goto func_exit;
  1940. }
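/* The candidate record is now locked (shared for a plain INSERT,
exclusive for REPLACE or INSERT ... ON DUPLICATE KEY UPDATE), so the
duplicate decision below cannot be invalidated by a concurrent
transaction committing or rolling back underneath us. */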
  1941. if (row_ins_dupl_error_with_rec(
  1942. rec, entry, cursor->index, offsets)) {
  1943. duplicate:
  1944. trx->error_info = cursor->index;
  1945. err = DB_DUPLICATE_KEY;
  1946. if (cursor->index->table->versioned()
  1947. && entry->vers_history_row())
  1948. {
  1949. ulint trx_id_len;
  1950. byte *trx_id = rec_get_nth_field(
  1951. rec, offsets, n_unique,
  1952. &trx_id_len);
  1953. ut_ad(trx_id_len == DATA_TRX_ID_LEN);
  1954. if (trx->id == trx_read_trx_id(trx_id)) {
  1955. err = DB_FOREIGN_DUPLICATE_KEY;
  1956. }
  1957. }
  1958. goto func_exit;
  1959. }
  1960. }
  1961. }
  1962. if (cursor->up_match >= n_unique) {
  1963. rec = page_rec_get_next(btr_cur_get_rec(cursor));
  1964. if (!page_rec_is_supremum(rec)) {
  1965. offsets = rec_get_offsets(rec, cursor->index, offsets,
  1966. cursor->index->n_core_fields,
  1967. ULINT_UNDEFINED, &heap);
  1968. if (trx->duplicates) {
  1969. /* If the SQL query will update or replace
  1970. the duplicate key, we take an X-lock on the
  1971. duplicates (REPLACE, LOAD DATA INFILE ... REPLACE,
  1972. INSERT ... ON DUPLICATE KEY UPDATE). */
  1973. err = row_ins_set_exclusive_rec_lock(
  1974. LOCK_REC_NOT_GAP,
  1975. btr_cur_get_block(cursor),
  1976. rec, cursor->index, offsets, thr);
  1977. } else {
  1978. err = row_ins_set_shared_rec_lock(
  1979. LOCK_REC_NOT_GAP,
  1980. btr_cur_get_block(cursor),
  1981. rec, cursor->index, offsets, thr);
  1982. }
  1983. switch (err) {
  1984. case DB_SUCCESS_LOCKED_REC:
  1985. case DB_SUCCESS:
  1986. break;
  1987. default:
  1988. goto func_exit;
  1989. }
  1990. if (row_ins_dupl_error_with_rec(
  1991. rec, entry, cursor->index, offsets)) {
  1992. goto duplicate;
  1993. }
  1994. }
  1995. /* This should never happen */
  1996. ut_error;
  1997. }
  1998. err = DB_SUCCESS;
  1999. func_exit:
  2000. if (UNIV_LIKELY_NULL(heap)) {
  2001. mem_heap_free(heap);
  2002. }
  2003. return(err);
  2004. }
  2005. /***************************************************************//**
  2006. Checks if an index entry has long enough common prefix with an
  2007. existing record so that the intended insert of the entry must be
  2008. changed to a modify of the existing record. In the case of a clustered
  2009. index, the prefix must be n_unique fields long. In the case of a
  2010. secondary index, all fields must be equal. InnoDB never updates
  2011. secondary index records in place, other than clearing or setting the
  2012. delete-mark flag. We might be able to update the non-unique fields
  2013. of a unique secondary index record by checking cursor->up_match,
  2014. but we do not do so, because it could have locking implications.
  2015. @return TRUE if the existing record should be updated; FALSE if not */
  2016. UNIV_INLINE
  2017. ibool
  2018. row_ins_must_modify_rec(
  2019. /*====================*/
  2020. const btr_cur_t* cursor) /*!< in: B-tree cursor */
  2021. {
  2022. /* NOTE: (compare to the note in row_ins_duplicate_error_in_clust)
  2023. Because node pointers on upper levels of the B-tree may match the
  2024. entry more closely than actual user records on the leaf level, we
  2025. have to check if the candidate record is actually a user record.
  2026. A clustered index node pointer contains index->n_unique first fields,
  2027. and a secondary index node pointer contains all index fields. */
  2028. return(cursor->low_match
  2029. >= dict_index_get_n_unique_in_tree(cursor->index)
  2030. && !page_rec_is_infimum(btr_cur_get_rec(cursor)));
  2031. }
  2032. /** Insert the externally stored fields (off-page columns)
  2033. of a clustered index entry.
  2034. @param[in] entry index entry to insert
  2035. @param[in] big_rec externally stored fields
  2036. @param[in,out] offsets rec_get_offsets()
  2037. @param[in,out] heap memory heap
  2038. @param[in] thd client connection, or NULL
  2039. @param[in] index clustered index
  2040. @return error code
  2041. @retval DB_SUCCESS
  2042. @retval DB_OUT_OF_FILE_SPACE */
  2043. static
  2044. dberr_t
  2045. row_ins_index_entry_big_rec(
  2046. const dtuple_t* entry,
  2047. const big_rec_t* big_rec,
  2048. rec_offs* offsets,
  2049. mem_heap_t** heap,
  2050. dict_index_t* index,
  2051. const void* thd __attribute__((unused)))
  2052. {
  2053. mtr_t mtr;
  2054. btr_pcur_t pcur;
  2055. rec_t* rec;
  2056. dberr_t error;
  2057. ut_ad(dict_index_is_clust(index));
  2058. DEBUG_SYNC_C_IF_THD(thd, "before_row_ins_extern_latch");
  2059. mtr.start();
  2060. if (index->table->is_temporary()) {
  2061. mtr.set_log_mode(MTR_LOG_NO_REDO);
  2062. } else {
  2063. index->set_modified(mtr);
  2064. }
  2065. btr_pcur_open(index, entry, PAGE_CUR_LE, BTR_MODIFY_TREE,
  2066. &pcur, &mtr);
  2067. rec = btr_pcur_get_rec(&pcur);
  2068. offsets = rec_get_offsets(rec, index, offsets, index->n_core_fields,
  2069. ULINT_UNDEFINED, heap);
  2070. DEBUG_SYNC_C_IF_THD(thd, "before_row_ins_extern");
  2071. error = btr_store_big_rec_extern_fields(
  2072. &pcur, offsets, big_rec, &mtr, BTR_STORE_INSERT);
  2073. DEBUG_SYNC_C_IF_THD(thd, "after_row_ins_extern");
  2074. if (error == DB_SUCCESS
  2075. && dict_index_is_online_ddl(index)) {
  2076. row_log_table_insert(btr_pcur_get_rec(&pcur), index, offsets);
  2077. }
  2078. mtr.commit();
  2079. btr_pcur_close(&pcur);
  2080. return(error);
  2081. }
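/* Note: the off-page (BLOB) columns are written in the separate
mini-transaction above, after the record itself has already been
inserted. Until btr_store_big_rec_extern_fields() completes, the
clustered index record carries incomplete external field references,
which is why online rebuild logs the record only afterwards. */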
  2082. /***************************************************************//**
  2083. Tries to insert an entry into a clustered index, ignoring foreign key
  2084. constraints. If a record with the same unique key is found, the other
  2085. record is necessarily delete-marked by a committed transaction, or a
  2086. unique key violation error occurs. The delete-marked record is then
  2087. updated to hold the new entry, and we must write an undo log record
  2088. for the delete-marked record.
  2089. @retval DB_SUCCESS on success
  2090. @retval DB_LOCK_WAIT on lock wait when !(flags & BTR_NO_LOCKING_FLAG)
  2091. @retval DB_FAIL if retry with BTR_MODIFY_TREE is needed
  2092. @return error code */
  2093. dberr_t
  2094. row_ins_clust_index_entry_low(
  2095. /*==========================*/
  2096. ulint flags, /*!< in: undo logging and locking flags */
  2097. ulint mode, /*!< in: BTR_MODIFY_LEAF or BTR_MODIFY_TREE,
  2098. depending on whether we wish optimistic or
  2099. pessimistic descent down the index tree */
  2100. dict_index_t* index, /*!< in: clustered index */
  2101. ulint n_uniq, /*!< in: 0 or index->n_uniq */
  2102. dtuple_t* entry, /*!< in/out: index entry to insert */
  2103. ulint n_ext, /*!< in: number of externally stored columns */
  2104. que_thr_t* thr) /*!< in: query thread */
  2105. {
  2106. btr_pcur_t pcur;
  2107. btr_cur_t* cursor;
  2108. dberr_t err = DB_SUCCESS;
  2109. big_rec_t* big_rec = NULL;
  2110. mtr_t mtr;
  2111. ib_uint64_t auto_inc = 0;
  2112. mem_heap_t* offsets_heap = NULL;
  2113. rec_offs offsets_[REC_OFFS_NORMAL_SIZE];
  2114. rec_offs* offsets = offsets_;
  2115. rec_offs_init(offsets_);
  2116. DBUG_ENTER("row_ins_clust_index_entry_low");
  2117. ut_ad(dict_index_is_clust(index));
  2118. ut_ad(!dict_index_is_unique(index)
  2119. || n_uniq == dict_index_get_n_unique(index));
  2120. ut_ad(!n_uniq || n_uniq == dict_index_get_n_unique(index));
  2121. ut_ad(!thr_get_trx(thr)->in_rollback);
  2122. mtr_start(&mtr);
  2123. if (index->table->is_temporary()) {
  2124. /* Disable REDO logging as the lifetime of temp-tables is
  2125. limited to server or connection lifetime and so REDO
  2126. information is not needed on restart for recovery.
  2127. Disable locking as temp-tables are local to a connection. */
  2128. ut_ad(flags & BTR_NO_LOCKING_FLAG);
  2129. ut_ad(!dict_index_is_online_ddl(index));
  2130. ut_ad(!index->table->persistent_autoinc);
  2131. ut_ad(!index->is_instant());
  2132. mtr.set_log_mode(MTR_LOG_NO_REDO);
  2133. } else {
  2134. index->set_modified(mtr);
  2135. if (UNIV_UNLIKELY(entry->is_metadata())) {
  2136. ut_ad(index->is_instant());
  2137. ut_ad(!dict_index_is_online_ddl(index));
  2138. ut_ad(mode == BTR_MODIFY_TREE);
  2139. } else {
  2140. if (mode == BTR_MODIFY_LEAF
  2141. && dict_index_is_online_ddl(index)) {
  2142. mode = BTR_MODIFY_LEAF_ALREADY_S_LATCHED;
  2143. mtr_s_lock_index(index, &mtr);
  2144. }
  2145. if (unsigned ai = index->table->persistent_autoinc) {
  2146. /* Prepare to persist the AUTO_INCREMENT value
  2147. from the index entry to PAGE_ROOT_AUTO_INC. */
  2148. const dfield_t* dfield = dtuple_get_nth_field(
  2149. entry, ai - 1);
  2150. if (!dfield_is_null(dfield)) {
  2151. auto_inc = row_parse_int(
  2152. static_cast<const byte*>(
  2153. dfield->data),
  2154. dfield->len,
  2155. dfield->type.mtype,
  2156. dfield->type.prtype
  2157. & DATA_UNSIGNED);
  2158. }
  2159. }
  2160. }
  2161. }
  2162. /* Note that we use PAGE_CUR_LE as the search mode, because then
  2163. the function will return sensible values in both low_match and
  2164. up_match of the cursor */
  2165. err = btr_pcur_open_low(index, 0, entry, PAGE_CUR_LE, mode, &pcur,
  2166. __FILE__, __LINE__, auto_inc, &mtr);
  2167. if (err != DB_SUCCESS) {
  2168. index->table->file_unreadable = true;
  2169. mtr.commit();
  2170. goto func_exit;
  2171. }
  2172. cursor = btr_pcur_get_btr_cur(&pcur);
  2173. cursor->thr = thr;
  2174. #ifdef UNIV_DEBUG
  2175. {
  2176. page_t* page = btr_cur_get_page(cursor);
  2177. rec_t* first_rec = page_rec_get_next(
  2178. page_get_infimum_rec(page));
  2179. ut_ad(page_rec_is_supremum(first_rec)
  2180. || rec_n_fields_is_sane(index, first_rec, entry));
  2181. }
  2182. #endif /* UNIV_DEBUG */
  2183. if (UNIV_UNLIKELY(entry->info_bits != 0)) {
  2184. ut_ad(entry->is_metadata());
  2185. ut_ad(flags == BTR_NO_LOCKING_FLAG);
  2186. ut_ad(index->is_instant());
  2187. ut_ad(!dict_index_is_online_ddl(index));
  2188. const rec_t* rec = btr_cur_get_rec(cursor);
  2189. if (rec_get_info_bits(rec, page_rec_is_comp(rec))
  2190. & REC_INFO_MIN_REC_FLAG) {
  2191. thr_get_trx(thr)->error_info = index;
  2192. err = DB_DUPLICATE_KEY;
  2193. goto err_exit;
  2194. }
  2195. ut_ad(!row_ins_must_modify_rec(cursor));
  2196. goto do_insert;
  2197. }
  2198. if (rec_is_metadata(btr_cur_get_rec(cursor), *index)) {
  2199. goto do_insert;
  2200. }
  2201. if (n_uniq
  2202. && (cursor->up_match >= n_uniq || cursor->low_match >= n_uniq)) {
  2203. if (flags
  2204. == (BTR_CREATE_FLAG | BTR_NO_LOCKING_FLAG
  2205. | BTR_NO_UNDO_LOG_FLAG | BTR_KEEP_SYS_FLAG)) {
  2206. /* Set no locks when applying log
  2207. in online table rebuild. Only check for duplicates. */
  2208. err = row_ins_duplicate_error_in_clust_online(
  2209. n_uniq, entry, cursor,
  2210. &offsets, &offsets_heap);
  2211. switch (err) {
  2212. case DB_SUCCESS:
  2213. break;
  2214. default:
  2215. ut_ad(0);
  2216. /* fall through */
  2217. case DB_SUCCESS_LOCKED_REC:
  2218. case DB_DUPLICATE_KEY:
  2219. thr_get_trx(thr)->error_info = cursor->index;
  2220. }
  2221. } else {
  2222. /* Note that the following may return also
  2223. DB_LOCK_WAIT */
  2224. err = row_ins_duplicate_error_in_clust(
  2225. flags, cursor, entry, thr);
  2226. }
  2227. if (err != DB_SUCCESS) {
  2228. err_exit:
  2229. mtr_commit(&mtr);
  2230. goto func_exit;
  2231. }
  2232. }
  2233. /* Note: Allowing duplicates would qualify for modification of
  2234. an existing record, as the new entry is exactly the same as the old entry. */
  2235. if (row_ins_must_modify_rec(cursor)) {
  2236. /* There is already an index entry with a long enough common
  2237. prefix, we must convert the insert into a modify of an
  2238. existing record */
  2239. mem_heap_t* entry_heap = mem_heap_create(1024);
  2240. err = row_ins_clust_index_entry_by_modify(
  2241. &pcur, flags, mode, &offsets, &offsets_heap,
  2242. entry_heap, entry, thr, &mtr);
  2243. if (err == DB_SUCCESS && dict_index_is_online_ddl(index)) {
  2244. row_log_table_insert(btr_cur_get_rec(cursor),
  2245. index, offsets);
  2246. }
  2247. mtr_commit(&mtr);
  2248. mem_heap_free(entry_heap);
  2249. } else {
  2250. if (index->is_instant()) entry->trim(*index);
  2251. do_insert:
  2252. rec_t* insert_rec;
  2253. if (mode != BTR_MODIFY_TREE) {
  2254. ut_ad((mode & ulint(~BTR_ALREADY_S_LATCHED))
  2255. == BTR_MODIFY_LEAF);
  2256. err = btr_cur_optimistic_insert(
  2257. flags, cursor, &offsets, &offsets_heap,
  2258. entry, &insert_rec, &big_rec,
  2259. n_ext, thr, &mtr);
  2260. } else {
  2261. if (buf_LRU_buf_pool_running_out()) {
  2262. err = DB_LOCK_TABLE_FULL;
  2263. goto err_exit;
  2264. }
  2265. DEBUG_SYNC_C("before_insert_pessimitic_row_ins_clust");
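/* Even in BTR_MODIFY_TREE mode, first try the cheap optimistic
insert; only if it returns DB_FAIL (no room on the leaf page, so a
page split is needed) do we fall back to the pessimistic variant,
which may allocate and split pages. */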
  2266. err = btr_cur_optimistic_insert(
  2267. flags, cursor,
  2268. &offsets, &offsets_heap,
  2269. entry, &insert_rec, &big_rec,
  2270. n_ext, thr, &mtr);
  2271. if (err == DB_FAIL) {
  2272. err = btr_cur_pessimistic_insert(
  2273. flags, cursor,
  2274. &offsets, &offsets_heap,
  2275. entry, &insert_rec, &big_rec,
  2276. n_ext, thr, &mtr);
  2277. }
  2278. }
  2279. if (big_rec != NULL) {
  2280. mtr_commit(&mtr);
  2281. /* Online table rebuild could read (and
  2282. ignore) the incomplete record at this point.
  2283. If online rebuild is in progress, the
  2284. row_ins_index_entry_big_rec() will write log. */
  2285. DBUG_EXECUTE_IF(
  2286. "row_ins_extern_checkpoint",
  2287. log_write_up_to(mtr.commit_lsn(), true););
  2288. err = row_ins_index_entry_big_rec(
  2289. entry, big_rec, offsets, &offsets_heap, index,
  2290. thr_get_trx(thr)->mysql_thd);
  2291. dtuple_convert_back_big_rec(index, entry, big_rec);
  2292. } else {
  2293. if (err == DB_SUCCESS
  2294. && dict_index_is_online_ddl(index)) {
  2295. row_log_table_insert(
  2296. insert_rec, index, offsets);
  2297. }
  2298. mtr_commit(&mtr);
  2299. }
  2300. }
  2301. func_exit:
  2302. if (offsets_heap != NULL) {
  2303. mem_heap_free(offsets_heap);
  2304. }
  2305. btr_pcur_close(&pcur);
  2306. DBUG_RETURN(err);
  2307. }
  2308. /** Start a mini-transaction and check if the index will be dropped.
  2309. @param[in,out] mtr mini-transaction
  2310. @param[in,out] index secondary index
  2311. @param[in] check whether to check
  2312. @param[in] search_mode flags
  2313. @return true if the index is to be dropped */
  2314. static MY_ATTRIBUTE((warn_unused_result))
  2315. bool
  2316. row_ins_sec_mtr_start_and_check_if_aborted(
  2317. mtr_t* mtr,
  2318. dict_index_t* index,
  2319. bool check,
  2320. ulint search_mode)
  2321. {
  2322. ut_ad(!dict_index_is_clust(index));
  2323. ut_ad(mtr->is_named_space(index->table->space));
  2324. const mtr_log_t log_mode = mtr->get_log_mode();
  2325. mtr->start();
  2326. index->set_modified(*mtr);
  2327. mtr->set_log_mode(log_mode);
  2328. if (!check) {
  2329. return(false);
  2330. }
  2331. if (search_mode & BTR_ALREADY_S_LATCHED) {
  2332. mtr_s_lock_index(index, mtr);
  2333. } else {
  2334. mtr_sx_lock_index(index, mtr);
  2335. }
  2336. switch (index->online_status) {
  2337. case ONLINE_INDEX_ABORTED:
  2338. case ONLINE_INDEX_ABORTED_DROPPED:
  2339. ut_ad(!index->is_committed());
  2340. return(true);
  2341. case ONLINE_INDEX_COMPLETE:
  2342. return(false);
  2343. case ONLINE_INDEX_CREATION:
  2344. break;
  2345. }
  2346. ut_error;
  2347. return(true);
  2348. }
  2349. /***************************************************************//**
  2350. Tries to insert an entry into a secondary index. If a record with exactly the
  2351. same fields is found, the other record is necessarily marked deleted.
  2352. It is then unmarked. Otherwise, the entry is simply inserted into the index.
  2353. @retval DB_SUCCESS on success
  2354. @retval DB_LOCK_WAIT on lock wait when !(flags & BTR_NO_LOCKING_FLAG)
  2355. @retval DB_FAIL if retry with BTR_MODIFY_TREE is needed
  2356. @return error code */
  2357. dberr_t
  2358. row_ins_sec_index_entry_low(
  2359. /*========================*/
  2360. ulint flags, /*!< in: undo logging and locking flags */
  2361. ulint mode, /*!< in: BTR_MODIFY_LEAF or BTR_MODIFY_TREE,
  2362. depending on whether we wish optimistic or
  2363. pessimistic descent down the index tree */
  2364. dict_index_t* index, /*!< in: secondary index */
  2365. mem_heap_t* offsets_heap,
  2366. /*!< in/out: memory heap that can be emptied */
  2367. mem_heap_t* heap, /*!< in/out: memory heap */
  2368. dtuple_t* entry, /*!< in/out: index entry to insert */
  2369. trx_id_t trx_id, /*!< in: PAGE_MAX_TRX_ID during
  2370. row_log_table_apply(), or 0 */
  2371. que_thr_t* thr) /*!< in: query thread */
  2372. {
  2373. DBUG_ENTER("row_ins_sec_index_entry_low");
  2374. btr_cur_t cursor;
  2375. ulint search_mode = mode;
  2376. dberr_t err = DB_SUCCESS;
  2377. ulint n_unique;
  2378. mtr_t mtr;
  2379. rec_offs offsets_[REC_OFFS_NORMAL_SIZE];
  2380. rec_offs* offsets = offsets_;
  2381. rec_offs_init(offsets_);
  2382. rtr_info_t rtr_info;
  2383. ut_ad(!dict_index_is_clust(index));
  2384. ut_ad(mode == BTR_MODIFY_LEAF || mode == BTR_MODIFY_TREE);
  2385. cursor.thr = thr;
  2386. cursor.rtr_info = NULL;
  2387. ut_ad(thr_get_trx(thr)->id != 0);
  2388. mtr.start();
  2389. if (index->table->is_temporary()) {
  2390. /* Disable locking, because temporary tables are never
  2391. shared between transactions or connections. */
  2392. ut_ad(flags & BTR_NO_LOCKING_FLAG);
  2393. mtr.set_log_mode(MTR_LOG_NO_REDO);
  2394. } else {
  2395. index->set_modified(mtr);
  2396. if (!dict_index_is_spatial(index)) {
  2397. search_mode |= BTR_INSERT;
  2398. }
  2399. }
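/* With BTR_INSERT set, the descent below may buffer the insert in
the change buffer instead of reading a non-resident leaf page; that
case is detected via cursor.flag == BTR_CUR_INSERT_TO_IBUF further
down, and no index page is modified here. */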
  2400. /* Ensure that we acquire index->lock when inserting into an
  2401. index with index->online_status == ONLINE_INDEX_COMPLETE, but
  2402. could still be subject to rollback_inplace_alter_table().
  2403. This prevents a concurrent change of index->online_status.
  2404. The memory object cannot be freed as long as we have an open
  2405. reference to the table, or index->table->n_ref_count > 0. */
  2406. const bool check = !index->is_committed();
  2407. if (check) {
  2408. DEBUG_SYNC_C("row_ins_sec_index_enter");
  2409. if (mode == BTR_MODIFY_LEAF) {
  2410. search_mode |= BTR_ALREADY_S_LATCHED;
  2411. mtr_s_lock_index(index, &mtr);
  2412. } else {
  2413. mtr_sx_lock_index(index, &mtr);
  2414. }
  2415. if (row_log_online_op_try(
  2416. index, entry, thr_get_trx(thr)->id)) {
  2417. goto func_exit;
  2418. }
  2419. }
  2420. /* Note that we use PAGE_CUR_LE as the search mode, because then
  2421. the function will return sensible values in both low_match and
  2422. up_match of the cursor */
  2423. if (!thr_get_trx(thr)->check_unique_secondary) {
  2424. search_mode |= BTR_IGNORE_SEC_UNIQUE;
  2425. }
  2426. if (dict_index_is_spatial(index)) {
  2427. cursor.index = index;
  2428. rtr_init_rtr_info(&rtr_info, false, &cursor, index, false);
  2429. rtr_info_update_btr(&cursor, &rtr_info);
  2430. err = btr_cur_search_to_nth_level(
  2431. index, 0, entry, PAGE_CUR_RTREE_INSERT,
  2432. search_mode,
  2433. &cursor, 0, __FILE__, __LINE__, &mtr);
  2434. if (mode == BTR_MODIFY_LEAF && rtr_info.mbr_adj) {
  2435. mtr_commit(&mtr);
  2436. rtr_clean_rtr_info(&rtr_info, true);
  2437. rtr_init_rtr_info(&rtr_info, false, &cursor,
  2438. index, false);
  2439. rtr_info_update_btr(&cursor, &rtr_info);
  2440. mtr_start(&mtr);
  2441. index->set_modified(mtr);
  2442. search_mode &= ulint(~BTR_MODIFY_LEAF);
  2443. search_mode |= BTR_MODIFY_TREE;
  2444. err = btr_cur_search_to_nth_level(
  2445. index, 0, entry, PAGE_CUR_RTREE_INSERT,
  2446. search_mode,
  2447. &cursor, 0, __FILE__, __LINE__, &mtr);
  2448. mode = BTR_MODIFY_TREE;
  2449. }
  2450. DBUG_EXECUTE_IF(
  2451. "rtree_test_check_count", {
  2452. goto func_exit;});
  2453. } else {
  2454. err = btr_cur_search_to_nth_level(
  2455. index, 0, entry, PAGE_CUR_LE,
  2456. search_mode,
  2457. &cursor, 0, __FILE__, __LINE__, &mtr);
  2458. }
  2459. if (err != DB_SUCCESS) {
  2460. if (err == DB_DECRYPTION_FAILED) {
  2461. ib_push_warning(thr_get_trx(thr)->mysql_thd,
  2462. DB_DECRYPTION_FAILED,
  2463. "Table %s is encrypted but encryption service or"
  2464. " used key_id is not available. "
  2465. " Can't continue reading table.",
  2466. index->table->name.m_name);
  2467. index->table->file_unreadable = true;
  2468. }
  2469. goto func_exit;
  2470. }
  2471. if (cursor.flag == BTR_CUR_INSERT_TO_IBUF) {
  2472. ut_ad(!dict_index_is_spatial(index));
  2473. /* The insert was buffered during the search: we are done */
  2474. goto func_exit;
  2475. }
  2476. #ifdef UNIV_DEBUG
  2477. {
  2478. page_t* page = btr_cur_get_page(&cursor);
  2479. rec_t* first_rec = page_rec_get_next(
  2480. page_get_infimum_rec(page));
  2481. ut_ad(page_rec_is_supremum(first_rec)
  2482. || rec_n_fields_is_sane(index, first_rec, entry));
  2483. }
  2484. #endif /* UNIV_DEBUG */
  2485. n_unique = dict_index_get_n_unique(index);
  2486. if (dict_index_is_unique(index)
  2487. && (cursor.low_match >= n_unique || cursor.up_match >= n_unique)) {
  2488. mtr_commit(&mtr);
  2489. DEBUG_SYNC_C("row_ins_sec_index_unique");
  2490. if (row_ins_sec_mtr_start_and_check_if_aborted(
  2491. &mtr, index, check, search_mode)) {
  2492. goto func_exit;
  2493. }
  2494. err = row_ins_scan_sec_index_for_duplicate(
  2495. flags, index, entry, thr, check, &mtr, offsets_heap);
  2496. mtr_commit(&mtr);
  2497. switch (err) {
  2498. case DB_SUCCESS:
  2499. break;
  2500. case DB_DUPLICATE_KEY:
  2501. if (!index->is_committed()) {
  2502. ut_ad(!thr_get_trx(thr)
  2503. ->dict_operation_lock_mode);
  2504. mutex_enter(&dict_sys.mutex);
  2505. dict_set_corrupted_index_cache_only(index);
  2506. mutex_exit(&dict_sys.mutex);
  2507. /* Do not return any error to the
  2508. caller. The duplicate will be reported
  2509. by ALTER TABLE or CREATE UNIQUE INDEX.
  2510. Unfortunately we cannot report the
  2511. duplicate key value to the DDL thread,
  2512. because the altered_table object is
  2513. private to its call stack. */
  2514. err = DB_SUCCESS;
  2515. }
  2516. /* fall through */
  2517. default:
  2518. if (dict_index_is_spatial(index)) {
  2519. rtr_clean_rtr_info(&rtr_info, true);
  2520. }
  2521. DBUG_RETURN(err);
  2522. }
  2523. if (row_ins_sec_mtr_start_and_check_if_aborted(
  2524. &mtr, index, check, search_mode)) {
  2525. goto func_exit;
  2526. }
  2527. DEBUG_SYNC_C("row_ins_sec_index_entry_dup_locks_created");
  2528. /* We did not find a duplicate and we have now
  2529. locked with s-locks the necessary records to
  2530. prevent any insertion of a duplicate by another
  2531. transaction. Let us now reposition the cursor and
  2532. continue the insertion. */
  2533. btr_cur_search_to_nth_level(
  2534. index, 0, entry, PAGE_CUR_LE,
  2535. (search_mode
  2536. & ~(BTR_INSERT | BTR_IGNORE_SEC_UNIQUE)),
  2537. &cursor, 0, __FILE__, __LINE__, &mtr);
  2538. }
  2539. if (row_ins_must_modify_rec(&cursor)) {
  2540. /* There is already an index entry with a long enough common
  2541. prefix, we must convert the insert into a modify of an
  2542. existing record */
  2543. offsets = rec_get_offsets(
  2544. btr_cur_get_rec(&cursor), index, offsets,
  2545. index->n_core_fields,
  2546. ULINT_UNDEFINED, &offsets_heap);
  2547. err = row_ins_sec_index_entry_by_modify(
  2548. flags, mode, &cursor, &offsets,
  2549. offsets_heap, heap, entry, thr, &mtr);
  2550. if (err == DB_SUCCESS && dict_index_is_spatial(index)
  2551. && rtr_info.mbr_adj) {
  2552. err = rtr_ins_enlarge_mbr(&cursor, &mtr);
  2553. }
  2554. } else {
  2555. rec_t* insert_rec;
  2556. big_rec_t* big_rec;
  2557. if (mode == BTR_MODIFY_LEAF) {
  2558. err = btr_cur_optimistic_insert(
  2559. flags, &cursor, &offsets, &offsets_heap,
  2560. entry, &insert_rec,
  2561. &big_rec, 0, thr, &mtr);
  2562. if (err == DB_SUCCESS
  2563. && dict_index_is_spatial(index)
  2564. && rtr_info.mbr_adj) {
  2565. err = rtr_ins_enlarge_mbr(&cursor, &mtr);
  2566. }
  2567. } else {
  2568. ut_ad(mode == BTR_MODIFY_TREE);
  2569. if (buf_LRU_buf_pool_running_out()) {
  2570. err = DB_LOCK_TABLE_FULL;
  2571. goto func_exit;
  2572. }
  2573. err = btr_cur_optimistic_insert(
  2574. flags, &cursor,
  2575. &offsets, &offsets_heap,
  2576. entry, &insert_rec,
  2577. &big_rec, 0, thr, &mtr);
  2578. if (err == DB_FAIL) {
  2579. err = btr_cur_pessimistic_insert(
  2580. flags, &cursor,
  2581. &offsets, &offsets_heap,
  2582. entry, &insert_rec,
  2583. &big_rec, 0, thr, &mtr);
  2584. }
  2585. if (err == DB_SUCCESS
  2586. && dict_index_is_spatial(index)
  2587. && rtr_info.mbr_adj) {
  2588. err = rtr_ins_enlarge_mbr(&cursor, &mtr);
  2589. }
  2590. }
  2591. if (err == DB_SUCCESS && trx_id) {
  2592. page_update_max_trx_id(
  2593. btr_cur_get_block(&cursor),
  2594. btr_cur_get_page_zip(&cursor),
  2595. trx_id, &mtr);
  2596. }
  2597. ut_ad(!big_rec);
  2598. }
  2599. func_exit:
  2600. if (dict_index_is_spatial(index)) {
  2601. rtr_clean_rtr_info(&rtr_info, true);
  2602. }
  2603. mtr_commit(&mtr);
  2604. DBUG_RETURN(err);
  2605. }
  2606. /***************************************************************//**
  2607. Inserts an entry into a clustered index. Tries first optimistic,
  2608. then pessimistic descent down the tree. If the entry matches enough
  2609. to a delete marked record, performs the insert by updating or delete
  2610. unmarking the delete marked record.
  2611. @return DB_SUCCESS, DB_LOCK_WAIT, DB_DUPLICATE_KEY, or some other error code */
  2612. dberr_t
  2613. row_ins_clust_index_entry(
  2614. /*======================*/
  2615. dict_index_t* index, /*!< in: clustered index */
  2616. dtuple_t* entry, /*!< in/out: index entry to insert */
  2617. que_thr_t* thr, /*!< in: query thread */
  2618. ulint n_ext) /*!< in: number of externally stored columns */
  2619. {
  2620. dberr_t err;
  2621. ulint n_uniq;
  2622. DBUG_ENTER("row_ins_clust_index_entry");
  2623. if (!index->table->foreign_set.empty()) {
  2624. err = row_ins_check_foreign_constraints(
  2625. index->table, index, true, entry, thr);
  2626. if (err != DB_SUCCESS) {
  2627. DBUG_RETURN(err);
  2628. }
  2629. }
  2630. n_uniq = dict_index_is_unique(index) ? index->n_uniq : 0;
  2631. #ifdef WITH_WSREP
  2632. const bool skip_locking
  2633. = wsrep_thd_skip_locking(thr_get_trx(thr)->mysql_thd);
  2634. ulint flags = index->table->no_rollback() ? BTR_NO_ROLLBACK
  2635. : (index->table->is_temporary() || skip_locking)
  2636. ? BTR_NO_LOCKING_FLAG : 0;
  2637. #ifdef UNIV_DEBUG
  2638. if (skip_locking && strcmp(wsrep_get_sr_table_name(),
  2639. index->table->name.m_name)) {
  2640. WSREP_ERROR("Record locking is disabled in this thread, "
  2641. "but the table being modified is not "
  2642. "`%s`: `%s`.", wsrep_get_sr_table_name(),
  2643. index->table->name.m_name);
  2644. ut_error;
  2645. }
  2646. #endif /* UNIV_DEBUG */
  2647. #else
  2648. ulint flags = index->table->no_rollback() ? BTR_NO_ROLLBACK
  2649. : index->table->is_temporary()
  2650. ? BTR_NO_LOCKING_FLAG : 0;
  2651. #endif /* WITH_WSREP */
  2652. const ulint orig_n_fields = entry->n_fields;
  2655. /* For an intermediate table during ALTER TABLE ... ALGORITHM=COPY,
  2656. skip undo logging and record lock checks for the
  2657. insert operation.
  2658. */
  2659. if (index->table->skip_alter_undo) {
  2660. flags |= BTR_NO_UNDO_LOG_FLAG | BTR_NO_LOCKING_FLAG;
  2661. }
  2662. /* Try first optimistic descent to the B-tree */
  2663. log_free_check();
  2664. err = row_ins_clust_index_entry_low(
  2665. flags, BTR_MODIFY_LEAF, index, n_uniq, entry,
  2666. n_ext, thr);
  2667. entry->n_fields = orig_n_fields;
  2668. DEBUG_SYNC_C_IF_THD(thr_get_trx(thr)->mysql_thd,
  2669. "after_row_ins_clust_index_entry_leaf");
  2670. if (err != DB_FAIL) {
  2671. DEBUG_SYNC_C("row_ins_clust_index_entry_leaf_after");
  2672. DBUG_RETURN(err);
  2673. }
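/* DB_FAIL only means that the entry did not fit on the leaf page;
it is never returned to the caller. Retry with a pessimistic descent,
which is allowed to split pages. */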
  2674. /* Try then pessimistic descent to the B-tree */
  2675. log_free_check();
  2676. err = row_ins_clust_index_entry_low(
  2677. flags, BTR_MODIFY_TREE, index, n_uniq, entry,
  2678. n_ext, thr);
  2679. entry->n_fields = orig_n_fields;
  2680. DBUG_RETURN(err);
  2681. }
  2682. /***************************************************************//**
  2683. Inserts an entry into a secondary index. Tries first optimistic,
  2684. then pessimistic descent down the tree. If the entry matches enough
  2685. to a delete marked record, performs the insert by updating or delete
  2686. unmarking the delete marked record.
  2687. @return DB_SUCCESS, DB_LOCK_WAIT, DB_DUPLICATE_KEY, or some other error code */
  2688. dberr_t
  2689. row_ins_sec_index_entry(
  2690. /*====================*/
  2691. dict_index_t* index, /*!< in: secondary index */
  2692. dtuple_t* entry, /*!< in/out: index entry to insert */
  2693. que_thr_t* thr, /*!< in: query thread */
  2694. bool check_foreign) /*!< in: true if foreign key
  2695. constraints should be checked, false otherwise */
  2696. {
  2697. dberr_t err;
  2698. mem_heap_t* offsets_heap;
  2699. mem_heap_t* heap;
  2700. trx_id_t trx_id = 0;
  2701. DBUG_EXECUTE_IF("row_ins_sec_index_entry_timeout", {
  2702. DBUG_SET("-d,row_ins_sec_index_entry_timeout");
  2703. return(DB_LOCK_WAIT);});
  2704. if (check_foreign && !index->table->foreign_set.empty()) {
  2705. err = row_ins_check_foreign_constraints(index->table, index,
  2706. false, entry, thr);
  2707. if (err != DB_SUCCESS) {
  2708. return(err);
  2709. }
  2710. }
  2711. ut_ad(thr_get_trx(thr)->id != 0);
  2712. offsets_heap = mem_heap_create(1024);
  2713. heap = mem_heap_create(1024);
  2714. /* Try first optimistic descent to the B-tree */
  2715. log_free_check();
  2716. ulint flags = index->table->is_temporary()
  2717. ? BTR_NO_LOCKING_FLAG
  2718. : 0;
  2719. /* For an intermediate table during ALTER TABLE ... ALGORITHM=COPY,
  2720. skip undo logging and record lock checks for the
  2721. insert operation.
  2722. */
  2723. if (index->table->skip_alter_undo) {
  2724. trx_id = thr_get_trx(thr)->id;
  2725. flags |= BTR_NO_UNDO_LOG_FLAG | BTR_NO_LOCKING_FLAG;
  2726. }
  2727. err = row_ins_sec_index_entry_low(
  2728. flags, BTR_MODIFY_LEAF, index, offsets_heap, heap, entry,
  2729. trx_id, thr);
  2730. if (err == DB_FAIL) {
  2731. mem_heap_empty(heap);
  2732. if (index->table->space == fil_system.sys_space
  2733. && !(index->type & (DICT_UNIQUE | DICT_SPATIAL))) {
  2734. ibuf_free_excess_pages();
  2735. }
  2736. /* Try then pessimistic descent to the B-tree */
  2737. log_free_check();
  2738. err = row_ins_sec_index_entry_low(
  2739. flags, BTR_MODIFY_TREE, index,
  2740. offsets_heap, heap, entry, 0, thr);
  2741. }
  2742. mem_heap_free(heap);
  2743. mem_heap_free(offsets_heap);
  2744. return(err);
  2745. }
  2746. /***************************************************************//**
  2747. Inserts an index entry to index. Tries first optimistic, then pessimistic
  2748. descent down the tree. If the entry matches enough to a delete marked record,
  2749. performs the insert by updating or delete unmarking the delete marked
  2750. record.
  2751. @return DB_SUCCESS, DB_LOCK_WAIT, DB_DUPLICATE_KEY, or some other error code */
  2752. static
  2753. dberr_t
  2754. row_ins_index_entry(
  2755. /*================*/
  2756. dict_index_t* index, /*!< in: index */
  2757. dtuple_t* entry, /*!< in/out: index entry to insert */
  2758. que_thr_t* thr) /*!< in: query thread */
  2759. {
  2760. ut_ad(thr_get_trx(thr)->id || index->table->no_rollback()
  2761. || index->table->is_temporary());
  2762. DBUG_EXECUTE_IF("row_ins_index_entry_timeout", {
  2763. DBUG_SET("-d,row_ins_index_entry_timeout");
  2764. return(DB_LOCK_WAIT);});
  2765. if (index->is_primary()) {
  2766. return row_ins_clust_index_entry(index, entry, thr, 0);
  2767. } else {
  2768. return row_ins_sec_index_entry(index, entry, thr);
  2769. }
  2770. }
  2771. /*****************************************************************//**
  2772. This function generates the MBR (minimum bounding rectangle) for a spatial
  2773. object and stores it in the spatial index field. */
  2774. static
  2775. void
  2776. row_ins_spatial_index_entry_set_mbr_field(
  2777. /*======================================*/
  2778. dfield_t* field, /*!< in/out: mbr field */
  2779. const dfield_t* row_field) /*!< in: row field */
  2780. {
  2781. ulint dlen = 0;
  2782. double mbr[SPDIMS * 2];
  2783. /* This must be a GEOMETRY datatype */
  2784. ut_ad(DATA_GEOMETRY_MTYPE(field->type.mtype));
  2785. const byte* dptr = static_cast<const byte*>(
  2786. dfield_get_data(row_field));
  2787. dlen = dfield_get_len(row_field);
  2788. /* obtain the MBR */
  2789. rtree_mbr_from_wkb(dptr + GEO_DATA_HEADER_SIZE,
  2790. static_cast<uint>(dlen - GEO_DATA_HEADER_SIZE),
  2791. SPDIMS, mbr);
  2792. /* Set mbr as index entry data */
  2793. dfield_write_mbr(field, mbr);
  2794. }
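/* Sketch: for SPDIMS == 2 the index field ends up holding four
doubles (the minimum and maximum of each dimension) computed from the
WKB geometry; the R-tree stores and compares only this MBR, never the
full geometry value. */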
  2795. /** Sets the values of the dtuple fields in entry from the values of appropriate
  2796. columns in row.
  2797. @param[in] index index handler
  2798. @param[out] entry index entry to make
  2799. @param[in] row row
  2800. @return DB_SUCCESS if successful */
  2801. static
  2802. dberr_t
  2803. row_ins_index_entry_set_vals(
  2804. const dict_index_t* index,
  2805. dtuple_t* entry,
  2806. const dtuple_t* row)
  2807. {
  2808. ulint n_fields;
  2809. ulint i;
  2810. ulint num_v = dtuple_get_n_v_fields(entry);
  2811. n_fields = dtuple_get_n_fields(entry);
  2812. for (i = 0; i < n_fields + num_v; i++) {
  2813. dict_field_t* ind_field = NULL;
  2814. dfield_t* field;
  2815. const dfield_t* row_field;
  2816. ulint len;
  2817. dict_col_t* col;
  2818. if (i >= n_fields) {
  2819. /* This is virtual field */
  2820. field = dtuple_get_nth_v_field(entry, i - n_fields);
  2821. col = &dict_table_get_nth_v_col(
  2822. index->table, i - n_fields)->m_col;
  2823. } else {
  2824. field = dtuple_get_nth_field(entry, i);
  2825. ind_field = dict_index_get_nth_field(index, i);
  2826. col = ind_field->col;
  2827. }
  2828. if (col->is_virtual()) {
  2829. const dict_v_col_t* v_col
  2830. = reinterpret_cast<const dict_v_col_t*>(col);
  2831. ut_ad(dtuple_get_n_fields(row)
  2832. == dict_table_get_n_cols(index->table));
  2833. row_field = dtuple_get_nth_v_field(row, v_col->v_pos);
  2834. } else if (col->is_dropped()) {
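			/* The column was dropped instantly
			(ALTER TABLE...ALGORITHM=INSTANT). New clustered
			index records still carry a placeholder for it:
			NULL if the column was nullable, otherwise a
			zero-filled value of the fixed length. */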
			ut_ad(index->is_primary());

			if (!(col->prtype & DATA_NOT_NULL)) {
				field->data = NULL;
				field->len = UNIV_SQL_NULL;
				field->type.prtype = DATA_BINARY_TYPE;
			} else {
				ut_ad(col->len <= sizeof field_ref_zero);
				ut_ad(ind_field->fixed_len <= col->len);
				dfield_set_data(field, field_ref_zero,
						ind_field->fixed_len);
				field->type.prtype = DATA_NOT_NULL;
			}

			field->type.mtype = col->len
				? DATA_FIXBINARY : DATA_BINARY;
			continue;
		} else {
			row_field = dtuple_get_nth_field(
				row, ind_field->col->ind);
		}

		len = dfield_get_len(row_field);

		/* Check column prefix indexes */
		if (ind_field != NULL && ind_field->prefix_len > 0
		    && len != UNIV_SQL_NULL) {

			const dict_col_t*	col
				= dict_field_get_col(ind_field);

			len = dtype_get_at_most_n_mbchars(
				col->prtype, col->mbminlen, col->mbmaxlen,
				ind_field->prefix_len,
				len,
				static_cast<const char*>(
					dfield_get_data(row_field)));

			ut_ad(!dfield_is_ext(row_field));
		}

		/* Handle spatial index. For the first field, replace
		the data with its MBR (Minimum Bounding Box). */
		if ((i == 0) && dict_index_is_spatial(index)) {
			if (!row_field->data
			    || row_field->len < GEO_DATA_HEADER_SIZE) {
				return(DB_CANT_CREATE_GEOMETRY_OBJECT);
			}
			row_ins_spatial_index_entry_set_mbr_field(
				field, row_field);
			continue;
		}

		dfield_set_data(field, dfield_get_data(row_field), len);
		if (dfield_is_ext(row_field)) {
			ut_ad(dict_index_is_clust(index));
			dfield_set_ext(field);
		}
	}

	return(DB_SUCCESS);
}

/***********************************************************//**
Inserts a single index entry into the table.
@return DB_SUCCESS if operation successfully completed, else error
code or DB_LOCK_WAIT */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
dberr_t
row_ins_index_entry_step(
/*=====================*/
	ins_node_t*	node,	/*!< in: row insert node */
	que_thr_t*	thr)	/*!< in: query thread */
{
	dberr_t	err;

	DBUG_ENTER("row_ins_index_entry_step");

	ut_ad(dtuple_check_typed(node->row));

	err = row_ins_index_entry_set_vals(node->index, *node->entry,
					   node->row);

	if (err != DB_SUCCESS) {
		DBUG_RETURN(err);
	}

	ut_ad(dtuple_check_typed(*node->entry));

	err = row_ins_index_entry(node->index, *node->entry, thr);

	DEBUG_SYNC_C_IF_THD(thr_get_trx(thr)->mysql_thd,
			    "after_row_ins_index_entry_step");

	DBUG_RETURN(err);
}

/***********************************************************//**
Allocates a row id for the row if the clustered index is not unique. */
UNIV_INLINE
void
row_ins_alloc_row_id_step(
/*======================*/
	ins_node_t*	node)	/*!< in: row insert node */
{
	row_id_t	row_id;

	ut_ad(node->state == INS_NODE_ALLOC_ROW_ID);

	if (dict_index_is_unique(dict_table_get_first_index(node->table))) {
		/* No row id is stored if the clustered index is unique */
		return;
	}

	/* Fill in the row id value for the row */
	row_id = dict_sys_get_new_row_id();

	dict_sys_write_row_id(node->sys_buf, row_id);
}

/***********************************************************//**
Gets a row to insert from the values list. */
UNIV_INLINE
void
row_ins_get_row_from_values(
/*========================*/
	ins_node_t*	node)	/*!< in: row insert node */
{
	que_node_t*	list_node;
	dfield_t*	dfield;
	dtuple_t*	row;
	ulint		i;

	/* The field values are copied into the buffers of the
	expression nodes and it is safe to use them until the
	expressions are evaluated again: therefore we can just
	copy the pointers */

	row = node->row;

	i = 0;
	list_node = node->values_list;

	while (list_node) {
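		/* Evaluate the expression into the buffer of
		list_node; dfield_copy_data() below only copies
		a pointer to that buffer, not the value itself. */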
		eval_exp(list_node);

		dfield = dtuple_get_nth_field(row, i);
		dfield_copy_data(dfield, que_node_get_val(list_node));

		i++;
		list_node = que_node_get_next(list_node);
	}
}

/***********************************************************//**
Gets a row to insert from the select list. */
UNIV_INLINE
void
row_ins_get_row_from_select(
/*========================*/
	ins_node_t*	node)	/*!< in: row insert node */
{
	que_node_t*	list_node;
	dfield_t*	dfield;
	dtuple_t*	row;
	ulint		i;

	/* The field values are copied in the buffers of the select node and
	it is safe to use them until we fetch from select again: therefore
	we can just copy the pointers */

	row = node->row;

	i = 0;
	list_node = node->select->select_list;

	while (list_node) {
		dfield = dtuple_get_nth_field(row, i);
		dfield_copy_data(dfield, que_node_get_val(list_node));

		i++;
		list_node = que_node_get_next(list_node);
	}
}
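
/** @return whether the row being inserted is a history row of a
system-versioned table, that is, its row_end column indicates that
this row version is no longer current */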
inline
bool ins_node_t::vers_history_row() const
{
	if (!table->versioned())
		return false;
	dfield_t* row_end = dtuple_get_nth_field(row, table->vers_end);
	return row_end->vers_history_row();
}

/***********************************************************//**
Inserts a row into a table.
@return DB_SUCCESS if operation successfully completed, else error
code or DB_LOCK_WAIT */
static MY_ATTRIBUTE((nonnull, warn_unused_result))
dberr_t
row_ins(
/*====*/
	ins_node_t*	node,	/*!< in: row insert node */
	que_thr_t*	thr)	/*!< in: query thread */
{
	DBUG_ENTER("row_ins");

	DBUG_PRINT("row_ins", ("table: %s", node->table->name.m_name));

	if (node->state == INS_NODE_ALLOC_ROW_ID) {

		row_ins_alloc_row_id_step(node);

		node->index = dict_table_get_first_index(node->table);
		ut_ad(node->entry_list.empty() == false);
		node->entry = node->entry_list.begin();

		if (node->ins_type == INS_SEARCHED) {

			row_ins_get_row_from_select(node);

		} else if (node->ins_type == INS_VALUES) {

			row_ins_get_row_from_values(node);
		}

		node->state = INS_NODE_INSERT_ENTRIES;
	}

	ut_ad(node->state == INS_NODE_INSERT_ENTRIES);
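
	/* Insert the row into each index of the table in turn.
	node->index and node->entry track the progress, so that the
	operation can be resumed after a lock wait. */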
	while (node->index != NULL) {
		dict_index_t*	index = node->index;
		/*
		We do not insert history rows into FTS_DOC_ID_INDEX
		because it is unique by FTS_DOC_ID only and we do not
		want to add row_end to the unique key. Fulltext works
		in such a way that a new FTS_DOC_ID is created on every
		fulltext UPDATE, so keeping only the FTS_DOC_ID for
		history is enough.
		*/
		const unsigned type = index->type;
		if (index->type & DICT_FTS) {
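			/* Skip the fulltext index here: fulltext
			index data is maintained separately via the
			FTS auxiliary tables, not by inserting entries
			through the B-tree code path below. */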
		} else if (!(type & DICT_UNIQUE) || index->n_uniq > 1
			   || !node->vers_history_row()) {

			dberr_t	err = row_ins_index_entry_step(node, thr);

			if (err != DB_SUCCESS) {
				DBUG_RETURN(err);
			}
		} else {
			/* Unique indexes with system versioning must contain
			the version end column. The only exception is a hidden
			FTS_DOC_ID_INDEX that InnoDB may create on a hidden or
			user-created FTS_DOC_ID column. */
			ut_ad(!strcmp(index->name, FTS_DOC_ID_INDEX_NAME));
			ut_ad(!strcmp(index->fields[0].name,
				      FTS_DOC_ID_COL_NAME));
		}

		node->index = dict_table_get_next_index(node->index);
		++node->entry;

		/* Skip corrupted secondary index and its entry */
		while (node->index && node->index->is_corrupted()) {

			node->index = dict_table_get_next_index(node->index);
			++node->entry;
		}
	}

	ut_ad(node->entry == node->entry_list.end());

	node->state = INS_NODE_ALLOC_ROW_ID;

	DBUG_RETURN(DB_SUCCESS);
}

/***********************************************************//**
Inserts a row into a table. This is a high-level function used in SQL
execution graphs.
@return query thread to run next or NULL */
que_thr_t*
row_ins_step(
/*=========*/
	que_thr_t*	thr)	/*!< in: query thread */
{
	ins_node_t*	node;
	que_node_t*	parent;
	sel_node_t*	sel_node;
	trx_t*		trx;
	dberr_t		err;

	ut_ad(thr);

	DEBUG_SYNC_C("innodb_row_ins_step_enter");

	trx = thr_get_trx(thr);

	node = static_cast<ins_node_t*>(thr->run_node);

	ut_ad(que_node_get_type(node) == QUE_NODE_INSERT);

	parent = que_node_get_parent(node);
	sel_node = node->select;

	if (thr->prev_node == parent) {
		node->state = INS_NODE_SET_IX_LOCK;
	}

	/* If this is the first time this node is executed (or when
	execution resumes after wait for the table IX lock), set an
	IX lock on the table and reset the possible select node. MySQL's
	partitioned table code may also call an insert within the same
	SQL statement AFTER it has used this table handle to do a search.
	This happens, for example, when a row update moves it to another
	partition. In that case, we have already set the IX lock on the
	table during the search operation, and there is no need to set
	it again here. But we must write trx->id to node->sys_buf. */

	if (node->table->no_rollback()) {
		/* No-rollback tables should only be written to by a
		single thread at a time, but there can be multiple
		concurrent readers. We must hold an open table handle. */
		DBUG_ASSERT(node->table->get_ref_count() > 0);
		DBUG_ASSERT(node->ins_type == INS_DIRECT);
		/* No-rollback tables can consist only of a single index. */
		DBUG_ASSERT(node->entry_list.size() == 1);
		DBUG_ASSERT(UT_LIST_GET_LEN(node->table->indexes) == 1);
		/* There should be no possibility for interruption and
		restarting here. In theory, we could allow resumption
		from the INS_NODE_INSERT_ENTRIES state here. */
		DBUG_ASSERT(node->state == INS_NODE_SET_IX_LOCK);
		node->index = dict_table_get_first_index(node->table);
		node->entry = node->entry_list.begin();
		node->state = INS_NODE_INSERT_ENTRIES;
		goto do_insert;
	}
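
	/* Store the transaction identifier in the DB_TRX_ID system
	column buffer of the row, unless undo logging was suppressed
	for an intermediate ALTER TABLE...ALGORITHM=COPY table. */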
	if (UNIV_LIKELY(!node->table->skip_alter_undo)) {
		trx_write_trx_id(&node->sys_buf[DATA_TRX_ID_LEN], trx->id);
	}

	if (node->state == INS_NODE_SET_IX_LOCK) {

		node->state = INS_NODE_ALLOC_ROW_ID;

		if (node->table->is_temporary()) {
			node->trx_id = trx->id;
		}

		/* It may be that the current session has not yet
		started its transaction, or it has been committed: */

		if (trx->id == node->trx_id) {
			/* No need to do IX-locking */

			goto same_trx;
		}

		err = lock_table(0, node->table, LOCK_IX, thr);

		DBUG_EXECUTE_IF("ib_row_ins_ix_lock_wait",
				err = DB_LOCK_WAIT;);

		if (err != DB_SUCCESS) {

			goto error_handling;
		}

		node->trx_id = trx->id;
same_trx:
		if (node->ins_type == INS_SEARCHED) {
			/* Reset the cursor */
			sel_node->state = SEL_NODE_OPEN;

			/* Fetch a row to insert */

			thr->run_node = sel_node;

			return(thr);
		}
	}
	if ((node->ins_type == INS_SEARCHED)
	    && (sel_node->state != SEL_NODE_FETCH)) {

		ut_ad(sel_node->state == SEL_NODE_NO_MORE_ROWS);

		/* No more rows to insert */
		thr->run_node = parent;

		return(thr);
	}
do_insert:
	/* DO THE CHECKS OF THE CONSISTENCY CONSTRAINTS HERE */

	err = row_ins(node, thr);

error_handling:
	trx->error_state = err;

	if (err != DB_SUCCESS) {
		/* err == DB_LOCK_WAIT or SQL error detected */
		return(NULL);
	}

	/* DO THE TRIGGER ACTIONS HERE */

	if (node->ins_type == INS_SEARCHED) {
		/* Fetch a row to insert */

		thr->run_node = sel_node;
	} else {
		thr->run_node = que_node_get_parent(node);
	}

	return(thr);
}