Bug#12704861 Corruption after a crash during BLOB update

The fix of Bug#12612184 broke crash recovery. When a record that contains off-page columns (BLOBs) is updated, we must first write redo log about the BLOB page writes, and only after that write the redo log about the B-tree changes. The buggy fix would log the B-tree changes first, meaning that after recovery, we could end up having a record that contains a null BLOB pointer.

Because we will be redo logging the writes of the off-page columns before the B-tree changes, we must make sure that the pages chosen for the off-page columns are free both before and after the B-tree changes. In this way, the worst thing that can happen in crash recovery is that the BLOBs are written to free pages, but the B-tree changes are not applied. The BLOB pages would correctly remain free in this case. To achieve this, we must allocate the BLOB pages in the mini-transaction of the B-tree operation.

A further quirk is that BLOB pages are allocated from the same file segment as leaf pages. Because of this, we must temporarily "hide" any leaf pages that were freed during the B-tree operation by "fake allocating" them prior to writing the BLOBs, and freeing them again before the mtr_commit() of the B-tree operation, in btr_mark_freed_leaves().

btr_cur_mtr_commit_and_start(): Remove this faulty function that was introduced in the Bug#12612184 fix. The problem that this function was trying to address was that when we did mtr_commit() the BLOB writes before the mtr_commit() of the update, the new BLOB pages could have overwritten clustered index B-tree leaf pages that were freed during the update. If recovery applied the redo log of the BLOB writes but did not see the log of the record update, the index tree would be corrupted. The correct solution is to make the freed clustered index pages unavailable to the BLOB allocation. This function is also a likely culprit of InnoDB hangs that were observed when testing the Bug#12612184 fix.

btr_mark_freed_leaves(): Mark all freed clustered index leaf pages of a mini-transaction allocated (nonfree=TRUE) before storing the BLOBs, or freed (nonfree=FALSE) before committing the mini-transaction.

btr_freed_leaves_validate(): A debug function for checking that all clustered index leaf pages that have been marked free in the mini-transaction are consistent (have not been zeroed out).

btr_page_alloc_low(): Refactored from btr_page_alloc(). Return the number of the allocated page, or FIL_NULL if out of space. Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized, or NULL if this is a "fake allocation" performed by btr_mark_freed_leaves(nonfree=TRUE).

btr_page_alloc(): Add the parameter init_mtr, allowing the page to be initialized and X-latched in a different mini-transaction than the one that is used for the allocation. Invoke btr_page_alloc_low(). If a clustered index leaf page was previously freed in mtr, remove it from the memo of previously freed pages.

btr_page_free(): Assert that the page is a B-tree page and that it has been X-latched by the mini-transaction. If the freed page was a leaf page of a clustered index, link it by a MTR_MEMO_FREE_CLUST_LEAF marker to the mini-transaction.

btr_store_big_rec_extern_fields_func(): Add the parameter alloc_mtr, which is NULL (old behaviour, in inserts) or the same as local_mtr (in updates). If alloc_mtr!=NULL, the BLOB pages will be allocated from it instead of the mini-transaction that is used for writing the BLOBs.
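For illustration, a heavily simplified sketch of the call sequence this establishes in the update path. This is not verbatim InnoDB source: the argument lists are abbreviated, error handling is omitted, and only the ordering matters.

    /* Sketch: storing off-page columns inside the B-tree
    mini-transaction, per the protocol described above. */
    if (big_rec) {
        /* Hide the leaf pages freed by the B-tree update so
        the BLOB allocator cannot reuse them. */
        btr_mark_freed_leaves(index, mtr, TRUE);

        /* Write the BLOB pages. Passing the B-tree mtr as
        alloc_mtr makes the BLOB page allocations part of the
        B-tree mini-transaction, so the BLOB writes reach the
        redo log before the B-tree changes are committed. */
        btr_store_big_rec_extern_fields(index, rec_block, rec,
                                        offsets, big_rec, mtr);

        /* Free the fake allocations again, so the freed leaf
        pages are not leaked. */
        btr_mark_freed_leaves(index, mtr, FALSE);
    }

    mtr_commit(mtr);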
fsp_alloc_from_free_frag(): Refactored from fsp_alloc_free_page(). Allocate the specified page from a partially free extent.

fseg_alloc_free_page_low(), fseg_alloc_free_page_general(): Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized, or NULL if this is a "fake allocation" that prevents the reuse of a previously freed B-tree page for BLOB storage. If init_mtr==NULL, try harder to reallocate the specified page and assert that it succeeded.

fsp_alloc_free_page(): Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized. Do not allow init_mtr == NULL, because this function is never to be used for "fake allocations".

mtr_t: Add the operation MTR_MEMO_FREE_CLUST_LEAF and the flag mtr->freed_clust_leaf for quickly determining if any MTR_MEMO_FREE_CLUST_LEAF operations have been posted.

row_ins_index_entry_low(): When columns are being made off-page in insert-by-update, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.

row_build(): Correct a comment, and add a debug assertion that a record that contains NULL BLOB pointers must be a fresh insert.

row_upd_clust_rec(): When columns are being moved off-page, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.

buf_reset_check_index_page_at_flush(): Remove. The function fsp_init_file_page_low() already sets bpage->check_index_page_at_flush=FALSE.

There is a known issue in tablespace extension. If the request to allocate a BLOB page leads to the tablespace being extended, crash recovery could see BLOB writes to pages that are off the tablespace file bounds. This should trigger an assertion failure in fil_io() at crash recovery. The safe thing would be to write redo log about the tablespace extension to the mini-transaction of the BLOB write, not to the mini-transaction of the record update. However, there is no redo log record for file extension in the current redo log format.

rb:693 approved by Sunny Bains
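As a further illustration of the new MTR_MEMO_FREE_CLUST_LEAF bookkeeping, here is a hedged sketch of the additions to btr_page_free() described above. This is not the actual patch; the surrounding control flow is abbreviated, though the helper functions named (ut_ad, mtr_memo_contains, mtr_memo_push, dict_index_is_clust) are standard InnoDB primitives.

    /* Sketch: assert the preconditions and remember freed
    clustered index leaf pages in the mini-transaction memo. */
    ut_ad(fil_page_get_type(buf_block_get_frame(block))
          == FIL_PAGE_INDEX);
    ut_ad(mtr_memo_contains(mtr, block, MTR_MEMO_PAGE_X_FIX));

    if (level == 0 && dict_index_is_clust(index)) {
        /* Remember the freed leaf page, so that
        btr_mark_freed_leaves(nonfree=TRUE) can fake-allocate
        it before the BLOBs are stored, and free it again
        before mtr_commit(). */
        mtr_memo_push(mtr, block, MTR_MEMO_FREE_CLUST_LEAF);
        mtr->freed_clust_leaf = TRUE;
    }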
fsp_alloc_from_free_frag(): Refactored from fsp_alloc_free_page(). Allocate the specified page from a partially free extent. fseg_alloc_free_page_low(), fseg_alloc_free_page_general(): Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized, or NULL if this is a "fake allocation" that prevents the reuse of a previously freed B-tree page for BLOB storage. If init_mtr==NULL, try harder to reallocate the specified page and assert that it succeeded. fsp_alloc_free_page(): Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized. Do not allow init_mtr == NULL, because this function is never to be used for "fake allocations". mtr_t: Add the operation MTR_MEMO_FREE_CLUST_LEAF and the flag mtr->freed_clust_leaf for quickly determining if any MTR_MEMO_FREE_CLUST_LEAF operations have been posted. row_ins_index_entry_low(): When columns are being made off-page in insert-by-update, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages. row_build(): Correct a comment, and add a debug assertion that a record that contains NULL BLOB pointers must be a fresh insert. row_upd_clust_rec(): When columns are being moved off-page, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages. buf_reset_check_index_page_at_flush(): Remove. The function fsp_init_file_page_low() already sets bpage->check_index_page_at_flush=FALSE. There is a known issue in tablespace extension. If the request to allocate a BLOB page leads to the tablespace being extended, crash recovery could see BLOB writes to pages that are off the tablespace file bounds. This should trigger an assertion failure in fil_io() at crash recovery. The safe thing would be to write redo log about the tablespace extension to the mini-transaction of the BLOB write, not to the mini-transaction of the record update. However, there is no redo log record for file extension in the current redo log format. rb:693 approved by Sunny Bains
14 years ago
Bug#12704861 Corruption after a crash during BLOB update The fix of Bug#12612184 broke crash recovery. When a record that contains off-page columns (BLOBs) is updated, we must first write redo log about the BLOB page writes, and only after that write the redo log about the B-tree changes. The buggy fix would log the B-tree changes first, meaning that after recovery, we could end up having a record that contains a null BLOB pointer. Because we will be redo logging the writes off the off-page columns before the B-tree changes, we must make sure that the pages chosen for the off-page columns are free both before and after the B-tree changes. In this way, the worst thing that can happen in crash recovery is that the BLOBs are written to free pages, but the B-tree changes are not applied. The BLOB pages would correctly remain free in this case. To achieve this, we must allocate the BLOB pages in the mini-transaction of the B-tree operation. A further quirk is that BLOB pages are allocated from the same file segment as leaf pages. Because of this, we must temporarily "hide" any leaf pages that were freed during the B-tree operation by "fake allocating" them prior to writing the BLOBs, and freeing them again before the mtr_commit() of the B-tree operation, in btr_mark_freed_leaves(). btr_cur_mtr_commit_and_start(): Remove this faulty function that was introduced in the Bug#12612184 fix. The problem that this function was trying to address was that when we did mtr_commit() the BLOB writes before the mtr_commit() of the update, the new BLOB pages could have overwritten clustered index B-tree leaf pages that were freed during the update. If recovery applied the redo log of the BLOB writes but did not see the log of the record update, the index tree would be corrupted. The correct solution is to make the freed clustered index pages unavailable to the BLOB allocation. This function is also a likely culprit of InnoDB hangs that were observed when testing the Bug#12612184 fix. btr_mark_freed_leaves(): Mark all freed clustered index leaf pages of a mini-transaction allocated (nonfree=TRUE) before storing the BLOBs, or freed (nonfree=FALSE) before committing the mini-transaction. btr_freed_leaves_validate(): A debug function for checking that all clustered index leaf pages that have been marked free in the mini-transaction are consistent (have not been zeroed out). btr_page_alloc_low(): Refactored from btr_page_alloc(). Return the number of the allocated page, or FIL_NULL if out of space. Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized, or if this is a "fake allocation" (init_mtr=NULL) by btr_mark_freed_leaves(nonfree=TRUE). btr_page_alloc(): Add the parameter init_mtr, allowing the page to be initialized and X-latched in a different mini-transaction than the one that is used for the allocation. Invoke btr_page_alloc_low(). If a clustered index leaf page was previously freed in mtr, remove it from the memo of previously freed pages. btr_page_free(): Assert that the page is a B-tree page and it has been X-latched by the mini-transaction. If the freed page was a leaf page of a clustered index, link it by a MTR_MEMO_FREE_CLUST_LEAF marker to the mini-transaction. btr_store_big_rec_extern_fields_func(): Add the parameter alloc_mtr, which is NULL (old behaviour in inserts) and the same as local_mtr in updates. If alloc_mtr!=NULL, the BLOB pages will be allocated from it instead of the mini-transaction that is used for writing the BLOBs. 
fsp_alloc_from_free_frag(): Refactored from fsp_alloc_free_page(). Allocate the specified page from a partially free extent. fseg_alloc_free_page_low(), fseg_alloc_free_page_general(): Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized, or NULL if this is a "fake allocation" that prevents the reuse of a previously freed B-tree page for BLOB storage. If init_mtr==NULL, try harder to reallocate the specified page and assert that it succeeded. fsp_alloc_free_page(): Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized. Do not allow init_mtr == NULL, because this function is never to be used for "fake allocations". mtr_t: Add the operation MTR_MEMO_FREE_CLUST_LEAF and the flag mtr->freed_clust_leaf for quickly determining if any MTR_MEMO_FREE_CLUST_LEAF operations have been posted. row_ins_index_entry_low(): When columns are being made off-page in insert-by-update, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages. row_build(): Correct a comment, and add a debug assertion that a record that contains NULL BLOB pointers must be a fresh insert. row_upd_clust_rec(): When columns are being moved off-page, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages. buf_reset_check_index_page_at_flush(): Remove. The function fsp_init_file_page_low() already sets bpage->check_index_page_at_flush=FALSE. There is a known issue in tablespace extension. If the request to allocate a BLOB page leads to the tablespace being extended, crash recovery could see BLOB writes to pages that are off the tablespace file bounds. This should trigger an assertion failure in fil_io() at crash recovery. The safe thing would be to write redo log about the tablespace extension to the mini-transaction of the BLOB write, not to the mini-transaction of the record update. However, there is no redo log record for file extension in the current redo log format. rb:693 approved by Sunny Bains
14 years ago
Bug#12704861 Corruption after a crash during BLOB update The fix of Bug#12612184 broke crash recovery. When a record that contains off-page columns (BLOBs) is updated, we must first write redo log about the BLOB page writes, and only after that write the redo log about the B-tree changes. The buggy fix would log the B-tree changes first, meaning that after recovery, we could end up having a record that contains a null BLOB pointer. Because we will be redo logging the writes off the off-page columns before the B-tree changes, we must make sure that the pages chosen for the off-page columns are free both before and after the B-tree changes. In this way, the worst thing that can happen in crash recovery is that the BLOBs are written to free pages, but the B-tree changes are not applied. The BLOB pages would correctly remain free in this case. To achieve this, we must allocate the BLOB pages in the mini-transaction of the B-tree operation. A further quirk is that BLOB pages are allocated from the same file segment as leaf pages. Because of this, we must temporarily "hide" any leaf pages that were freed during the B-tree operation by "fake allocating" them prior to writing the BLOBs, and freeing them again before the mtr_commit() of the B-tree operation, in btr_mark_freed_leaves(). btr_cur_mtr_commit_and_start(): Remove this faulty function that was introduced in the Bug#12612184 fix. The problem that this function was trying to address was that when we did mtr_commit() the BLOB writes before the mtr_commit() of the update, the new BLOB pages could have overwritten clustered index B-tree leaf pages that were freed during the update. If recovery applied the redo log of the BLOB writes but did not see the log of the record update, the index tree would be corrupted. The correct solution is to make the freed clustered index pages unavailable to the BLOB allocation. This function is also a likely culprit of InnoDB hangs that were observed when testing the Bug#12612184 fix. btr_mark_freed_leaves(): Mark all freed clustered index leaf pages of a mini-transaction allocated (nonfree=TRUE) before storing the BLOBs, or freed (nonfree=FALSE) before committing the mini-transaction. btr_freed_leaves_validate(): A debug function for checking that all clustered index leaf pages that have been marked free in the mini-transaction are consistent (have not been zeroed out). btr_page_alloc_low(): Refactored from btr_page_alloc(). Return the number of the allocated page, or FIL_NULL if out of space. Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized, or if this is a "fake allocation" (init_mtr=NULL) by btr_mark_freed_leaves(nonfree=TRUE). btr_page_alloc(): Add the parameter init_mtr, allowing the page to be initialized and X-latched in a different mini-transaction than the one that is used for the allocation. Invoke btr_page_alloc_low(). If a clustered index leaf page was previously freed in mtr, remove it from the memo of previously freed pages. btr_page_free(): Assert that the page is a B-tree page and it has been X-latched by the mini-transaction. If the freed page was a leaf page of a clustered index, link it by a MTR_MEMO_FREE_CLUST_LEAF marker to the mini-transaction. btr_store_big_rec_extern_fields_func(): Add the parameter alloc_mtr, which is NULL (old behaviour in inserts) and the same as local_mtr in updates. If alloc_mtr!=NULL, the BLOB pages will be allocated from it instead of the mini-transaction that is used for writing the BLOBs. 
fsp_alloc_from_free_frag(): Refactored from fsp_alloc_free_page(). Allocate the specified page from a partially free extent. fseg_alloc_free_page_low(), fseg_alloc_free_page_general(): Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized, or NULL if this is a "fake allocation" that prevents the reuse of a previously freed B-tree page for BLOB storage. If init_mtr==NULL, try harder to reallocate the specified page and assert that it succeeded. fsp_alloc_free_page(): Add the parameter "mtr_t* init_mtr" for specifying the mini-transaction where the page should be initialized. Do not allow init_mtr == NULL, because this function is never to be used for "fake allocations". mtr_t: Add the operation MTR_MEMO_FREE_CLUST_LEAF and the flag mtr->freed_clust_leaf for quickly determining if any MTR_MEMO_FREE_CLUST_LEAF operations have been posted. row_ins_index_entry_low(): When columns are being made off-page in insert-by-update, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages. row_build(): Correct a comment, and add a debug assertion that a record that contains NULL BLOB pointers must be a fresh insert. row_upd_clust_rec(): When columns are being moved off-page, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages. buf_reset_check_index_page_at_flush(): Remove. The function fsp_init_file_page_low() already sets bpage->check_index_page_at_flush=FALSE. There is a known issue in tablespace extension. If the request to allocate a BLOB page leads to the tablespace being extended, crash recovery could see BLOB writes to pages that are off the tablespace file bounds. This should trigger an assertion failure in fil_io() at crash recovery. The safe thing would be to write redo log about the tablespace extension to the mini-transaction of the BLOB write, not to the mini-transaction of the record update. However, there is no redo log record for file extension in the current redo log format. rb:693 approved by Sunny Bains
14 years ago
Applying InnoDB Plugin 1.0.5 snapshot, part 2
From r5639 to r5685
Detailed revision comments:

r5639 | marko | 2009-08-06 05:39:34 -0500 (Thu, 06 Aug 2009) | 3 lines
branches/zip: mem_heap_block_free(): If innodb_use_sys_malloc is set, do not tell Valgrind that the memory is free, to avoid a bogus warning in Valgrind's built-in free() hook.

r5642 | calvin | 2009-08-06 18:04:03 -0500 (Thu, 06 Aug 2009) | 2 lines
branches/zip: remove duplicate "the" in comments.

r5662 | marko | 2009-08-11 04:54:16 -0500 (Tue, 11 Aug 2009) | 1 line
branches/zip: Bump the version number to 1.0.5 after releasing 1.0.4.

r5663 | marko | 2009-08-11 06:42:37 -0500 (Tue, 11 Aug 2009) | 2 lines
branches/zip: trx_general_rollback_for_mysql(): Remove the redundant parameter partial. If savept==NULL, partial==FALSE.

r5670 | marko | 2009-08-12 08:16:37 -0500 (Wed, 12 Aug 2009) | 2 lines
branches/zip: trx_undo_rec_copy(): Add const qualifier to undo_rec. This is a non-functional change.

r5671 | marko | 2009-08-13 03:46:33 -0500 (Thu, 13 Aug 2009) | 5 lines
branches/zip: ha_innobase::add_index(): Fix Bug #46557: after a successful operation, read innodb_table->flags from the newly created table object, not from the old one that was just freed. Approved by Sunny.

r5681 | sunny | 2009-08-14 01:16:24 -0500 (Fri, 14 Aug 2009) | 3 lines
branches/zip: When building HotBackup, srv_use_sys_malloc is #ifdef'ed out. We move access to this variable within a !UNIV_HOTBACKUP block.

r5684 | sunny | 2009-08-20 03:05:30 -0500 (Thu, 20 Aug 2009) | 10 lines
branches/zip: Fix bug# 46650: Innodb assertion autoinc_lock == lock in lock_table_remove_low on INSERT SELECT. We only store the autoinc locks that are granted in the transaction's autoinc lock vector. A transaction that has been rolled back due to a deadlock caused by an AUTOINC lock attempt will not have added that lock to the vector. We need to check for that when we remove that lock. rb://145 Approved by Marko.

r5685 | sunny | 2009-08-20 03:18:29 -0500 (Thu, 20 Aug 2009) | 2 lines
branches/zip: Update the ChangeLog with r5684 change.
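The r5684 fix above rests on a single invariant: a transaction records an AUTOINC lock in its autoinc lock vector only once the lock has actually been granted, so the release path must tolerate a lock that never made it into the vector. The following is a minimal, self-contained sketch of that membership check; every name in it (trx_sketch_t, autoinc_vec_t, trx_remove_autoinc, ...) is hypothetical and only illustrates the idea, not the actual lock0lock.c interface.

#include <stdio.h>

typedef struct { void* locks[8]; int n; } autoinc_vec_t; /* granted AUTOINC locks */
typedef struct { autoinc_vec_t autoinc_locks; } trx_sketch_t;

/* Record a lock only once it has been granted. */
static void trx_record_granted_autoinc(trx_sketch_t* trx, void* lock)
{
	trx->autoinc_locks.locks[trx->autoinc_locks.n++] = lock;
}

/* Release path: instead of assuming the lock must have been recorded,
first check whether it is in the vector at all. */
static void trx_remove_autoinc(trx_sketch_t* trx, void* lock)
{
	int i;
	for (i = 0; i < trx->autoinc_locks.n; i++) {
		if (trx->autoinc_locks.locks[i] == lock) {
			/* shift the tail down to keep the vector dense */
			for (; i + 1 < trx->autoinc_locks.n; i++) {
				trx->autoinc_locks.locks[i]
					= trx->autoinc_locks.locks[i + 1];
			}
			trx->autoinc_locks.n--;
			return;
		}
	}
	/* Not found: the lock request was rolled back by deadlock
	resolution before it was granted, so there is nothing to
	remove. */
}

int main(void)
{
	trx_sketch_t trx = { { { 0 }, 0 } };
	int granted = 0, waited = 0;
	trx_record_granted_autoinc(&trx, &granted);
	trx_remove_autoinc(&trx, &waited);   /* never granted: a no-op */
	trx_remove_autoinc(&trx, &granted);  /* granted: removed */
	printf("remaining autoinc locks: %d\n", trx.autoinc_locks.n);
	return 0;
}

The quiet no-op on a miss is the point: asserting that the lock being released must be the one recorded is exactly the assumption that bug# 46650 showed to be false.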
/*****************************************************************************
Copyright (c) 1996, 2011, Oracle and/or its affiliates. All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA
*****************************************************************************/
/**************************************************//**
@file trx/trx0rec.c
Transaction undo log record
Created 3/26/1996 Heikki Tuuri
*******************************************************/
#include "trx0rec.h"
#ifdef UNIV_NONINL
#include "trx0rec.ic"
#endif
#include "fsp0fsp.h"
#include "mach0data.h"
#include "trx0undo.h"
#include "mtr0log.h"
#ifndef UNIV_HOTBACKUP
#include "dict0dict.h"
#include "ut0mem.h"
#include "row0ext.h"
#include "row0upd.h"
#include "que0que.h"
#include "trx0purge.h"
#include "trx0rseg.h"
#include "row0row.h"
/*=========== UNDO LOG RECORD CREATION AND DECODING ====================*/
/**********************************************************************//**
Writes the mtr log entry of the inserted undo log record on the undo log
page. */
UNIV_INLINE
void
trx_undof_page_add_undo_rec_log(
/*============================*/
page_t* undo_page, /*!< in: undo log page */
ulint old_free, /*!< in: start offset of the inserted entry */
ulint new_free, /*!< in: end offset of the entry */
mtr_t* mtr) /*!< in: mtr */
{
byte* log_ptr;
const byte* log_end;
ulint len;
log_ptr = mlog_open(mtr, 11 + 13 + MLOG_BUF_MARGIN);
if (log_ptr == NULL) {
return;
}
log_end = &log_ptr[11 + 13 + MLOG_BUF_MARGIN];
log_ptr = mlog_write_initial_log_record_fast(
undo_page, MLOG_UNDO_INSERT, log_ptr, mtr);
len = new_free - old_free - 4;
mach_write_to_2(log_ptr, len);
log_ptr += 2;
if (log_ptr + len <= log_end) {
memcpy(log_ptr, undo_page + old_free + 2, len);
mlog_close(mtr, log_ptr + len);
} else {
mlog_close(mtr, log_ptr);
mlog_catenate_string(mtr, undo_page + old_free + 2, len);
}
}
#endif /* !UNIV_HOTBACKUP */
/***********************************************************//**
Parses a redo log record of adding an undo log record.
@return end of log record or NULL */
UNIV_INTERN
byte*
trx_undo_parse_add_undo_rec(
/*========================*/
byte* ptr, /*!< in: buffer */
byte* end_ptr,/*!< in: buffer end */
page_t* page) /*!< in: page or NULL */
{
ulint len;
byte* rec;
ulint first_free;
if (end_ptr < ptr + 2) {
return(NULL);
}
len = mach_read_from_2(ptr);
ptr += 2;
if (end_ptr < ptr + len) {
return(NULL);
}
if (page == NULL) {
return(ptr + len);
}
first_free = mach_read_from_2(page + TRX_UNDO_PAGE_HDR
+ TRX_UNDO_PAGE_FREE);
rec = page + first_free;
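/* On the undo page, the record being re-created occupies: a 2-byte
offset of the next record, the len bytes of record data, and a trailing
2-byte offset pointing back to the start of this record; hence the
first-free offset advances by 4 + len below. */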
mach_write_to_2(rec, first_free + 4 + len);
mach_write_to_2(rec + 2 + len, first_free);
mach_write_to_2(page + TRX_UNDO_PAGE_HDR + TRX_UNDO_PAGE_FREE,
first_free + 4 + len);
ut_memcpy(rec + 2, ptr, len);
return(ptr + len);
}
#ifndef UNIV_HOTBACKUP
/**********************************************************************//**
Calculates the free space left for extending an undo log record.
@return bytes left */
UNIV_INLINE
ulint
trx_undo_left(
/*==========*/
const page_t* page, /*!< in: undo log page */
const byte* ptr) /*!< in: pointer to page */
{
/* The '- 10' is a safety margin, in case we have some small
calculation error below */
return(UNIV_PAGE_SIZE - (ptr - page) - 10 - FIL_PAGE_DATA_END);
}
/**********************************************************************//**
Set the next and previous pointers in the undo page for the undo record
that was written to ptr. Update the first free value by the number of bytes
written for this undo record.
@return offset of the inserted entry on the page if succeeded, 0 if fail */
static
ulint
trx_undo_page_set_next_prev_and_add(
/*================================*/
page_t* undo_page, /*!< in/out: undo log page */
byte* ptr, /*!< in: ptr up to where data has been
written on this undo page. */
mtr_t* mtr) /*!< in: mtr */
{
ulint first_free; /*!< offset within undo_page */
ulint end_of_rec; /*!< offset within undo_page */
byte* ptr_to_first_free;
/* pointer within undo_page
that points to the next free
offset value within undo_page.*/
ut_ad(ptr > undo_page);
ut_ad(ptr < undo_page + UNIV_PAGE_SIZE);
if (UNIV_UNLIKELY(trx_undo_left(undo_page, ptr) < 2)) {
return(0);
}
ptr_to_first_free = undo_page + TRX_UNDO_PAGE_HDR + TRX_UNDO_PAGE_FREE;
first_free = mach_read_from_2(ptr_to_first_free);
/* Write offset of the previous undo log record */
mach_write_to_2(ptr, first_free);
ptr += 2;
end_of_rec = ptr - undo_page;
/* Write offset of the next undo log record */
mach_write_to_2(undo_page + first_free, end_of_rec);
/* Update the offset to first free undo record */
mach_write_to_2(ptr_to_first_free, end_of_rec);
/* Write this log entry to the UNDO log */
trx_undof_page_add_undo_rec_log(undo_page, first_free,
end_of_rec, mtr);
return(first_free);
}
/**********************************************************************//**
Reports in the undo log of an insert of a clustered index record.
@return offset of the inserted entry on the page if succeed, 0 if fail */
static
ulint
trx_undo_page_report_insert(
/*========================*/
page_t* undo_page, /*!< in: undo log page */
trx_t* trx, /*!< in: transaction */
dict_index_t* index, /*!< in: clustered index */
const dtuple_t* clust_entry, /*!< in: index entry which will be
inserted to the clustered index */
mtr_t* mtr) /*!< in: mtr */
{
ulint first_free;
byte* ptr;
ulint i;
ut_ad(dict_index_is_clust(index));
ut_ad(mach_read_from_2(undo_page + TRX_UNDO_PAGE_HDR
+ TRX_UNDO_PAGE_TYPE) == TRX_UNDO_INSERT);
first_free = mach_read_from_2(undo_page + TRX_UNDO_PAGE_HDR
+ TRX_UNDO_PAGE_FREE);
ptr = undo_page + first_free;
ut_ad(first_free <= UNIV_PAGE_SIZE);
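/* The general parameters stored below need at most 2 bytes for the
next-record pointer, 1 byte for the record type, and 11 bytes each for
the much-compressed undo number and table id. */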
if (trx_undo_left(undo_page, ptr) < 2 + 1 + 11 + 11) {
/* Not enough space for writing the general parameters */
return(0);
}
/* Reserve 2 bytes for the pointer to the next undo log record */
ptr += 2;
/* Store first some general parameters to the undo log */
*ptr++ = TRX_UNDO_INSERT_REC;
ptr += mach_dulint_write_much_compressed(ptr, trx->undo_no);
ptr += mach_dulint_write_much_compressed(ptr, index->table->id);
/*----------------------------------------*/
/* Store then the fields required to uniquely determine the record
to be inserted in the clustered index */
for (i = 0; i < dict_index_get_n_unique(index); i++) {
const dfield_t* field = dtuple_get_nth_field(clust_entry, i);
ulint flen = dfield_get_len(field);
if (trx_undo_left(undo_page, ptr) < 5) {
return(0);
}
ptr += mach_write_compressed(ptr, flen);
if (flen != UNIV_SQL_NULL) {
if (trx_undo_left(undo_page, ptr) < flen) {
return(0);
}
ut_memcpy(ptr, dfield_get_data(field), flen);
ptr += flen;
}
}
return(trx_undo_page_set_next_prev_and_add(undo_page, ptr, mtr));
}
/**********************************************************************//**
Reads from an undo log record the general parameters.
@return remaining part of undo log record after reading these values */
UNIV_INTERN
byte*
trx_undo_rec_get_pars(
/*==================*/
trx_undo_rec_t* undo_rec, /*!< in: undo log record */
ulint* type, /*!< out: undo record type:
TRX_UNDO_INSERT_REC, ... */
ulint* cmpl_info, /*!< out: compiler info, relevant only
for update type records */
ibool* updated_extern, /*!< out: TRUE if we updated an
externally stored field */
undo_no_t* undo_no, /*!< out: undo log record number */
dulint* table_id) /*!< out: table id */
{
byte* ptr;
ulint type_cmpl;
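/* The first two bytes of the record hold the offset of the next
undo log record on the page; skip them to reach the type byte. */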
ptr = undo_rec + 2;
type_cmpl = mach_read_from_1(ptr);
ptr++;
if (type_cmpl & TRX_UNDO_UPD_EXTERN) {
*updated_extern = TRUE;
type_cmpl -= TRX_UNDO_UPD_EXTERN;
} else {
*updated_extern = FALSE;
}
*type = type_cmpl & (TRX_UNDO_CMPL_INFO_MULT - 1);
*cmpl_info = type_cmpl / TRX_UNDO_CMPL_INFO_MULT;
*undo_no = mach_dulint_read_much_compressed(ptr);
ptr += mach_dulint_get_much_compressed_size(*undo_no);
*table_id = mach_dulint_read_much_compressed(ptr);
ptr += mach_dulint_get_much_compressed_size(*table_id);
return(ptr);
}
/**********************************************************************//**
Reads from an undo log record a stored column value.
@return remaining part of undo log record after reading these values */
static
byte*
trx_undo_rec_get_col_val(
/*=====================*/
byte* ptr, /*!< in: pointer to remaining part of undo log record */
byte** field, /*!< out: pointer to stored field */
ulint* len, /*!< out: length of the field, or UNIV_SQL_NULL */
ulint* orig_len)/*!< out: original length of the locally
stored part of an externally stored column, or 0 */
{
*len = mach_read_compressed(ptr);
ptr += mach_get_compressed_size(*len);
*orig_len = 0;
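/* The compressed length read above doubles as a type marker:
UNIV_SQL_NULL denotes a NULL column; a plain UNIV_EXTERN_STORAGE_FIELD
precedes an explicitly stored (orig_len, len) pair for an externally
stored column whose prefix was fetched into the undo log; any other
value >= UNIV_EXTERN_STORAGE_FIELD encodes UNIV_EXTERN_STORAGE_FIELD
plus the stored length of an externally stored column. */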
switch (*len) {
case UNIV_SQL_NULL:
*field = NULL;
break;
case UNIV_EXTERN_STORAGE_FIELD:
*orig_len = mach_read_compressed(ptr);
ptr += mach_get_compressed_size(*orig_len);
*len = mach_read_compressed(ptr);
ptr += mach_get_compressed_size(*len);
*field = ptr;
ptr += *len;
ut_ad(*orig_len >= BTR_EXTERN_FIELD_REF_SIZE);
ut_ad(*len > *orig_len);
/* @see dtuple_convert_big_rec() */
ut_ad(*len >= BTR_EXTERN_FIELD_REF_SIZE * 2);
/* we do not have access to index->table here
ut_ad(dict_table_get_format(index->table) >= DICT_TF_FORMAT_ZIP
|| *len >= REC_MAX_INDEX_COL_LEN
+ BTR_EXTERN_FIELD_REF_SIZE);
*/
*len += UNIV_EXTERN_STORAGE_FIELD;
break;
default:
*field = ptr;
if (*len >= UNIV_EXTERN_STORAGE_FIELD) {
ptr += *len - UNIV_EXTERN_STORAGE_FIELD;
} else {
ptr += *len;
}
}
return(ptr);
}
/*******************************************************************//**
Builds a row reference from an undo log record.
@return pointer to remaining part of undo record */
UNIV_INTERN
byte*
trx_undo_rec_get_row_ref(
/*=====================*/
byte* ptr, /*!< in: remaining part of a copy of an undo log
record, at the start of the row reference;
NOTE that this copy of the undo log record must
be preserved as long as the row reference is
used, as we do NOT copy the data in the
record! */
dict_index_t* index, /*!< in: clustered index */
dtuple_t** ref, /*!< out, own: row reference */
mem_heap_t* heap) /*!< in: memory heap from which the memory
needed is allocated */
{
ulint ref_len;
ulint i;
ut_ad(index && ptr && ref && heap);
ut_a(dict_index_is_clust(index));
ref_len = dict_index_get_n_unique(index);
*ref = dtuple_create(heap, ref_len);
dict_index_copy_types(*ref, index, ref_len);
for (i = 0; i < ref_len; i++) {
dfield_t* dfield;
byte* field;
ulint len;
ulint orig_len;
dfield = dtuple_get_nth_field(*ref, i);
ptr = trx_undo_rec_get_col_val(ptr, &field, &len, &orig_len);
dfield_set_data(dfield, field, len);
}
return(ptr);
}
/*******************************************************************//**
Skips a row reference from an undo log record.
@return pointer to remaining part of undo record */
UNIV_INTERN
byte*
trx_undo_rec_skip_row_ref(
/*======================*/
byte* ptr, /*!< in: remaining part in update undo log
record, at the start of the row reference */
dict_index_t* index) /*!< in: clustered index */
{
ulint ref_len;
ulint i;
ut_ad(index && ptr);
ut_a(dict_index_is_clust(index));
ref_len = dict_index_get_n_unique(index);
for (i = 0; i < ref_len; i++) {
byte* field;
ulint len;
ulint orig_len;
ptr = trx_undo_rec_get_col_val(ptr, &field, &len, &orig_len);
}
return(ptr);
}
/**********************************************************************//**
Fetch a prefix of an externally stored column, for writing to the undo log
of an update or delete marking of a clustered index record.
@return ext_buf */
static
byte*
trx_undo_page_fetch_ext(
/*====================*/
byte* ext_buf, /*!< in: a buffer of
REC_MAX_INDEX_COL_LEN
+ BTR_EXTERN_FIELD_REF_SIZE */
ulint zip_size, /*!< compressed page size in bytes,
or 0 for uncompressed BLOB */
const byte* field, /*!< in: an externally stored column */
ulint* len) /*!< in: length of field;
out: used length of ext_buf */
{
/* Fetch the BLOB. */
ulint ext_len = btr_copy_externally_stored_field_prefix(
ext_buf, REC_MAX_INDEX_COL_LEN, zip_size, field, *len);
/* BLOBs should always be nonempty. */
ut_a(ext_len);
/* Append the BLOB pointer to the prefix. */
memcpy(ext_buf + ext_len,
field + *len - BTR_EXTERN_FIELD_REF_SIZE,
BTR_EXTERN_FIELD_REF_SIZE);
*len = ext_len + BTR_EXTERN_FIELD_REF_SIZE;
return(ext_buf);
}
/**********************************************************************//**
Writes to the undo log a prefix of an externally stored column.
@return undo log position */
static
byte*
trx_undo_page_report_modify_ext(
/*============================*/
byte* ptr, /*!< in: undo log position,
at least 15 bytes must be available */
byte* ext_buf, /*!< in: a buffer of
REC_MAX_INDEX_COL_LEN
+ BTR_EXTERN_FIELD_REF_SIZE,
or NULL when should not fetch
a longer prefix */
ulint zip_size, /*!< compressed page size in bytes,
or 0 for uncompressed BLOB */
const byte** field, /*!< in/out: the locally stored part of
the externally stored column */
ulint* len) /*!< in/out: length of field, in bytes */
{
if (ext_buf) {
/* If an ordering column is externally stored, we will
have to store a longer prefix of the field. In this
case, write to the log a marker followed by the
original length and the real length of the field. */
ptr += mach_write_compressed(ptr, UNIV_EXTERN_STORAGE_FIELD);
ptr += mach_write_compressed(ptr, *len);
*field = trx_undo_page_fetch_ext(ext_buf, zip_size,
*field, len);
ptr += mach_write_compressed(ptr, *len);
} else {
ptr += mach_write_compressed(ptr, UNIV_EXTERN_STORAGE_FIELD
+ *len);
}
return(ptr);
}
/**********************************************************************//**
Reports in the undo log of an update or delete marking of a clustered index
record.
@return byte offset of the inserted undo log entry on the page if
succeed, 0 if fail */
static
ulint
trx_undo_page_report_modify(
/*========================*/
page_t* undo_page, /*!< in: undo log page */
trx_t* trx, /*!< in: transaction */
dict_index_t* index, /*!< in: clustered index where update or
delete marking is done */
const rec_t* rec, /*!< in: clustered index record which
has NOT yet been modified */
const ulint* offsets, /*!< in: rec_get_offsets(rec, index) */
const upd_t* update, /*!< in: update vector which tells the
columns to be updated; in the case of
a delete, this should be set to NULL */
ulint cmpl_info, /*!< in: compiler info on secondary
index updates */
mtr_t* mtr) /*!< in: mtr */
{
dict_table_t* table;
ulint first_free;
byte* ptr;
const byte* field;
ulint flen;
ulint col_no;
ulint type_cmpl;
byte* type_cmpl_ptr;
ulint i;
trx_id_t trx_id;
ibool ignore_prefix = FALSE;
byte ext_buf[REC_MAX_INDEX_COL_LEN
+ BTR_EXTERN_FIELD_REF_SIZE];
ut_a(dict_index_is_clust(index));
ut_ad(rec_offs_validate(rec, index, offsets));
ut_ad(mach_read_from_2(undo_page + TRX_UNDO_PAGE_HDR
+ TRX_UNDO_PAGE_TYPE) == TRX_UNDO_UPDATE);
table = index->table;
first_free = mach_read_from_2(undo_page + TRX_UNDO_PAGE_HDR
+ TRX_UNDO_PAGE_FREE);
ptr = undo_page + first_free;
ut_ad(first_free <= UNIV_PAGE_SIZE);
if (trx_undo_left(undo_page, ptr) < 50) {
/* NOTE: the value 50 must be big enough so that the general
fields written below fit on the undo log page */
return(0);
}
/* Reserve 2 bytes for the pointer to the next undo log record */
ptr += 2;
/* Store first some general parameters to the undo log */
if (!update) {
type_cmpl = TRX_UNDO_DEL_MARK_REC;
} else if (rec_get_deleted_flag(rec, dict_table_is_comp(table))) {
type_cmpl = TRX_UNDO_UPD_DEL_REC;
/* We are about to update a delete marked record.
We don't typically need the prefix in this case unless
the delete marking is done by the same transaction
(which we check below). */
ignore_prefix = TRUE;
} else {
type_cmpl = TRX_UNDO_UPD_EXIST_REC;
}
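/* Pack the compiler info into the bits above the record type. A
pointer to this byte is kept so that TRX_UNDO_UPD_EXTERN can be
OR'ed in later if an updated column turns out to be externally
stored. */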
type_cmpl |= cmpl_info * TRX_UNDO_CMPL_INFO_MULT;
type_cmpl_ptr = ptr;
*ptr++ = (byte) type_cmpl;
ptr += mach_dulint_write_much_compressed(ptr, trx->undo_no);
ptr += mach_dulint_write_much_compressed(ptr, table->id);
/*----------------------------------------*/
/* Store the state of the info bits */
*ptr++ = (byte) rec_get_info_bits(rec, dict_table_is_comp(table));
/* Store the values of the system columns */
field = rec_get_nth_field(rec, offsets,
dict_index_get_sys_col_pos(
index, DATA_TRX_ID), &flen);
ut_ad(flen == DATA_TRX_ID_LEN);
trx_id = trx_read_trx_id(field);
/* If it is an update of a delete marked record, then we are
allowed to ignore blob prefixes if the delete marking was done
by some other trx as it must have committed by now for us to
allow an over-write. */
if (ignore_prefix) {
ignore_prefix = ut_dulint_cmp(trx_id, trx->id) != 0;
}
ptr += mach_dulint_write_compressed(ptr, trx_id);
field = rec_get_nth_field(rec, offsets,
dict_index_get_sys_col_pos(
index, DATA_ROLL_PTR), &flen);
ut_ad(flen == DATA_ROLL_PTR_LEN);
ptr += mach_dulint_write_compressed(ptr, trx_read_roll_ptr(field));
/*----------------------------------------*/
/* Store then the fields required to uniquely determine the
record which will be modified in the clustered index */
for (i = 0; i < dict_index_get_n_unique(index); i++) {
field = rec_get_nth_field(rec, offsets, i, &flen);
/* The ordering columns must not be stored externally. */
ut_ad(!rec_offs_nth_extern(offsets, i));
ut_ad(dict_index_get_nth_col(index, i)->ord_part);
if (trx_undo_left(undo_page, ptr) < 5) {
return(0);
}
ptr += mach_write_compressed(ptr, flen);
if (flen != UNIV_SQL_NULL) {
if (trx_undo_left(undo_page, ptr) < flen) {
return(0);
}
ut_memcpy(ptr, field, flen);
ptr += flen;
}
}
/*----------------------------------------*/
/* Save to the undo log the old values of the columns to be updated. */
if (update) {
if (trx_undo_left(undo_page, ptr) < 5) {
return(0);
}
ptr += mach_write_compressed(ptr, upd_get_n_fields(update));
for (i = 0; i < upd_get_n_fields(update); i++) {
ulint pos = upd_get_nth_field(update, i)->field_no;
/* Write field number to undo log */
if (trx_undo_left(undo_page, ptr) < 5) {
return(0);
}
ptr += mach_write_compressed(ptr, pos);
/* Save the old value of field */
field = rec_get_nth_field(rec, offsets, pos, &flen);
if (trx_undo_left(undo_page, ptr) < 15) {
return(0);
}
if (rec_offs_nth_extern(offsets, pos)) {
ptr = trx_undo_page_report_modify_ext(
ptr,
dict_index_get_nth_col(index, pos)
->ord_part
&& !ignore_prefix
&& flen < REC_MAX_INDEX_COL_LEN
? ext_buf : NULL,
dict_table_zip_size(table),
&field, &flen);
/* Notify purge that it eventually has to
free the old externally stored field */
trx->update_undo->del_marks = TRUE;
*type_cmpl_ptr |= TRX_UNDO_UPD_EXTERN;
} else {
ptr += mach_write_compressed(ptr, flen);
}
if (flen != UNIV_SQL_NULL) {
if (trx_undo_left(undo_page, ptr) < flen) {
return(0);
}
ut_memcpy(ptr, field, flen);
ptr += flen;
}
}
}
/*----------------------------------------*/
/* In the case of a delete marking, and also in the case of an update
where any ordering field of any index changes, store the values of all
columns which occur as ordering fields in any index. This info is used
in the purge of old versions where we use it to build and search the
delete marked index records, to look if we can remove them from the
index tree. Note that starting from 4.0.14 also externally stored
fields can be ordering in some index. Starting from 5.2, we no longer
store REC_MAX_INDEX_COL_LEN first bytes to the undo log record,
but we can construct the column prefix fields in the index by
fetching the first page of the BLOB that is pointed to by the
clustered index. This works also in crash recovery, because all pages
(including BLOBs) are recovered before anything is rolled back. */
if (!update || !(cmpl_info & UPD_NODE_NO_ORD_CHANGE)) {
byte* old_ptr = ptr;
trx->update_undo->del_marks = TRUE;
if (trx_undo_left(undo_page, ptr) < 5) {
return(0);
}
/* Reserve 2 bytes to write the number of bytes the stored
fields take in this undo record */
ptr += 2;
for (col_no = 0; col_no < dict_table_get_n_cols(table);
col_no++) {
const dict_col_t* col
= dict_table_get_nth_col(table, col_no);
if (col->ord_part) {
ulint pos;
/* Write field number to undo log */
if (trx_undo_left(undo_page, ptr) < 5 + 15) {
return(0);
}
pos = dict_index_get_nth_col_pos(index,
col_no);
ptr += mach_write_compressed(ptr, pos);
/* Save the old value of field */
field = rec_get_nth_field(rec, offsets, pos,
&flen);
if (rec_offs_nth_extern(offsets, pos)) {
ptr = trx_undo_page_report_modify_ext(
ptr,
flen < REC_MAX_INDEX_COL_LEN
&& !ignore_prefix
? ext_buf : NULL,
dict_table_zip_size(table),
&field, &flen);
} else {
ptr += mach_write_compressed(
ptr, flen);
}
if (flen != UNIV_SQL_NULL) {
if (trx_undo_left(undo_page, ptr)
< flen) {
return(0);
}
ut_memcpy(ptr, field, flen);
ptr += flen;
}
}
}
mach_write_to_2(old_ptr, ptr - old_ptr);
}
/*----------------------------------------*/
/* Write pointers to the previous and the next undo log records */
if (trx_undo_left(undo_page, ptr) < 2) {
return(0);
}
mach_write_to_2(ptr, first_free);
ptr += 2;
mach_write_to_2(undo_page + first_free, ptr - undo_page);
mach_write_to_2(undo_page + TRX_UNDO_PAGE_HDR + TRX_UNDO_PAGE_FREE,
ptr - undo_page);
/* Write to the REDO log about this change in the UNDO log */
trx_undof_page_add_undo_rec_log(undo_page, first_free,
ptr - undo_page, mtr);
return(first_free);
}
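/* To summarize trx_undo_page_report_modify(): the record written above
consists of a 2-byte next-record pointer (filled in at the end), a
type_cmpl byte, the much-compressed undo number and table id, the info
bits, the compressed trx id and roll ptr of the old version, the unique
ordering fields of the record, the old values of the updated columns,
optionally the values of all ordering columns for purge, and finally
the trailing 2-byte pointer linking back to the previous record. The
reader functions below consume these fields in the same order. */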
/**********************************************************************//**
Reads from an undo log update record the system field values of the old
version.
@return remaining part of undo log record after reading these values */
UNIV_INTERN
byte*
trx_undo_update_rec_get_sys_cols(
/*=============================*/
byte* ptr, /*!< in: remaining part of undo
log record after reading
general parameters */
trx_id_t* trx_id, /*!< out: trx id */
roll_ptr_t* roll_ptr, /*!< out: roll ptr */
ulint* info_bits) /*!< out: info bits state */
{
/* Read the state of the info bits */
*info_bits = mach_read_from_1(ptr);
ptr += 1;
/* Read the values of the system columns */
*trx_id = mach_dulint_read_compressed(ptr);
ptr += mach_dulint_get_compressed_size(*trx_id);
*roll_ptr = mach_dulint_read_compressed(ptr);
ptr += mach_dulint_get_compressed_size(*roll_ptr);
return(ptr);
}
/**********************************************************************//**
Reads from an update undo log record the number of updated fields.
@return remaining part of undo log record after reading this value */
UNIV_INLINE
byte*
trx_undo_update_rec_get_n_upd_fields(
/*=================================*/
byte* ptr, /*!< in: pointer to remaining part of undo log record */
ulint* n) /*!< out: number of fields */
{
*n = mach_read_compressed(ptr);
ptr += mach_get_compressed_size(*n);
return(ptr);
}
/**********************************************************************//**
Reads from an update undo log record a stored field number.
@return remaining part of undo log record after reading this value */
UNIV_INLINE
byte*
trx_undo_update_rec_get_field_no(
/*=============================*/
byte* ptr, /*!< in: pointer to remaining part of undo log record */
ulint* field_no)/*!< out: field number */
{
*field_no = mach_read_compressed(ptr);
ptr += mach_get_compressed_size(*field_no);
return(ptr);
}
/*******************************************************************//**
Builds an update vector based on a remaining part of an undo log record.
@return remaining part of the record, NULL if an error detected, which
means that the record is corrupted */
UNIV_INTERN
byte*
trx_undo_update_rec_get_update(
/*===========================*/
byte* ptr, /*!< in: remaining part in update undo log
record, after reading the row reference
NOTE that this copy of the undo log record must
be preserved as long as the update vector is
used, as we do NOT copy the data in the
record! */
dict_index_t* index, /*!< in: clustered index */
ulint type, /*!< in: TRX_UNDO_UPD_EXIST_REC,
TRX_UNDO_UPD_DEL_REC, or
TRX_UNDO_DEL_MARK_REC; in the last case,
only trx id and roll ptr fields are added to
the update vector */
trx_id_t trx_id, /*!< in: transaction id from this undo record */
roll_ptr_t roll_ptr,/*!< in: roll pointer from this undo record */
ulint info_bits,/*!< in: info bits from this undo record */
trx_t* trx, /*!< in: transaction */
mem_heap_t* heap, /*!< in: memory heap from which the memory
needed is allocated */
upd_t** upd) /*!< out, own: update vector */
{
upd_field_t* upd_field;
upd_t* update;
ulint n_fields;
byte* buf;
ulint i;
ut_a(dict_index_is_clust(index));
if (type != TRX_UNDO_DEL_MARK_REC) {
ptr = trx_undo_update_rec_get_n_upd_fields(ptr, &n_fields);
} else {
n_fields = 0;
}
update = upd_create(n_fields + 2, heap);
update->info_bits = info_bits;
/* Store first trx id and roll ptr to update vector */
upd_field = upd_get_nth_field(update, n_fields);
buf = mem_heap_alloc(heap, DATA_TRX_ID_LEN);
trx_write_trx_id(buf, trx_id);
upd_field_set_field_no(upd_field,
dict_index_get_sys_col_pos(index, DATA_TRX_ID),
index, trx);
dfield_set_data(&(upd_field->new_val), buf, DATA_TRX_ID_LEN);
upd_field = upd_get_nth_field(update, n_fields + 1);
buf = mem_heap_alloc(heap, DATA_ROLL_PTR_LEN);
trx_write_roll_ptr(buf, roll_ptr);
upd_field_set_field_no(
upd_field, dict_index_get_sys_col_pos(index, DATA_ROLL_PTR),
index, trx);
dfield_set_data(&(upd_field->new_val), buf, DATA_ROLL_PTR_LEN);
/* Store then the updated ordinary columns to the update vector */
for (i = 0; i < n_fields; i++) {
byte* field;
ulint len;
ulint field_no;
ulint orig_len;
ptr = trx_undo_update_rec_get_field_no(ptr, &field_no);
if (field_no >= dict_index_get_n_fields(index)) {
fprintf(stderr,
"InnoDB: Error: trying to access"
" update undo rec field %lu in ",
(ulong) field_no);
dict_index_name_print(stderr, trx, index);
fprintf(stderr, "\n"
"InnoDB: but index has only %lu fields\n"
"InnoDB: Submit a detailed bug report"
" to http://bugs.mysql.com\n"
"InnoDB: Run also CHECK TABLE ",
(ulong) dict_index_get_n_fields(index));
ut_print_name(stderr, trx, TRUE, index->table_name);
fprintf(stderr, "\n"
"InnoDB: n_fields = %lu, i = %lu, ptr %p\n",
(ulong) n_fields, (ulong) i, ptr);
*upd = NULL;
return(NULL);
}
upd_field = upd_get_nth_field(update, i);
upd_field_set_field_no(upd_field, field_no, index, trx);
ptr = trx_undo_rec_get_col_val(ptr, &field, &len, &orig_len);
upd_field->orig_len = orig_len;
if (len == UNIV_SQL_NULL) {
dfield_set_null(&upd_field->new_val);
} else if (len < UNIV_EXTERN_STORAGE_FIELD) {
dfield_set_data(&upd_field->new_val, field, len);
} else {
len -= UNIV_EXTERN_STORAGE_FIELD;
dfield_set_data(&upd_field->new_val, field, len);
dfield_set_ext(&upd_field->new_val);
}
}
*upd = update;
return(ptr);
}
/*******************************************************************//**
Builds a partial row from an update undo log record. It contains the
columns which occur as ordering in any index of the table.
@return pointer to remaining part of undo record */
UNIV_INTERN
byte*
trx_undo_rec_get_partial_row(
/*=========================*/
byte* ptr, /*!< in: remaining part in update undo log
record of a suitable type, at the start of
the stored index columns;
NOTE that this copy of the undo log record must
be preserved as long as the partial row is
used, as we do NOT copy the data in the
record! */
dict_index_t* index, /*!< in: clustered index */
dtuple_t** row, /*!< out, own: partial row */
ibool ignore_prefix, /*!< in: flag to indicate if we
expect blob prefixes in undo. Used
only in the assertion. */
mem_heap_t* heap) /*!< in: memory heap from which the memory
needed is allocated */
{
const byte* end_ptr;
ulint row_len;
ut_ad(index);
ut_ad(ptr);
ut_ad(row);
ut_ad(heap);
ut_ad(dict_index_is_clust(index));
row_len = dict_table_get_n_cols(index->table);
*row = dtuple_create(heap, row_len);
dict_table_copy_types(*row, index->table);
end_ptr = ptr + mach_read_from_2(ptr);
ptr += 2;
while (ptr != end_ptr) {
dfield_t* dfield;
byte* field;
ulint field_no;
const dict_col_t* col;
ulint col_no;
ulint len;
ulint orig_len;
ptr = trx_undo_update_rec_get_field_no(ptr, &field_no);
col = dict_index_get_nth_col(index, field_no);
col_no = dict_col_get_no(col);
ptr = trx_undo_rec_get_col_val(ptr, &field, &len, &orig_len);
dfield = dtuple_get_nth_field(*row, col_no);
dfield_set_data(dfield, field, len);
if (len != UNIV_SQL_NULL
&& len >= UNIV_EXTERN_STORAGE_FIELD) {
dfield_set_len(dfield,
len - UNIV_EXTERN_STORAGE_FIELD);
dfield_set_ext(dfield);
/* If the prefix of this column is indexed,
ensure that enough prefix is stored in the
undo log record. */
if (!ignore_prefix && col->ord_part) {
ut_a(dfield_get_len(dfield)
>= 2 * BTR_EXTERN_FIELD_REF_SIZE);
ut_a(dict_table_get_format(index->table)
>= DICT_TF_FORMAT_ZIP
|| dfield_get_len(dfield)
>= REC_MAX_INDEX_COL_LEN
+ BTR_EXTERN_FIELD_REF_SIZE);
}
}
}
return(ptr);
}
#endif /* !UNIV_HOTBACKUP */
/***********************************************************************//**
Erases the unused undo log page end.
@return TRUE if the page contained something, FALSE if it was empty */
static __attribute__((nonnull))
ibool
trx_undo_erase_page_end(
/*====================*/
page_t* undo_page, /*!< in/out: undo page whose end to erase */
mtr_t* mtr) /*!< in/out: mini-transaction */
{
ulint first_free;
first_free = mach_read_from_2(undo_page + TRX_UNDO_PAGE_HDR
+ TRX_UNDO_PAGE_FREE);
memset(undo_page + first_free, 0xff,
(UNIV_PAGE_SIZE - FIL_PAGE_DATA_END) - first_free);
mlog_write_initial_log_record(undo_page, MLOG_UNDO_ERASE_END, mtr);
  896. return(first_free != TRX_UNDO_PAGE_HDR + TRX_UNDO_PAGE_HDR_SIZE);
  897. }
/***********************************************************//**
Parses a redo log record of erasing an undo page end.
@return end of log record or NULL */
UNIV_INTERN
byte*
trx_undo_parse_erase_page_end(
/*==========================*/
	byte*	ptr,	/*!< in: buffer */
	byte*	end_ptr __attribute__((unused)), /*!< in: buffer end */
	page_t*	page,	/*!< in: page or NULL */
	mtr_t*	mtr)	/*!< in: mtr or NULL */
{
	ut_ad(ptr && end_ptr);

	if (page == NULL) {

		return(ptr);
	}

	trx_undo_erase_page_end(page, mtr);

	return(ptr);
}
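/* Illustrative sketch (not part of the original file): the redo log
round trip for MLOG_UNDO_ERASE_END. At runtime trx_undo_erase_page_end()
both modifies the page and writes the log record; during recovery the
log parser dispatches the record back to trx_undo_parse_erase_page_end(),
which replays the same erase. The dispatch below is a simplified,
hypothetical excerpt of what such a recovery switch case could look
like, not the actual recv code. */
#if 0
	case MLOG_UNDO_ERASE_END:
		ptr = trx_undo_parse_erase_page_end(ptr, end_ptr,
						    page, mtr);
		break;
#endif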
#ifndef UNIV_HOTBACKUP
/***********************************************************************//**
Writes information to an undo log about an insert, update, or a delete
marking of a clustered index record. This information is used in a rollback
of the transaction and in consistent reads that must look to the history
of this transaction.
@return DB_SUCCESS or error code */
UNIV_INTERN
ulint
trx_undo_report_row_operation(
/*==========================*/
	ulint		flags,		/*!< in: if BTR_NO_UNDO_LOG_FLAG bit is
					set, does nothing */
	ulint		op_type,	/*!< in: TRX_UNDO_INSERT_OP or
					TRX_UNDO_MODIFY_OP */
	que_thr_t*	thr,		/*!< in: query thread */
	dict_index_t*	index,		/*!< in: clustered index */
	const dtuple_t*	clust_entry,	/*!< in: in the case of an insert,
					index entry to insert into the
					clustered index, otherwise NULL */
	const upd_t*	update,		/*!< in: in the case of an update,
					the update vector, otherwise NULL */
	ulint		cmpl_info,	/*!< in: compiler info on secondary
					index updates */
	const rec_t*	rec,		/*!< in: in case of an update or delete
					marking, the record in the clustered
					index, otherwise NULL */
	roll_ptr_t*	roll_ptr)	/*!< out: rollback pointer to the
					inserted undo log record,
					ut_dulint_zero if BTR_NO_UNDO_LOG
					flag was specified */
{
	trx_t*		trx;
	trx_undo_t*	undo;
	ulint		page_no;
	trx_rseg_t*	rseg;
	mtr_t		mtr;
	ulint		err		= DB_SUCCESS;
	mem_heap_t*	heap		= NULL;
	ulint		offsets_[REC_OFFS_NORMAL_SIZE];
	ulint*		offsets		= offsets_;
#ifdef UNIV_DEBUG
	int		loop_count	= 0;
#endif /* UNIV_DEBUG */

	rec_offs_init(offsets_);

	ut_a(dict_index_is_clust(index));

	if (flags & BTR_NO_UNDO_LOG_FLAG) {

		*roll_ptr = ut_dulint_zero;

		return(DB_SUCCESS);
	}

	ut_ad(thr);
	ut_ad((op_type != TRX_UNDO_INSERT_OP)
	      || (clust_entry && !update && !rec));

	trx = thr_get_trx(thr);
	rseg = trx->rseg;

	mutex_enter(&(trx->undo_mutex));

	/* If the undo log is not assigned yet, assign one */

	if (op_type == TRX_UNDO_INSERT_OP) {

		if (trx->insert_undo == NULL) {

			err = trx_undo_assign_undo(trx, TRX_UNDO_INSERT);
		}

		undo = trx->insert_undo;

		if (UNIV_UNLIKELY(!undo)) {
			/* Did not succeed */
			mutex_exit(&(trx->undo_mutex));

			return(err);
		}
	} else {
		ut_ad(op_type == TRX_UNDO_MODIFY_OP);

		if (trx->update_undo == NULL) {

			err = trx_undo_assign_undo(trx, TRX_UNDO_UPDATE);
		}

		undo = trx->update_undo;

		if (UNIV_UNLIKELY(!undo)) {
			/* Did not succeed */
			mutex_exit(&(trx->undo_mutex));
			return(err);
		}

		offsets = rec_get_offsets(rec, index, offsets,
					  ULINT_UNDEFINED, &heap);
	}

	page_no = undo->last_page_no;

	mtr_start(&mtr);

	do {
		buf_block_t*	undo_block;
		page_t*		undo_page;
		ulint		offset;

		undo_block = buf_page_get_gen(undo->space, undo->zip_size,
					      page_no, RW_X_LATCH,
					      undo->guess_block, BUF_GET,
					      __FILE__, __LINE__, &mtr);
		buf_block_dbg_add_level(undo_block, SYNC_TRX_UNDO_PAGE);

		undo_page = buf_block_get_frame(undo_block);

		if (op_type == TRX_UNDO_INSERT_OP) {
			offset = trx_undo_page_report_insert(
				undo_page, trx, index, clust_entry, &mtr);
		} else {
			offset = trx_undo_page_report_modify(
				undo_page, trx, index, rec, offsets, update,
				cmpl_info, &mtr);
		}

		if (UNIV_UNLIKELY(offset == 0)) {
			/* The record did not fit on the page. We erase the
			end segment of the undo log page and write a log
			record of it: this is to ensure that in the debug
			version the replicate page constructed using the log
			records stays identical to the original page */

			if (!trx_undo_erase_page_end(undo_page, &mtr)) {
				/* The record did not fit on an empty
				undo page. Discard the freshly allocated
				page and return an error. */

				/* When we remove a page from an undo
				log, this is analogous to a
				pessimistic insert in a B-tree, and we
				must reserve the counterpart of the
				tree latch, which is the rseg
				mutex. We must commit the mini-transaction
				first, because it may be holding lower-level
				latches, such as SYNC_FSP and SYNC_FSP_PAGE. */

				mtr_commit(&mtr);
				mtr_start(&mtr);

				mutex_enter(&rseg->mutex);
				trx_undo_free_last_page(trx, undo, &mtr);
				mutex_exit(&rseg->mutex);

				err = DB_TOO_BIG_RECORD;
				goto err_exit;
			}

			mtr_commit(&mtr);
		} else {
			/* Success */

			mtr_commit(&mtr);

			undo->empty = FALSE;
			undo->top_page_no = page_no;
			undo->top_offset = offset;
			undo->top_undo_no = trx->undo_no;
			undo->guess_block = undo_block;

			UT_DULINT_INC(trx->undo_no);

			mutex_exit(&trx->undo_mutex);

			*roll_ptr = trx_undo_build_roll_ptr(
				op_type == TRX_UNDO_INSERT_OP,
				rseg->id, page_no, offset);

			err = DB_SUCCESS;
			goto func_exit;
		}

		ut_ad(page_no == undo->last_page_no);

		/* We have to extend the undo log by one page */

		ut_ad(++loop_count < 2);
		mtr_start(&mtr);

		/* When we add a page to an undo log, this is analogous to
		a pessimistic insert in a B-tree, and we must reserve the
		counterpart of the tree latch, which is the rseg mutex. */

		mutex_enter(&(rseg->mutex));
		page_no = trx_undo_add_page(trx, undo, &mtr);
		mutex_exit(&(rseg->mutex));
	} while (UNIV_LIKELY(page_no != FIL_NULL));

	/* Did not succeed: out of space */
	err = DB_OUT_OF_FILE_SPACE;

err_exit:
	mutex_exit(&trx->undo_mutex);
	mtr_commit(&mtr);

func_exit:
	if (UNIV_LIKELY_NULL(heap)) {
		mem_heap_free(heap);
	}
	return(err);
}
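/* Illustrative sketch (not part of the original file): how the row
update code calls trx_undo_report_row_operation() before modifying a
clustered index record, in the style of row0upd.c. The variables
flags, thr, index, update, cmpl_info and rec are assumed to be set up
by the caller; the fragment is hypothetical. */
#if 0
	roll_ptr_t	roll_ptr;
	ulint		err;

	err = trx_undo_report_row_operation(flags, TRX_UNDO_MODIFY_OP,
					    thr, index, NULL, update,
					    cmpl_info, rec, &roll_ptr);
	if (err != DB_SUCCESS) {
		return(err);
	}

	/* The returned roll_ptr is then stamped into the DB_ROLL_PTR
	system column of the record, so that older versions can be
	reconstructed from the undo log. */
#endif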
/*============== BUILDING PREVIOUS VERSION OF A RECORD ===============*/

/******************************************************************//**
Copies an undo record to heap. This function can be called if we know that
the undo log record exists.
@return own: copy of the record */
UNIV_INTERN
trx_undo_rec_t*
trx_undo_get_undo_rec_low(
/*======================*/
	roll_ptr_t	roll_ptr,	/*!< in: roll pointer to record */
	mem_heap_t*	heap)		/*!< in: memory heap where copied */
{
	trx_undo_rec_t*	undo_rec;
	ulint		rseg_id;
	ulint		page_no;
	ulint		offset;
	const page_t*	undo_page;
	trx_rseg_t*	rseg;
	ibool		is_insert;
	mtr_t		mtr;

	trx_undo_decode_roll_ptr(roll_ptr, &is_insert, &rseg_id, &page_no,
				 &offset);
	rseg = trx_rseg_get_on_id(rseg_id);

	mtr_start(&mtr);

	undo_page = trx_undo_page_get_s_latched(rseg->space, rseg->zip_size,
						page_no, &mtr);

	undo_rec = trx_undo_rec_copy(undo_page + offset, heap);

	mtr_commit(&mtr);

	return(undo_rec);
}
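/* Illustrative sketch (not part of the original file): the roll pointer
decoded above is the same quantity that trx_undo_build_roll_ptr() packs
in trx_undo_report_row_operation(): an (is_insert, rseg_id, page_no,
offset) quadruple in one value. The fragment below only demonstrates
the decode step; the variables are hypothetical. */
#if 0
	ibool	is_insert;
	ulint	rseg_id;
	ulint	page_no;
	ulint	offset;

	trx_undo_decode_roll_ptr(roll_ptr, &is_insert, &rseg_id,
				 &page_no, &offset);
	/* (rseg_id, page_no, offset) locate the undo record;
	is_insert tells whether it lives in the insert or the update
	undo log of the transaction. */
#endif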
/******************************************************************//**
Copies an undo record to heap.

NOTE: the caller must have latches on the clustered index page and
purge_view.

@return DB_SUCCESS, or DB_MISSING_HISTORY if the undo log has been
truncated and we cannot fetch the old version */
UNIV_INTERN
ulint
trx_undo_get_undo_rec(
/*==================*/
	roll_ptr_t	roll_ptr,	/*!< in: roll pointer to record */
	trx_id_t	trx_id,		/*!< in: id of the trx that generated
					the roll pointer: it points to an
					undo log of this transaction */
	trx_undo_rec_t** undo_rec,	/*!< out, own: copy of the record */
	mem_heap_t*	heap)		/*!< in: memory heap where copied */
{
#ifdef UNIV_SYNC_DEBUG
	ut_ad(rw_lock_own(&(purge_sys->latch), RW_LOCK_SHARED));
#endif /* UNIV_SYNC_DEBUG */

	if (!trx_purge_update_undo_must_exist(trx_id)) {

		/* It may be that the necessary undo log has already been
		deleted */

		return(DB_MISSING_HISTORY);
	}

	*undo_rec = trx_undo_get_undo_rec_low(roll_ptr, heap);

	return(DB_SUCCESS);
}
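/* Illustrative sketch (not part of the original file): callers of
trx_undo_get_undo_rec() must hold an s-latch on purge_sys->latch so
that the purge view, and with it the needed undo log, cannot be
discarded under them. A minimal, hypothetical guard pattern: */
#if 0
	rw_lock_s_lock(&purge_sys->latch);

	err = trx_undo_get_undo_rec(roll_ptr, rec_trx_id,
				    &undo_rec, heap);

	rw_lock_s_unlock(&purge_sys->latch);
#endif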
/*******************************************************************//**
Build a previous version of a clustered index record. This function checks
that the caller has a latch on the index page of the clustered index record
and an s-latch on the purge_view. This guarantees that the stack of versions
is locked all the way down to the purge_view.
@return DB_SUCCESS, or DB_MISSING_HISTORY if the previous version is
earlier than purge_view, which means that it may have been removed,
DB_ERROR if corrupted record */
UNIV_INTERN
ulint
trx_undo_prev_version_build(
/*========================*/
	const rec_t*	index_rec,/*!< in: clustered index record in the
				index tree */
	mtr_t*		index_mtr __attribute__((unused)),
				/*!< in: mtr which contains the latch to
				index_rec page and purge_view */
	const rec_t*	rec,	/*!< in: version of a clustered index record */
	dict_index_t*	index,	/*!< in: clustered index */
	ulint*		offsets,/*!< in: rec_get_offsets(rec, index) */
	mem_heap_t*	heap,	/*!< in: memory heap from which the memory
				needed is allocated */
	rec_t**		old_vers)/*!< out, own: previous version, or NULL if
				rec is the first inserted version, or if
				history data has been deleted (an error),
				or if the purge COULD have removed the version
				though it has not yet done so */
{
	trx_undo_rec_t*	undo_rec	= NULL;
	dtuple_t*	entry;
	trx_id_t	rec_trx_id;
	ulint		type;
	undo_no_t	undo_no;
	dulint		table_id;
	trx_id_t	trx_id;
	roll_ptr_t	roll_ptr;
	roll_ptr_t	old_roll_ptr;
	upd_t*		update;
	byte*		ptr;
	ulint		info_bits;
	ulint		cmpl_info;
	ibool		dummy_extern;
	byte*		buf;
	ulint		err;

#ifdef UNIV_SYNC_DEBUG
	ut_ad(rw_lock_own(&(purge_sys->latch), RW_LOCK_SHARED));
#endif /* UNIV_SYNC_DEBUG */
	ut_ad(mtr_memo_contains_page(index_mtr, index_rec, MTR_MEMO_PAGE_S_FIX)
	      || mtr_memo_contains_page(index_mtr, index_rec,
					MTR_MEMO_PAGE_X_FIX));
	ut_ad(rec_offs_validate(rec, index, offsets));

	if (!dict_index_is_clust(index)) {
		fprintf(stderr, "InnoDB: Error: trying to access"
			" update undo rec for non-clustered index %s\n"
			"InnoDB: Submit a detailed bug report to"
			" http://bugs.mysql.com\n"
			"InnoDB: index record ", index->name);
		rec_print(stderr, index_rec, index);
		fputs("\n"
		      "InnoDB: record version ", stderr);
		rec_print_new(stderr, rec, offsets);
		putc('\n', stderr);
		return(DB_ERROR);
	}

	roll_ptr = row_get_rec_roll_ptr(rec, index, offsets);
	old_roll_ptr = roll_ptr;

	*old_vers = NULL;

	if (trx_undo_roll_ptr_is_insert(roll_ptr)) {

		/* The record rec is the first inserted version */

		return(DB_SUCCESS);
	}

	rec_trx_id = row_get_rec_trx_id(rec, index, offsets);

	err = trx_undo_get_undo_rec(roll_ptr, rec_trx_id, &undo_rec, heap);

	if (UNIV_UNLIKELY(err != DB_SUCCESS)) {
		/* The undo record may already have been purged.
		This should never happen in InnoDB. */

		return(err);
	}

	ptr = trx_undo_rec_get_pars(undo_rec, &type, &cmpl_info,
				    &dummy_extern, &undo_no, &table_id);

	ptr = trx_undo_update_rec_get_sys_cols(ptr, &trx_id, &roll_ptr,
					       &info_bits);

	/* (a) If a clustered index record version is such that the
	trx id stamp in it is bigger than purge_sys->view, then the
	BLOBs in that version are known to exist (the purge has not
	progressed that far);

	(b) if the version is the first version such that trx id in it
	is less than purge_sys->view, and it is not delete-marked,
	then the BLOBs in that version are known to exist (the purge
	cannot have purged the BLOBs referenced by that version
	yet).

	This function does not fetch any BLOBs. The callers might, by
	possibly invoking row_ext_create() via row_build(). However,
	they should have all needed information in the *old_vers
	returned by this function. This is because *old_vers is based
	on the transaction undo log records. The function
	trx_undo_page_fetch_ext() will write BLOB prefixes to the
	transaction undo log that are at least as long as the longest
	possible column prefix in a secondary index. Thus, secondary
	index entries for *old_vers can be constructed without
	dereferencing any BLOB pointers. */

	ptr = trx_undo_rec_skip_row_ref(ptr, index);

	ptr = trx_undo_update_rec_get_update(ptr, index, type, trx_id,
					     roll_ptr, info_bits,
					     NULL, heap, &update);

	if (ut_dulint_cmp(table_id, index->table->id) != 0) {
		ptr = NULL;

		fprintf(stderr,
			"InnoDB: Error: trying to access update undo rec"
			" for table %s\n"
			"InnoDB: but the table id in the"
			" undo record is wrong\n"
			"InnoDB: Submit a detailed bug report"
			" to http://bugs.mysql.com\n"
			"InnoDB: Run also CHECK TABLE %s\n",
			index->table_name, index->table_name);
	}

	if (ptr == NULL) {
		/* The record was corrupted, return an error; these printfs
		should catch an elusive bug in row_vers_old_has_index_entry */

		fprintf(stderr,
			"InnoDB: table %s, index %s, n_uniq %lu\n"
			"InnoDB: undo rec address %p, type %lu cmpl_info %lu\n"
			"InnoDB: undo rec table id %lu %lu,"
			" index table id %lu %lu\n"
			"InnoDB: dump of 150 bytes in undo rec: ",
			index->table_name, index->name,
			(ulong) dict_index_get_n_unique(index),
			undo_rec, (ulong) type, (ulong) cmpl_info,
			(ulong) ut_dulint_get_high(table_id),
			(ulong) ut_dulint_get_low(table_id),
			(ulong) ut_dulint_get_high(index->table->id),
			(ulong) ut_dulint_get_low(index->table->id));
		ut_print_buf(stderr, undo_rec, 150);
		fputs("\n"
		      "InnoDB: index record ", stderr);
		rec_print(stderr, index_rec, index);
		fputs("\n"
		      "InnoDB: record version ", stderr);
		rec_print_new(stderr, rec, offsets);
		fprintf(stderr, "\n"
			"InnoDB: Record trx id " TRX_ID_FMT
			", update rec trx id " TRX_ID_FMT "\n"
			"InnoDB: Roll ptr in rec %lu %lu, in update rec"
			" %lu %lu\n",
			TRX_ID_PREP_PRINTF(rec_trx_id),
			TRX_ID_PREP_PRINTF(trx_id),
			(ulong) ut_dulint_get_high(old_roll_ptr),
			(ulong) ut_dulint_get_low(old_roll_ptr),
			(ulong) ut_dulint_get_high(roll_ptr),
			(ulong) ut_dulint_get_low(roll_ptr));

		trx_purge_sys_print();
		return(DB_ERROR);
	}

# if defined UNIV_DEBUG || defined UNIV_BLOB_LIGHT_DEBUG
	ut_a(!rec_offs_any_null_extern(rec, offsets));
# endif /* UNIV_DEBUG || UNIV_BLOB_LIGHT_DEBUG */

	if (row_upd_changes_field_size_or_external(index, offsets, update)) {
		ulint	n_ext;

		/* We have to set the appropriate extern storage bits in the
		old version of the record: the extern bits in rec for those
		fields that update does NOT update, as well as the bits for
		those fields that update updates to become externally stored
		fields. Store the info: */

		entry = row_rec_to_index_entry(ROW_COPY_DATA, rec, index,
					       offsets, &n_ext, heap);
		n_ext += btr_push_update_extern_fields(entry, update, heap);
		/* The page containing the clustered index record
		corresponding to entry is latched in mtr. Thus the
		following call is safe. */
		row_upd_index_replace_new_col_vals(entry, index, update, heap);

		buf = mem_heap_alloc(heap, rec_get_converted_size(index, entry,
								  n_ext));

		*old_vers = rec_convert_dtuple_to_rec(buf, index,
						      entry, n_ext);
	} else {
		buf = mem_heap_alloc(heap, rec_offs_size(offsets));
		*old_vers = rec_copy(buf, rec, offsets);
		rec_offs_make_valid(*old_vers, index, offsets);
		row_upd_rec_in_place(*old_vers, index, offsets, update, NULL);
	}

	return(DB_SUCCESS);
}
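/* Illustrative sketch (not part of the original file): walking the
version chain of a clustered index record with repeated calls to
trx_undo_prev_version_build(), in the style of the row_vers code. The
latches required by the function, and the variables rec, mtr, index,
offsets, offsets_heap and the initial heap, are assumed to be set up
by the caller; the loop itself is hypothetical. */
#if 0
	version = rec;

	for (;;) {
		mem_heap_t*	heap2 = heap;

		heap = mem_heap_create(1024);

		err = trx_undo_prev_version_build(rec, mtr, version, index,
						  offsets, heap,
						  &prev_version);
		mem_heap_free(heap2);

		if (err != DB_SUCCESS || !prev_version) {
			/* The first inserted version was reached, or
			the history was truncated. */
			break;
		}

		offsets = rec_get_offsets(prev_version, index, offsets,
					  ULINT_UNDEFINED, &offsets_heap);

		/* ... examine the older version here ... */

		version = prev_version;
	}
#endif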
#endif /* !UNIV_HOTBACKUP */