Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
BUG#19579811: SET GTID_PURGED DOES NOT STOP WAIT_FOR_EXECUTED_GTID_SET
Background: WAIT_FOR_EXECUTED_GTID_SET waits until a specified set of GTIDs is included in GTID_EXECUTED. SET GTID_PURGED adds GTIDs to GTID_EXECUTED. RESET MASTER clears GTID_EXECUTED. There were multiple issues: 1. Problem: The change in GTID_EXECUTED implied by SET GTID_PURGED did not cause WAIT_FOR_EXECUTED_GTID_SET to stop waiting. Analysis: WAIT_FOR_EXECUTED_GTID_SET waits for a signal to be sent. But SET GTID_PURGED never sent the signal. Fix: Make GTID_PURGED send the signal. Changes: - sql/rpl_gtid_state.cc:Gtid_state::add_lost_gtids - sql/rpl_gtid_state.cc: removal of #ifdef HAVE_GTID_NEXT_LIST - sql/rpl_gtid.h: removal of #ifdef HAVE_GTID_NEXT_LIST 2. Problem: There was a race condition where WAIT_FOR_EXECUTED_GTID_SET could miss the signal from a commit and go into an infinite wait even if GTID_EXECUTED contains all the waited-for GTIDs. Analysis: In the bug, WAIT_FOR_EXECUTED_GTID_SET took a lock while taking a copy of the global state. Then it released the lock, analyzed the copy of the global state, and decided whether it should wait. But if the GTID to wait for was committed after the lock was released, WAIT_FOR_EXECUTED_GTID_SET would miss the signal and go to an infinite wait even if GTID_EXECUTED contains all the waited-for GTIDs. Fix: Refactor the code so that it holds the lock all the way from before it reads the global state until it goes to the wait. Changes: - sql/rpl_gtid_state.cc:Gtid_state::wait_for_gtid_set: Most of the changes in this function are to fix this bug. Note: When the bug existed, it was possible to create a test case for this by placing a debug sync point in the section where it does not hold the lock. However, after the bug has been fixed this section does not exist, so there is no way to test it deterministically. The bug would also cause the test to fail rarely, so a way to test this is to run the test case 1000 times. 3. Problem: The function would take global_sid_lock.wrlock every time it has to wait, and while holding it takes a copy of the entire gtid_executed (which implies allocating memory). This is not very optimal: it may process the entire set each time it waits, and it may wait once for each member of the set, so in the worst case it is O(N^2) where N is the size of the set. Fix: This is fixed by the same refactoring that fixes problem #2. In particular, it does not re-process the entire Gtid_set for each committed transaction. It only removes all intervals of gtid_executed for the current sidno from the remainder of the wait-for-set. Changes: - sql/rpl_gtid_set.cc: Add function remove_intervals_for_sidno. - sql/rpl_gtid_state.cc: Use remove_intervals_for_sidno and remove only intervals for the current sidno. Remove intervals incrementally in the innermost while loop, rather than recompute the entire set each iteration. 4. Problem: If the client that executes WAIT_FOR_EXECUTED_GTID_SET owns a GTID that is included in the set, then there is no chance for another thread to commit it, so it will wait forever. In effect, it deadlocks with itself. Fix: Detect the situation and generate an error. Changes: - sql/share/errmsg-utf8.txt: new error code ER_CANT_WAIT_FOR_EXECUTED_GTID_SET_WHILE_OWNING_A_GTID - sql/item_func.cc: check the condition and generate the new error 5. Various simplfications. - sql/item_func.cc:Item_wait_for_executed_gtid_set::val_int: - Pointless to set null_value when generating an error. - add DBUG_ENTER - Improve the prototype for Gtid_state::wait_for_gtid_set so that it takes a Gtid_set instead of a string, and also so that it requires global_sid_lock. - sql/rpl_gtid.h:Mutex_cond_array - combine wait functions into one and make it return bool - improve some comments - sql/rpl_gtid_set.cc:Gtid_set::remove_gno_intervals: - Optimize so that it returns early if this set becomes empty @mysql-test/extra/rpl_tests/rpl_wait_for_executed_gtid_set.inc - Move all wait_for_executed_gtid_set tests into mysql-test/suite/rpl/t/rpl_wait_for_executed_gtid_set.test @mysql-test/include/kill_wait_for_executed_gtid_set.inc @mysql-test/include/wait_for_wait_for_executed_gtid_set.inc - New auxiliary scripts. @mysql-test/include/rpl_init.inc - Document undocumented side effect. @mysql-test/suite/rpl/r/rpl_wait_for_executed_gtid_set.result - Update result file. @mysql-test/suite/rpl/t/rpl_wait_for_executed_gtid_set.test - Rewrote the test to improve coverage and cover all parts of this bug. @sql/item_func.cc - Add DBUG_ENTER - No point in setting null_value when generating an error. - Do the decoding from text to Gtid_set here rather than in Gtid_state. - Check for the new error ER_CANT_WAIT_FOR_EXECUTED_GTID_SET_WHILE_OWNING_A_GTID @sql/rpl_gtid.h - Simplify the Mutex_cond_array::wait functions in the following ways: - Make them one function since they share most code. This also allows calling the three-argument function with NULL as the last parameter, which simplifies the caller. - Make it return bool rather than 0/ETIME/ETIMEOUT, to make it more easy to use. - Make is_thd_killed private. - Add prototype for new Gtid_set::remove_intervals_for_sidno. - Add prototype for Gtid_state::wait_for_sidno. - Un-ifdef-out lock_sidnos/unlock_sidnos/broadcast_sidnos since we now need them. - Make wait_for_gtid_set return bool. @sql/rpl_gtid_mutex_cond_array.cc - Remove the now unused check_thd_killed. @sql/rpl_gtid_set.cc - Optimize Gtid_set::remove_gno_intervals, so that it returns early if the Interval list becomes empty. - Add Gtid_set::remove_intervals_for_sidno. This is just a wrapper around the already existing private member function Gtid_set::remove_gno_intervals. @sql/rpl_gtid_state.cc - Rewrite wait_for_gtid_set to fix problems 2 and 3. See code comment for details. - Factor out wait_for_sidno from wait_for_gtid. - Enable broadcast_sidnos/lock_sidnos/unlock_sidnos, which were ifdef'ed out. - Call broadcast_sidnos after updating the state, to fix issue #1. @sql/share/errmsg-utf8.txt - Add error message used to fix issue #4.
- Loading branch information