Commit
37477: storage: Fix deadlock in TestStoreRangeMergeSlowWatcher r=andy-kimball a=andy-kimball

This test is occasionally flaking under heavy race stress in CI runs. Here is the probable sequence of events:

1. A<-B merge starts and Subsume request locks down B.
2. Watcher on B sends PushTxn request, which is intercepted by the TestingRequestFilter in the test.
3. The merge txn aborts due to interference with the replica GC, since it's concurrently reading range descriptors.
4. The merge txn is retried, but the Watcher on B is locking the range, so it aborts again.
5. Step 4 repeats until the allowPushTxn channel fills up (it has a capacity of 10).

This causes a deadlock because the merge txn can't continue. Meanwhile, the watcher is blocked waiting for the results of the PushTxn request, which gets blocked waiting for the merge txn.

The fix is to get rid of the arbitrarily limited channel size of 10 and use sync.Cond synchronization instead. Multiple retries of the merge txn will repeatedly signal the Cond, rather than fill up the channel.

One of the problems with the channel was that there can be an imbalance between the number of items sent to the channel (by merge txns) and the number of items received from the channel (by the watcher). This imbalance meant the channel gradually filled up until finally the right sequence of events caused deadlock.

Using a sync.Cond also fixes a race condition I saw several times, in which the merge transaction tries to send to the channel while it is being concurrently closed.

Release note: None

Co-authored-by: Andrew Kimball <[email protected]>