Commit
roachtest: fix bank/cluster-recovery
Informs cockroachdb#38785.

The "stalls" detected boil down to the time between chaos monkey iterations. Within each chaos monkey iteration we do the following:
- Lock all clients, in sequence
- Restart each node
- Unlock all clients
- Sleep until at least one client has made progress

Given our stall timeout is only 30s, we have just that long to go through all of the above. In each client we lock around the UPDATE query so as to not be interrupted. The problem is that every now and then these UPDATE queries take a lot longer than a few milliseconds. This is expected behaviour, primarily due to txnwait procedures and having to wait for the expiration of an extant contending txn. More importantly, it's not what we're testing here, as the clients are still making progress.

Given the chaos monkey first locks each client, it has to drain out these requests, which eats into the 30s or so we have for each chaos monkey iteration. This is made worse by the fact that we do this in sequence for each client. When we're unlucky, we run into this particular convoy situation, we're unable to finish the round in time, and a "stall" is detected.

We should really only be interested in how long it takes for the chaos monkey to restart a set of nodes, and in ensuring that clients are still making progress after it does. We already have statement timeouts for the UPDATE queries that fail if we take "too long". Removing the stopClients apparatus gives us what we need.

Release justification: Category 1: Non-production code changes

Release note: None