-
Notifications
You must be signed in to change notification settings - Fork 810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flaky test: "replication_set_should_not_change_if_ring_did_not_change,_when_num_of_instance_per_zone_is_inconsistent" #6513
Comments
After several experiments, I can reliably (8 out of 9 times) get this test to fail by running 4,000,000 iterations of it. I did this a number of ways. The fastest way is launching In one failure I looked at, the "new" replication set contains 2 instances that were NOT in the initial set, and is missing two instances that WERE in the initial set. I haven't looked at enough results to know if that's the normal way it fails. SPECULATION: After skimming the code under test, I don't see any use of concurrency, which rules out race conditions as a possible cause (unless I'm missing something in the ring code). Other possible causes of flakiness that I'm left pondering:
QUESTION: The name of this test is |
I observed what happens in I believe the culprit is this line in Line 482 in 4477add
Map iteration order is undefined so if there are overlapping tokens (i.e. a given token is owned by multiple instances), then which instance is mapped to an overlapping token is non-deterministic. In this test, I propose to fix this by iterating over a sorted slice of keys instead of the map directly. I will submit a patch for us to discuss. |
Signed-off-by: Daniel Sabsay <[email protected]>
Signed-off-by: Daniel Sabsay <[email protected]>
Hi @dsabsay, thanks for looking into this! I spent a couple of hours trying to figure out why this test would fail, but I'm not seeing how ordering the list of ingesters for creating a map of the tokens The test case is described below. "replication set should not change if ring did not change, when num of instance per zone is inconsistent": {
initialInstances: 11,
addedInstances: 0,
removedInstances: 0,
numZones: 3,
replicationFactor: 8,
numDiff: 0,
}, The test seems to be failing because there's an inconsistency between the two Get operations. They should be returning the same set of ingesters, even when the number of instances per zone is inconsistent. The reason the number of instances per zone is inconsistent is because there are 11 ingesters and 3 zones in the test. 11 / 3 is not a whole number (3.67), which means that one zone will have 3 ingesters while the other zones will have 4 ingesters (3 + 4 + 4 = 11). If there were 12 ingesters and 3 zones in the test, then there would be a consistent number of ingesters per zone (4 + 4 + 4). We know that both sets have a count of 8 instances since there are tests that asserts this. set, err := ring.Get(testValues[i], Write, bufDescs, bufHosts, bufZones)
assert.NoError(t, err)
assert.Equal(t, testData.replicationFactor, len(set.Instances)) newSet, err := ring.Get(testValues[i], Write, bufDescs, bufHosts, bufZones)
assert.NoError(t, err)
assert.Equal(t, testData.replicationFactor, len(newSet.Instances)) In between the two tests there are some operations that seem to have mutated the ring to cause the second set to produce a different set of ingesters. ring.ringTokens = ringDesc.GetTokens()
ring.ringTokensByZone = ringDesc.getTokensByZone()
ring.ringInstanceByToken = ringDesc.getTokensInfo()
ring.ringZones = getZones(ringDesc.getTokensByZone()) Our documentation indicates that this test case is proposing an unsupported setup.
We should remove the test cases where RF is greater than the number of zones. Perhaps we could add a condition in the ring creation that validates that when zone awareness is enabled that the RF should be <= number of zones. @alanprot @justinjung04, could you help take a look at this? |
Noticed build failure: https://github.com/cortexproject/cortex/actions/runs/12798522991/job/35682746078?pr=6511
The text was updated successfully, but these errors were encountered: