self.watch_rest_for_alive(other_node) fails en masse during bootstrap in debug mode #477
Comments
Also, the 0.1-second sleep in scylla-ccm/ccmlib/scylla_node.py line 1378 (commit 65872f6) floods the log with messages. See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/245/artifact/logs-full.debug.094/dtest-gw0.log, where the same messages are repeated every 0.1 seconds for 360 seconds.
Logging I'll take care of; I'll disable it for those calls. Should the default timeout be bigger for debug mode? 10 min, as it used to be?
If @nyh thinks we still need it, then 10 minutes should be enough, at least for small clusters.
disable `urllib3.connectionpool` logging for this function; it does lots of retries and we don't need to see each request that is being sent out in the log. Ref: scylladb#477
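For illustration only, a minimal sketch of suppressing that logger with the standard logging module; the helper name quiet_urllib3 and the WARNING threshold are assumptions, not the actual scylla-ccm change:

    import logging
    from contextlib import contextmanager

    @contextmanager
    def quiet_urllib3(level=logging.WARNING):
        # Temporarily raise the level of urllib3's connection-pool logger so the
        # frequent retried REST requests are not echoed into the test log.
        logger = logging.getLogger('urllib3.connectionpool')
        saved_level = logger.level
        logger.setLevel(level)
        try:
            yield
        finally:
            logger.setLevel(saved_level)

    # Hypothetical usage around the polling call:
    # with quiet_urllib3():
    #     node.watch_rest_for_alive(other_node)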
…_alive` seems like the 360s default for debug for `watch_rest_for_alive` isn't enough as it was for `watch_log_for_alive`; making it 600s (10 min). Fix: scylladb#477
Maybe there shouldn't be any timeout at all, whatsoever? If the waiting code is not buggy, it will eventually stop waiting when the node becomes available. If the waiting code is buggy, the caller (dtest, Jenkins, etc.) will eventually stop on an overall timeout, so we're still safe. Alternatively, maybe the timeout should be made configurable. It can default to something (e.g., 10 minutes), but if the user really wants to wait 60 minutes for a node on a huge cluster to come up, they can configure the timeout to 60 minutes.
I think we can also make the timeout a linear function of the number of nodes in the cluster (at least for the first node in the loop; the rest can follow shortly after).
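As a rough sketch of that idea (the helper name and the base/per-node constants are made-up assumptions, not existing ccmlib API):

    def scaled_timeout(cluster, base=600, per_node=60):
        # Hypothetical helper: the first node may need the full base timeout,
        # and every node in the cluster adds a fixed margin on top of it.
        return base + per_node * len(cluster.nodes)

    # e.g. self.watch_rest_for_alive(node, timeout=scaled_timeout(self.cluster))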
Sigh, apparently 120 seconds in non-debug modes isn't enough and there's flakiness, see:
not only in the many-nodes test.
And it turns out that 600 seconds isn't enough either in debug mode if there are enough keyspaces and/or tables. We could extend the timeout indefinitely while there is bootstrap progress, like we do for the CQL listen message; maybe that is the way to go for test robustness.
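A minimal sketch of that approach, assuming two hypothetical callbacks (is_alive and progress_marker, e.g. the node's log position) rather than real ccmlib API; the deadline is pushed out whenever progress is observed, so we only give up after a period with no progress at all:

    import time

    def wait_while_progressing(is_alive, progress_marker, idle_timeout=600, poll_interval=1):
        # Wait until is_alive() returns True, extending the deadline whenever
        # progress_marker() changes; fail only after idle_timeout seconds during
        # which no bootstrap progress was observed.
        last = progress_marker()
        deadline = time.time() + idle_timeout
        while not is_alive():
            if time.time() > deadline:
                raise TimeoutError(f'no bootstrap progress for {idle_timeout}s')
            time.sleep(poll_interval)
            current = progress_marker()
            if current != last:
                last = current
                deadline = time.time() + idle_timeout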
@bhalevy I'm not sure what the suggestion is here exactly; this function is already waiting for bootstrap to end...
Yes, it's polling the REST API and the event it is waiting for doesn't happen until bootstrap is over. But come to think of it, I think we could just change the order of operations as follows:

    wait_timeout = timeout * 4 if timeout is not None else 420 if self.cluster.scylla_mode != 'debug' else 900
    if wait_other_notice:
        for node, _ in marks:
            node.watch_rest_for_alive(self, timeout=wait_timeout)
    if wait_for_binary_proto:
        from_mark = self.mark
        try:
            self.wait_for_binary_interface(from_mark=from_mark, process=self._process_scylla, timeout=wait_timeout)
        except TimeoutError as e:
            self.wait_for_starting(from_mark=self.mark, timeout=wait_timeout)
            self.wait_for_binary_interface(from_mark=from_mark, process=self._process_scylla, timeout=0)
    elif wait_other_notice:
        self.wait_for_starting(from_mark=self.mark, timeout=wait_timeout)
    if wait_other_notice:
        for node, _ in marks:
            self.watch_rest_for_alive(node, timeout=wait_timeout)
Why not do all the `watch_rest_for_alive` calls at the end? Why split it into two parts like that?
It's possible, although the other node would notice this node as up before it starts listening for CQL, but that shouldn't matter. However, if only wait_other_notice is set, we'd still need
Even though I don't know of any test that would do wait_other_notice without wait_for_binary_protocol ...
It should be ok to imply
See for example https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/245/testReport/rebuild_test/TestRebuild/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split094___test_rebuild_many_keyspaces/
It appears that it simply takes more than 360 seconds for the node to bootstrap in debug mode, so watch_rest_for_alive fails.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/245/artifact/logs-full.debug.094/1690100633894_rebuild_test.py%3A%3ATestRebuild%3A%3Atest_rebuild_many_keyspaces/node1.log
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/245/artifact/logs-full.debug.094/1690100633894_rebuild_test.py%3A%3ATestRebuild%3A%3Atest_rebuild_many_keyspaces/node2.log