Reduce load on drain integration test to reduce race-condition failur…

…es (#3640) This PR reduces the load places on the interchange and on the whole test environment caused by repeatedly querying the interchange for connected managers. It does this by increasing the period between such requests, from the default, every 20ms, to every 100ms. In the last few days, test_drain.py began failing often. I have seen it occasionally fail before. This was initially a problem in PR #3639 which is unrelated, but I recreated the problem in CI against master as of #3627. I investigated and found this behaviour causing the failure: * test_drain configures the drain period to be 1 second * startup of a worker pool was taking more than 1 second * the worker pool enters drain state, drains and terminates immediately on being fully started up. * test_py fails its assumption that there is a worker pool to inspect after waiting for there to be worker pool to inspect. This is the race condition: the assertion at line 57 is true but line 59 returns an empty managers list. ``` 57 try_assert(lambda: len(htex.connected_managers()) == 1, check_period_ms=CONNECTED_MANAGERS_POLL_MS) 58 59 managers = htex.connected_managers() 60 assert managers[0]['active'], "The manager should be active" ``` Looking at the CI logs for a failing case, I saw direct evidence that the worker pool takes more than 1 second to start up in `manager.log`: ``` 2024-10-18 10:31:16.007 parsl:914 29414 MainThread [INFO] Python version: 3.12.7 (main, Oct 1 2024, 15:17:50) [GCC 9.4.0] [...] 2024-10-18 10:31:16.008 parsl:151 29414 MainThread [INFO] Manager initializing [this is where the worker start time for drain purposes is measured] [...] 2024-10-18 10:31:16.011 parsl:183 29414 MainThread [INFO] Manager connected to interchange 2024-10-18 10:31:17.058 parsl:243 29414 MainThread [INFO] Will request drain at 1729247477.0087547 [...] 2024-10-18 10:31:17.073 parsl:336 29414 Task-Puller [DEBUG] ready workers: 0, pending tasks: 0 ``` There's more than a second delay between "... connected to interchange" and the subsequent message "Will request drain". There's not a huge amount of stuff happening between these lines but there are things like multiprocessing initialization which starts a new process. It looks like this bit of code is slow even in the successful case: rerunning until success, I see this timing in CI: ``` 2024-10-18 12:11:17.475 parsl:183 23062 MainThread [INFO] Manager connected to interchange 2024-10-18 12:11:18.181 parsl:243 23062 MainThread [INFO] Will request drain at 1729253478.4731379 ``` which is still a large fraction of a second (but sufficiently less than a second for the test to pass). I haven't investigated what is taking that time. I haven't investigated if I also see that on my laptop. I hypothesised that a lot of these test failures come from the test environment being quite loaded. I'm especially suspicious of using `try_assert` with its default timings which are very tight (20ms) - the connected managers RPC here would be expected to run much less frequently, more like every 5 seconds in regular Parsl use. So I lengthed the period of the try_asserts in this test, to try to reduce load caused there. That makes the test pass repeatedly again. Things not investigated/debugged: * why this is taking >0.5 second even in the successful case - it's possible that this is a reasonable startup time and so the test might be lengthened by a few seconds * how to do a test without being timing reliant - draining is, by its very nature, reliant on the passage of "real time". For example, you might mock (at the libc level if not at the Python level) system time. * what other loads are present on the system - one of the points of slowly-ongoing PR #3397 shutdown tidyup is to reduce thread load left behind by earlier tests
Parsl · Oct 19, 2024 · 973ee48 · 973ee48
1 parent 29f960f
commit 973ee48
Showing 1 changed file with 5 additions and 3 deletions.
diff --git a/parsl/tests/test_htex/test_drain.py b/parsl/tests/test_htex/test_drain.py
@@ -15,6 +15,8 @@
 # last twice that many number of seconds.
 TIME_CONST = 1
 
+CONNECTED_MANAGERS_POLL_MS = 100
+
 
 def local_config():
     return Config(
@@ -52,7 +54,7 @@ def test_drain(try_assert):
 
     # wait till we have a block running...
 
-    try_assert(lambda: len(htex.connected_managers()) == 1)
+    try_assert(lambda: len(htex.connected_managers()) == 1, check_period_ms=CONNECTED_MANAGERS_POLL_MS)
 
     managers = htex.connected_managers()
     assert managers[0]['active'], "The manager should be active"
@@ -63,7 +65,7 @@ def test_drain(try_assert):
     time.sleep(TIME_CONST)
 
     # this assert should happen *very fast* after the above delay...
-    try_assert(lambda: htex.connected_managers()[0]['draining'], timeout_ms=500)
+    try_assert(lambda: htex.connected_managers()[0]['draining'], timeout_ms=500, check_period_ms=CONNECTED_MANAGERS_POLL_MS)
 
     # and the test task should still be running...
     assert not fut.done(), "The test task should still be running"
@@ -76,4 +78,4 @@ def test_drain(try_assert):
     # connected managers.
     # As with the above draining assert, this should happen very fast after
     # the task ends.
-    try_assert(lambda: len(htex.connected_managers()) == 0, timeout_ms=500)
+    try_assert(lambda: len(htex.connected_managers()) == 0, timeout_ms=500, check_period_ms=CONNECTED_MANAGERS_POLL_MS)