Monitoring can report to somebody else's monitoring hub #3287
Comments
Inspecting the artifacts (monitoring.db and log files), it looks like there is a PENDING status recorded in the database for two different run IDs. In
The run for this test is
That bad pending message is from a run around 6 seconds before. None of the other monitoring.db databases in this test run contain any other reference to that bad
so then the ZMQ layer would likely have sat around in the pytest process, continuing to try to asynchronously send a monitoring message - and eventually it would have found one. So there are two pieces to this issue: i) monitoring startup hung without failing the test. This is, in some part, a (mis?)feature of the original monitoring design, based I think on a lack of confidence in the monitoring code turning into "we can't let broken monitoring break 'real' parsl". ii) this is the monitoring version of htex issue #2199: components from one run can connect to the components from another run.
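To illustrate point ii), here is a minimal sketch of one possible guard: a receiver that compares the run ID carried by each message against its own and sets aside anything from a different run. This is not Parsl's monitoring code. The class `RunScopedReceiver`, the `run_id` message field, and the `run-A`/`run-B` values are all hypothetical names chosen for the example, and the sketch leaves out the ZMQ transport entirely.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RunScopedReceiver:
    """Hypothetical receiver that only accepts messages from its own run."""

    run_id: str
    accepted: List[Dict[str, Any]] = field(default_factory=list)
    rejected: List[Dict[str, Any]] = field(default_factory=list)

    def handle(self, message: Dict[str, Any]) -> bool:
        # Accept the message only if it carries this receiver's run ID
        # (the "run_id" key is an assumption made for this sketch).
        if message.get("run_id") == self.run_id:
            self.accepted.append(message)
            return True
        # Keep foreign messages separately so the mismatch is visible,
        # rather than recording them as if they belonged to this run.
        self.rejected.append(message)
        return False


receiver = RunScopedReceiver(run_id="run-B")

# A late message from an earlier run, and one from the current run.
stale = {"run_id": "run-A", "status": "PENDING"}
current = {"run_id": "run-B", "status": "PENDING"}

results = [receiver.handle(stale), receiver.handle(current)]
```

A check like this would only stop a stray message from being recorded under the wrong run; it would not stop components from different runs connecting to each other in the first place.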
The reason that the monitoring router did not continue to log stuff is not because of a hang, but because this 3fde7... run is a test of killing monitoring components - so it killed the monitoring router deliberately. So this test is successfully detecting (although presenting it in a very awkward way) some misbehaviour around there.
A more principled fix of #3287 looks to be much more deeply invasive, and involves a big rework of how ZMQ is used.
Describe the bug
I saw this in a superficially-unrelated PR #3286