
chore(robot-server): Work around tests hanging because server shutdown is hanging #16317

Merged 2 commits into edge from work_around_hanging_reset_test on Sep 20, 2024

Conversation

@SyntaxColoring (Contributor) commented Sep 20, 2024

Overview

This tries to work around robot-server tests hanging on edge's CI checks.

Test Plan and Hands on Testing

See if CI passes now.

Details

It seems like server shutdown hangs when there is an ongoing analysis. I think this is the same problem that @sanni-t identified in #16171. Unfortunately, simply cherry-picking #16171 into edge does not solve the problem.

The last lines from server shutdown:

2024-09-20 15:40:13,172 robot_server.protocols.protocol_analyzer WARNING [Line 154] Analyzer is no longer in use but orchestrator is busy. Cannot stop the orchestrator currently.
2024-09-20 15:40:13,172 uvicorn.error INFO [Line 76] Application shutdown complete.
2024-09-20 15:40:13,172 uvicorn.error INFO [Line 87] Finished server process [2836]
2024-09-20 15:40:13,173 robot_server.service.notifications.publisher_notifier ERROR [Line 55] PublisherNotifer notify task failed
Traceback (most recent call last):
  File "/home/runner/work/opentrons/opentrons/robot-server/robot_server/service/notifications/publisher_notifier.py", line 46, in _wait_for_event
    await self._change_notifier.wait()
  File "/home/runner/work/opentrons/opentrons/api/src/opentrons/util/change_notifier.py", line 18, in wait
    await self._event.wait()
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError
/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/asyncio/base_events.py:674: RuntimeWarning: coroutine 'LegacyContextPlugin._dispatch_action_list' was never awaited
  self._ready.clear()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
[hang]

I think resolving this properly will involve merging chore_release-8.0.0 into edge and then rethinking how ProtocolAnalyzer+RunOrchestrator+ProtocolEngine get cleaned up.

The workaround in this PR is to make the tests wait for all analyses to complete before restarting the server.
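
For illustration only, here is a minimal sketch of what "wait for all analyses to complete before restarting" could look like in an integration test. It assumes a dev robot-server reachable over HTTP whose GET /protocols response lists analysisSummaries with a status field; the helper name, timeout, and polling details are hypothetical and are not the actual code in this PR.

# Hypothetical sketch, not the helper added in this PR.
# Assumes GET /protocols returns {"data": [...]} where each protocol has
# "analysisSummaries", each with a "status" such as "pending" or "completed".
import time

import requests


def wait_for_all_analyses_to_complete(base_url: str, timeout_s: float = 30.0) -> None:
    """Poll the server until no protocol analysis is still pending."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        protocols = requests.get(
            f"{base_url}/protocols", headers={"Opentrons-Version": "*"}
        ).json()["data"]
        still_pending = any(
            summary["status"] != "completed"
            for protocol in protocols
            for summary in protocol["analysisSummaries"]
        )
        if not still_pending:
            return
        time.sleep(0.1)
    raise TimeoutError("Analyses did not complete before the timeout.")

A test would call something like wait_for_all_analyses_to_complete("http://localhost:31950") before triggering the server restart, so that shutdown never races an in-progress analysis.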

Review requests

Does this workaround make sense for now?

Risk assessment

No risk. Changes are just to tests.

@SyntaxColoring changed the title from "chore(robot-server): Work around test_reset.py hanging because server shutdown is hanging" to "chore(robot-server): Work around tests hanging because server shutdown is hanging" on Sep 20, 2024
@sanni-t (Member) commented Sep 20, 2024

It's interesting that cherry-picking #16171 didn't fix the problem, which suggests this might be a different issue.
You mention in the description that we need to wait for analysis to finish in order to fix this, but the issue introduced and fixed in #16171 was due to the run orchestrator not cleaning up after the analysis was complete. So I think we first need to understand exactly what is causing this problem. Have you been able to narrow down the test(s) causing the hangup?

@SyntaxColoring (Contributor, Author) commented Sep 20, 2024

Have you been able to narrow down the test(s) that is causing the hangup?

It's test_reset.py in CI, but I can't reproduce the hang locally, even when using act to emulate a GitHub Actions environment. So it feels dependent on machine characteristics like core count and CPU power.

@TamarZanzouri (Contributor) left a comment

Nice! Thank you for fixing this!

@SyntaxColoring (Contributor, Author) commented

I'm going to merge this now because:

  • It's zero-risk (changes are to tests only)
  • There is no clear at-fault PR for us to revert
  • The underlying problem and a proper fix will likely take days to investigate
  • I don't want robot-server tests failing in edge for that long

I agree that we should understand what's actually going on here, and we can continue investigation in EXEC-716.

@SyntaxColoring merged commit ce9ac0e into edge on Sep 20, 2024
7 checks passed
@SyntaxColoring deleted the work_around_hanging_reset_test branch on September 20, 2024 at 19:06
@sanni-t (Member) commented Sep 20, 2024

Sounds good! Thanks for investigating this!

@sanni-t (Member) commented Sep 20, 2024

Just remembered that @y3rsh had a PR up to force-kill uvicorn processes if a regular shutdown failed. Do you think it's worth adding that as well, or does it not matter since the CI flow will be terminated after test cancellation anyway? It might be a good option if we just want the tests to pass right now and the "wait for analyses to complete" workaround turns out to take a really long time.

@SyntaxColoring (Contributor, Author) commented

Ooh.

It's reasonable to auto-kill the process after a timeout, but I think it should count as a test failure, not as a success. Basically, it'd turn the GitHub-Action-level timeout into a test-specific timeout.

I wouldn't want the auto-kill to accidentally cover up other shutdown hang bugs.
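
As a rough sketch of that idea (not code from this PR or from @y3rsh's PR), a test helper could terminate the uvicorn dev-server process, and if graceful shutdown doesn't finish within a timeout, force-kill it and report a failure instead of letting CI hang. The function name and timeout below are made up for illustration.

# Hypothetical sketch: turn a shutdown hang into a test failure rather than a silent force-kill.
import subprocess

import pytest


def stop_dev_server_or_fail(server: subprocess.Popen, timeout_s: float = 20.0) -> None:
    """Stop the dev server; treat a hung shutdown as a test failure, not a success."""
    server.terminate()  # Ask uvicorn to shut down gracefully.
    try:
        server.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        server.kill()  # Force-kill so the CI job itself does not hang.
        server.wait()
        pytest.fail(
            f"Server did not shut down within {timeout_s}s and was force-killed; "
            "this probably indicates a shutdown hang bug."
        )

That way the force-kill keeps CI moving but still surfaces shutdown hangs as failures rather than quietly covering them up.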
