Set ZMQ receive timeouts in the interchange to prevent blocking threads #2625

Closed
wants to merge 1 commit into from


yadudoc
Member

@yadudoc yadudoc commented Mar 13, 2023

Description

Without RCVTIMEO set, the task_puller thread and _command_server on the interchange block indefinitely and do not exit when the kill_event is set. Because the recv_pyobj call blocks in C code, it tends to ignore signals.

This is rolling back changes from this PR: #2438
While we do not wait for the kill event, we still need the thread to exit from the ZMQ context, where it seems to ignore signals.
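
For readers unfamiliar with the pattern, here is a minimal sketch of a receive loop that uses a receive timeout so it can re-check a kill event. It is not the interchange code: the PULL socket type, the endpoint, the 100 ms timeout, and the `run_demo` helper are all assumptions made for illustration, and it requires pyzmq.

```python
import threading

import zmq


def task_puller(endpoint, kill_event, received, timeout_ms=100):
    """Receive Python objects until kill_event is set.

    Hypothetical stand-in for a puller thread; not Parsl's implementation.
    """
    context = zmq.Context()
    sock = context.socket(zmq.PULL)  # socket type is an assumption
    # With RCVTIMEO set, a blocking recv raises zmq.Again after timeout_ms
    # instead of waiting forever, so the loop can re-check kill_event.
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)
    sock.setsockopt(zmq.LINGER, 0)  # do not hold up close() on shutdown
    sock.connect(endpoint)
    try:
        while not kill_event.is_set():
            try:
                received.append(sock.recv_pyobj())
            except zmq.Again:
                continue  # timed out with no message; check kill_event again
    finally:
        sock.close()
        context.term()


def run_demo():
    """Start the puller with no sender, request shutdown, report whether it exited."""
    kill_event = threading.Event()
    received = []
    thread = threading.Thread(
        target=task_puller,
        # Hypothetical endpoint; nothing needs to be listening on it.
        args=("tcp://127.0.0.1:59999", kill_event, received),
    )
    thread.start()
    kill_event.set()
    thread.join(timeout=5)
    return not thread.is_alive()


if __name__ == "__main__":
    print("puller exited:", run_demo())
```

Without the `RCVTIMEO` line, the same loop would stay inside `recv_pyobj` when no message arrives, and setting the event would have no effect.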

Currently, this is blocking/causing hangs on funcX tests wherever we use HTEX.

Type of change

Choose which options apply, and delete the ones which do not apply.

  • Bug fix (non-breaking change that fixes an issue)

… HTEX.interchange

Without the rcvtimeout the task_puller thread and _command_server block indefinitely
and do not break when the kill_event is set. Since the recv_pyobj call is in c-land
it has a tendency to ignore signals.
@yadudoc yadudoc requested a review from benclifford March 13, 2023 23:11
@benclifford
Collaborator

can you add some context/details for reproducing the problem? this is exiting fine with signals in other cases, I think, so I'd like to understand what is different in your test setup.

@benclifford
Collaborator

I talked to Kevin who is working on funcX for more context about this.

My understanding is this is to address an issue with the test suite hanging which appears in globus/globus-compute#1057

I have replicated what I think is the same hang on my laptop, and investigated it.
However this PR does not fix that hang on my laptop.

The hang on my laptop comes from other parts of the funcX test suite installing a different handler for SIGTERM, and then the htex interchange inheriting that handler. This is bad behaviour by both the funcx test suite and the htex interchange. I opened an issue for the htex side of things - #2628
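
As a rough illustration of the inheritance problem described above, the sketch below installs a custom SIGTERM handler (standing in for what a test suite might do) and then resets it to the default, which is one way a child process could protect itself at startup. The function names are hypothetical, everything happens in one process for simplicity, and this is not necessarily how #2628 was resolved.

```python
import signal


def reset_sigterm_to_default():
    """Restore default SIGTERM handling; must run in the main thread."""
    signal.signal(signal.SIGTERM, signal.SIG_DFL)


def demo():
    """Install a custom handler, then reset it, and report both observations."""

    def parent_handler(signum, frame):
        # Stand-in for a handler installed by surrounding test code.
        pass

    previous = signal.signal(signal.SIGTERM, parent_handler)
    try:
        # A forked child would start with this same handler in place.
        inherited = signal.getsignal(signal.SIGTERM) is parent_handler
        reset_sigterm_to_default()
        restored = signal.getsignal(signal.SIGTERM) == signal.SIG_DFL
        return inherited and restored
    finally:
        # Put back whatever was installed before the demo ran.
        signal.signal(
            signal.SIGTERM,
            previous if previous is not None else signal.SIG_DFL,
        )


if __name__ == "__main__":
    print("handler replaced and reset:", demo())
```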

If this current PR #2625 does fix things for you, I'm interested in further investigating. If it turns out that it was because you also had a signal handler override hacked into your interchange, outside of this PR, then probably this PR can be closed without merge.

Collaborator

@benclifford benclifford left a comment


@benclifford
Collaborator

changing this to draft because I think it can probably be closed but this can wait for yadu to look at later

@benclifford benclifford marked this pull request as draft March 20, 2023 17:27
@benclifford
Collaborator

closing this unmerged -- I talked with @yadudoc and it looks like already-merged #2629 fixes this as I hoped.
