[Bug] sporadic segfaults from joblib tests in CircleCI #421
The relevant part of the CircleCI logs is:
The segfault involves the loky backend (the files in the crash report are loky's). Loky is the default backend used by joblib. Unfortunately, I cannot reproduce the segfault on my linux machine. Can we identify which of the tests is crashing?
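For context, a minimal sketch of the kind of call the joblib launcher makes through the loky backend (illustrative only, not the actual failing test):

```python
from math import sqrt

from joblib import Parallel, delayed

# loky is joblib's default process-based backend: each delayed call is
# dispatched to a worker process managed by a reusable loky executor.
results = Parallel(n_jobs=2, backend="loky")(
    delayed(sqrt)(i ** 2) for i in range(10)
)
print(results)
```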
It's difficult to reproduce the error even on CircleCI. I tried the following steps:
This ran for over an hour (25+ iterations) and unfortunately did not yield any failures. I have already used more than 60% of my CircleCI free-plan budget (I tried other things before). Any suggestions on what to do differently?
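If it reproduces at all, a brute-force rerun loop (locally or over SSH) may be cheaper than burning CircleCI credits; the plugin path and pytest flags below are assumptions for illustration, not commands from this thread:

```python
import subprocess
import sys

# Rerun the joblib plugin tests until one run exits non-zero; a segfault
# in a worker or in the main process shows up as a non-zero pytest exit code.
# "plugins/hydra_joblib_launcher" is an assumed path to the plugin's tests.
for i in range(50):
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-x", "-v", "plugins/hydra_joblib_launcher"]
    )
    if proc.returncode != 0:
        print(f"non-zero exit code {proc.returncode} on iteration {i}")
        break
```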
I was thinking of using ssh, but maybe this is more likely to happen on the first run only. You can change the noxfile to control what gets executed on CircleCI (for example, change the default to verbose and disable the jobs other than the one with the issue). Re: CircleCI:
About 20% of the top-level runs are crashing. (I'm not sure about the actual crash percentage: each top-level run represents 8 runs across different operating systems and Python versions, actually 6 runs considering Windows is skipped right now.)
Hello! I just saw your ping. This is weird behavior indeed, especially the fact that it only happens on CircleCI. I am interested in understanding why this is happening so we can fix it.
Hi Tom, thanks for your comment. General thoughts (not something that I have seen here specifically):
At the moment this is rather hard to reproduce. Hydra is pure Python; right now its code is not printing or dumping the stack trace, and I'm not sure how to do that from Python. This is probably better done interactively; can you join the Hydra chat? (I may be a bit sporadic today and tomorrow.)
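For reference, the standard-library faulthandler module can dump the Python traceback of every thread when the process receives a fatal signal; a minimal sketch:

```python
import faulthandler

# Dump Python tracebacks to stderr on SIGSEGV, SIGFPE, SIGABRT or SIGBUS.
faulthandler.enable()

# The same effect is available without code changes by setting the
# environment variable PYTHONFAULTHANDLER=1 or running
# `python -X faulthandler ...`.
```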
Yes, there is some cleanup logic. But as the threads hold references to the mutex they are waiting on, I don't think this is a case where the mutex got cleaned up. In addition, the traceback shows that the threads are still in the correct function, so I doubt this is the source of the segfault.
Yes, as it does not happen deterministically, it is most probably caused by some kind of race condition, with something being cleaned up at the wrong time. I think the most probable culprit is the memmap, which might be cleaned up concurrently (but it should not be...).
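One way to test the memmap hypothesis (my suggestion, not something proposed in the thread) is to disable joblib's automatic array memmapping and see whether the crashes stop:

```python
import numpy as np
from joblib import Parallel, delayed

data = np.random.rand(2_000_000)  # ~16 MB, above the default 1M memmap threshold

# max_nbytes=None disables automatic memmapping of large numpy arrays,
# so no temporary memmap files are created or cleaned up concurrently.
Parallel(n_jobs=2, backend="loky", max_nbytes=None)(
    delayed(np.sum)(data) for _ in range(4)
)
```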
Do you have more info on this?
In pure Python, you can cause a segfault using:

```python
import faulthandler
faulthandler._sigsegv()
```
After playing a bit with this, the segfault seems to happen in the main process and most probably in the main thread. I would need to know which test fails here. Isn't it possible to re-run the test suite in verbose mode?
I confirm that I could not reproduce locally either. I ran the tests for the plugins a couple of times with:
to enable the verbose mode in pytest. However, I am not sure how to tell nox / pytest to launch the joblib plugin tests only. If I run:
then I get failures such as:
Thanks @tomMoral. Hi @ogrisel! Answering your questions: This will still run the Hydra core tests though (which can be a bit slow).
The noxfile uses the plugin's setup.py to determine what to test it on. By the way, feel free to hack the noxfile if it helps you debug things (as long as we don't land it, do whatever you need). To run the tests directly, you need to install the plugin first. Unrelated to the joblib issue on 3.6, I ran into a new issue in a branch I will land now, which will give me another reason to block 3.6 for this plugin.
With 3.6, it means it's my recent changes. This is a cloudpickle issue with serializing annotated types.
@ogrisel, I didn't realize you are one of the maintainers of cloudpickle! This is awesome :). I said it's a cloudpickle issue, but really I didn't dig in; it could also be something else.
This might be related to this bug that was just fixed in master by @pierreglaser: cloudpipe/cloudpickle#347. That issue is about silently dropping type annotations on dynamic classes in Python 3.6. If you really need to pickle type annotations on dynamic functions or classes (without the silent drop), there is this PR, but it's quite complex and I don't have time to review it at the moment.
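For illustration, a minimal reproducer consistent with that issue description (the class below is hypothetical, not taken from the linked report):

```python
import pickle

import cloudpickle

# A class defined dynamically (e.g. in __main__ or inside a test body)
# that carries class-level type annotations.
class Config:
    host: str = "localhost"
    port: int = 8080

restored = pickle.loads(cloudpickle.dumps(Config))

# On Python 3.6, cloudpickle could silently drop __annotations__ when
# pickling such a class by value; on 3.7+ the annotations round-trip.
print(restored.__annotations__)
```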
Closing as we have a workaround (officially supporting only Python 3.7+ with the joblib launcher plugin).
Example:
https://app.circleci.com/jobs/github/facebookresearch/hydra/8035
@jan-matthis, can you take a look? This one is specifically on Linux.