
[Bug] sporadic segfaults from joblib tests in CircleCI #421

Closed

omry opened this issue Feb 16, 2020 · 15 comments
Labels
bug Something isn't working

Comments

@omry
Collaborator

omry commented Feb 16, 2020

Example:

https://app.circleci.com/jobs/github/facebookresearch/hydra/8035

@jan-matthis, can you take a look? This one is specifically on Linux.

  1. I think we need to report this to joblib.
  2. I suspect threading (even before looking at the stack trace). We might want to disable the threading test temporarily (see the sketch below).
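For illustration, skipping it could be as simple as marking the test with pytest.mark.skip (a sketch; the test name below is hypothetical, the real tests live in tests/test_joblib_launcher.py):

import pytest

# Hypothetical test name, for illustration only.
@pytest.mark.skip(reason="sporadic segfaults on CI, see #421")
def test_joblib_launcher_threading_backend() -> None:
    ...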
@omry omry added the bug Something isn't working label Feb 16, 2020
@jan-matthis
Contributor

The relevant part of the CircleCI logs is:

tests/test_joblib_launcher.py .......Fatal Python error: Segmentation fault

Thread 0x00007fba5687d700 (most recent call first):
  File "/usr/local/lib/python3.6/threading.py", line 295 in wait
  File "/home/circleci/project/.nox/test_plugins-3-6-pip-install/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 138 in _feed
  File "/usr/local/lib/python3.6/threading.py", line 864 in run
  File "/usr/local/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/local/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fba5607c700 (most recent call first):
  File "/usr/local/lib/python3.6/selectors.py", line 376 in select
  File "/usr/local/lib/python3.6/multiprocessing/connection.py", line 911 in wait
  File "/home/circleci/project/.nox/test_plugins-3-6-pip-install/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 615 in _queue_management_worker
  File "/usr/local/lib/python3.6/threading.py", line 864 in run
  File "/usr/local/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/local/lib/python3.6/threading.py", line 884 in _bootstrap

The segfault involves the loky backend, namely the files:

  • joblib/externals/loky/backend/queues.py (L138)
  • joblib/externals/loky/process_executor.py (L615)

Loky is the default backend used by joblib.Parallel. It spawns workers in separate processes and uses threads to manage its queues. Since loky is involved in the segfault, disabling the test that uses the threading backend (workers in threads) probably won't help as a fix.
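For context, here is a minimal sketch of how the two backends are selected with joblib.Parallel (illustrative only, not the actual launcher code):

from joblib import Parallel, delayed

def square(x):
    return x * x

# loky (the default): workers run in separate processes; queue management
# happens in helper threads like the ones visible in the traceback above.
loky_results = Parallel(n_jobs=2, backend="loky")(delayed(square)(i) for i in range(8))

# threading: workers run as threads inside the current process.
thread_results = Parallel(n_jobs=2, backend="threading")(delayed(square)(i) for i in range(8))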

Unfortunately, I cannot reproduce the segfault on my Linux machine. Can we identify which of the tests in LauncherTestSuite (or, perhaps, IntegrationTestSuite) is the one that fails exactly? We could then try to set up a minimal isolated example running on CircleCI.

@jan-matthis
Contributor

jan-matthis commented Feb 16, 2020

It's difficult to reproduce the error even on CircleCI.

I tried the following steps:

  1. Forked the repository, enabled CircleCI builds, waited for a py36_linux run to finish (was a successful one)
  2. Re-ran with SSH enabled, and connected to the virtual machine (see Debugging with SSH)
  3. In project/:
for i in `seq 100`; do SILENT=0 nox -s "test_plugins-3.6(pip install -e)"; done > nox_out.log 2>&1

Note that SILENT=0 enables verbose output for pytest, in the hope of tracking exactly which test case is failing when a failure occurs.

  4. Opened a second SSH session, in project/:
watch -n 5 "cat nox_out.log | grep failed";

This ran for over an hour (25+ iterations) and unfortunately did not yield any failures. I have already used more than 60% of my CircleCI free-plan budget (I tried other things before).

Any suggestions on what to do differently?

@omry
Collaborator Author

omry commented Feb 16, 2020

I was thinking of using SSH, but maybe this is more likely to happen only on the first run.

You can change the noxfile to control what gets executed on CircleCI (for example, change the default to verbose and disable jobs other than the one with the issue).
Look at the top of the noxfile: you need to use VERBOSE=1, not SILENT=0.

Re: CircleCI
I will reach out to them to ask if they see a solution here.

@omry
Collaborator Author

omry commented Feb 16, 2020

@ogrisel, @tomMoral:
We are still trying to narrow it down, but have you seen anything similar?

@omry
Collaborator Author

omry commented Feb 17, 2020

About 20% of the top-level runs are crashing (I'm not sure about the actual crash percentage; each top-level run represents 8 runs across different operating systems and Python versions, actually 6 runs considering Windows is skipped right now).
One crash is on Linux + 3.6 and the other on Mac + 3.6.
I will disable 3.6 for this plugin and retest to see if things are stable on 3.7 and above.

(Screenshot: CircleCI run results, 2020-02-17 10:50)

@omry
Collaborator Author

omry commented Feb 17, 2020

15/15 passes with Python >= 3.7.
I am going to land the fixes but I am disappointed that joblib is not stable on Python 3.6.

(Screenshot: CircleCI run results, 2020-02-17 11:28)

@tomMoral

Hello!

I just saw your ping. This is weird behavior indeed, especially the fact that it only happens on 3.6 and not 3.7...
I don't think this is directly related to the lines that appear in the traceback, as these two threads are only I/O threads waiting for communication with the main thread or the child processes. The segfault is probably either in one of the child processes (but it is weird that it is detected in the main process) or in the main process itself.

I am interested in understanding why this is happening so we can fix it in loky. If you have a bit of time to help me delve into the code, that would be very helpful. One thing that I don't understand is how this segfault is detected. Could you point me to the part of the code that prints Fatal Python error: Segmentation fault and calls dump_traceback?

@omry
Collaborator Author

omry commented Feb 18, 2020

Hi Tom, thanks for your comment.
I'd be happy to work with you on trying to get to the bottom of this.

General thoughts (not something that I have seen here specifically):
I have seen segfaults in waits happen when the mutex itself goes out of scope; this typically happens when the process is exiting while the thread is still waiting.

  • Do you have any cleanup logic that tries to unblock the waiting threads on exit?
  • Maybe there is a race condition in it, or maybe it's not always triggering for some exit conditions?

At the moment this is rather hard to reproduce.
I did see an Access violation while testing on Windows, which I suspect is the same issue (for now we have also disabled Windows as a supported platform for this plugin; maybe once you fix this we can re-enable it).

Hydra is pure Python; at the moment its code is not printing or dumping the stack trace.
From the error it looks to me like this is happening in Python by default (maybe for all threads).
You can try to trigger a segmentation fault in your thread somehow and see if it's printed.
(in C it would be as easy as

char *c = 0;
*c = 1;

but I am not sure how to do that from Python.)

This is probably better done interactively. Can you join the Hydra chat? (I may be a bit sporadic today and tomorrow.)

@tomMoral

tomMoral commented Feb 19, 2020

* Do you have any cleanup logic that tries to unblock the waiting threads on exit?

Yes, there is some cleanup logic. But since the threads hold references to the mutex they are waiting on, I don't think this is a case where the mutex got cleaned up. In addition, the traceback shows that the threads are still in the correct functions, so I doubt this is the source of the segfault.

* Maybe there is a race condition in it, or maybe it's not always triggering for some exit conditions?

Yes, as it does not happen deterministically, it is most probably caused by some kind of race condition: something being cleaned up at the wrong time. I think the most promising place to look is around the memmap, which might be cleaned up concurrently (but it should not be...).

I did see Access violation while testing on Windows which I suspect is the same issue.

Do you have more info on this?

Hydra is pure Python; at the moment its code is not printing or dumping the stack trace.
From the error it looks to me like this is happening in Python by default (maybe for all threads).
You can try to trigger a segmentation fault in your thread somehow and see if it's printed.
(in C it would be as easy as

char *c = 0;
*c = 1;

In pure Python, you can cause a segfault using faulthandler

import faulthandler
faulthandler._sigsegv()

This does not dump the traceback or print such a message in my Python 3.7 console. The traceback is something that comes from faulthandler.dump_traceback, so I am curious to see where it is called, to better understand what we are seeing.
My bad, it comes from Python indeed: see https://docs.python.org/3/library/faulthandler.html#example.
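A quick way to see the same kind of per-thread dump locally (a sketch using faulthandler's private _sigsegv helper, for experimentation only):

import faulthandler
import threading

faulthandler.enable()  # installs the handler that prints "Fatal Python error: Segmentation fault"

# Deliberately trigger a real SIGSEGV from a worker thread; with the handler
# enabled, Python dumps the tracebacks of all threads, as in the CI log above.
t = threading.Thread(target=faulthandler._sigsegv)
t.start()  # the process crashes here and the dump is printed
t.join()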

@tomMoral

After playing a bit with this, the segfault seems to happen in the main process and most probably in the main thread. I would need to know which test fails here.

Isn't it possible to re-run the test suite with pytest using the -v option so we can see the name of the test when the segfault happens?

@ogrisel

ogrisel commented Feb 19, 2020

I confirm that I could not reproduce locally either. I ran the tests for the plugins a couple of times with:

$ nox -k plugins -- -v

to enable verbose mode in pytest. However, I am not sure how to tell nox / pytest to launch only the joblib plugin tests.

If I run:

$ pip install .
$ pytest -vl plugins/hydra_joblib_launcher/

then I get failures such as:

hydra.errors.MissingConfigException: Could not load hydra/launcher/joblib.yaml, available options:
hydra/launcher:
	basic

@omry
Collaborator Author

omry commented Feb 19, 2020

Thanks @tomMoral.
For your question about Windows, I don't think it's a different issue. I have seen it several times while debugging a different problem on Windows, but it's not any easier to reproduce.
Most of the repros were in CircleCI (this makes sense: each push triggers 4 to 6 runs, and if one fails the whole push fails CI).
I think it's happening at around 1% or 2% of the test runs.
I was not yet able to reproduce it locally on demand.

Hi @ogrisel!
Please join the Hydra chat, we can have a group discussion which will be more interactive there.

Answering your questions:
To test just this plugin, you can use PLUGINS=hydra_joblib_launcher nox -s "test_plugins-3.6(pip install)"

This will still run the Hydra core tests though (which can be a bit slow).
If you run it, you will probably see:

$ PLUGINS=hydra_joblib_launcher nox -s "test_plugins-3.6(pip install)"
nox > Running session test_plugins-3.6(pip install)
...
nox > Not testing hydra_joblib_launcher on Python 3.6, supports [3.7,3.8]

The noxfile uses the setup.py of the plugin to determine which Python versions to test it on.
You can fix this by adding Python 3.6 to the setup.py of the plugin (plugins/hydra_joblib_launcher/setup.py), as sketched below.
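Something along these lines (a sketch, assuming the noxfile derives the supported versions from the Python classifiers; not the real file):

# plugins/hydra_joblib_launcher/setup.py (illustrative excerpt)
from setuptools import find_namespace_packages, setup

setup(
    name="hydra-joblib-launcher",
    version="0.1.0",  # placeholder version
    packages=find_namespace_packages(include=["hydra_plugins.*"]),
    classifiers=[
        "Programming Language :: Python :: 3.6",  # added so nox also tests the plugin on 3.6
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
    ],
    install_requires=["hydra-core", "joblib"],
)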

By the way, feel free to hack the noxfile if it helps you debug things. (as long as we don't land it, do whatever you need).

To run the tests directly, you need to install the plugin first.
Try pip install -e plugins/hydra_joblib_launcher.
After that, you can run the tests directly: pytest plugins/hydra_joblib_launcher/

Unrelated to the joblib issue on 3.6, I ran into a new issue in a branch I am going to land, which gives me another reason to block 3.6 for this plugin.
This is probably a week or so from landing. If you start seeing errors about:

_pickle.PicklingError: Can't pickle typing.Union[str, NoneType]: it's not the same object as typing.Union

With 3.6, it means it's due to my recent changes. This is a cloudpickle issue with serializing annotated types.
If this happens, I will tell you how to make the plugin compatible with 3.6 again (probably by creating a temporary branch that is compatible with 3.6).
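For reference, a minimal sketch of the kind of code that can hit this on 3.6 (hypothetical, not taken from Hydra; whether it fails depends on the cloudpickle version):

import typing

import cloudpickle

# A function with an Optional (i.e. typing.Union[str, NoneType]) annotation;
# on Python 3.6 with older cloudpickle versions, serializing such annotations
# by value could fail with "it's not the same object as typing.Union".
def resolve(name: typing.Optional[str] = None) -> typing.Optional[str]:
    return name

payload = cloudpickle.dumps(resolve)
print(len(payload))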

@omry
Collaborator Author

omry commented Feb 19, 2020

@ogrisel, I didn't realize you are one of the maintainers of cloudpickle! This is awesome :).
You might be able to help with this issue I am talking about. I will have more information when I land something in about a week.

I said it's a cloudpickle issue, but I didn't really dig in. It could also be something else.

@ogrisel

ogrisel commented Feb 20, 2020

Unrelated to the joblib issue on 3.6, I ran into a new issue in a branch I am going to land, which gives me another reason to block 3.6 for this plugin.
This is probably a week or so from landing. If you start seeing errors about:
_pickle.PicklingError: Can't pickle typing.Union[str, NoneType]: it's not the same object as typing.Union

This might be related to this bug that was just fixed in master by @pierreglaser: cloudpipe/cloudpickle#347

This is about type annotations on dynamic classes being silently dropped in Python 3.6. If you really need to pickle type annotations on dynamic functions or classes (without silently dropping them), there is this PR, but it's quite complex and I don't have time to review it at the moment.

@omry
Collaborator Author

omry commented May 13, 2020

Closing as we have a workaround (officially supporting only 3.7+ with the joblib launcher plugin).
This is an acceptable workaround for now; if people insist on using it with 3.6, I can reopen.

@omry omry closed this as completed May 13, 2020