Error in `TInterpreter::Calc` with no output stack in seemingly random distributed RDF test execution #11515
Comments
The trace of this thread looks quite similar to the one at #8365 (comment), but it is probably just an unrelated issue.
For completeness, what are the complete stack traces for threads 1 and 21?
Complete stacktrace of thread 1
Complete stacktrace of thread 21
Can you also instrument
Most likely the problem is that
I converted instances of
This actually did something: now I quite consistently get a segfault, seemingly triggered by Python, with this stacktrace:
Any chance to run the failing process in valgrind?
I don't have any control over the failing process itself; it's a task run inside of the

In other news, I made a few more tests focusing on the version of dask. First, I noticed this type of error happening between one test and the next:
Initially I thought these errors were just a by-product of the errors coming from

So I went on and used

I was able to identify the following scenarios, depending on a combination of the status of ROOT and the version of dask:
So it seems that a mix of adding the fix suggested by Philippe (a

More details about the
Fixes #11515. This method leads to contention in some specific scenarios (see linked issue). Co-authored-by: Philippe Canal <[email protected]>
Hi @pcanal, @Axel-Naumann, @vgvassilev, @vepadulano, @jalopezg-r00t, it appears this issue is closed but was not yet added to a project. Please add upcoming versions that will include the fix, or 'not applicable' otherwise. Sincerely,
There's a second part to fixing this issue for good, i.e. also making sure we don't see weird crashes/segfaults due to
This is needed to avoid major bugs in dask, see #11515 for details.
The problem
Sometimes the distributed RDataFrame test of the `RunGraphs` functionality fails, for example in this CI run. The visible error from the user side comes from Dask, which serializes a Python `RuntimeError` and produces some output log on stdout as follows:

The Python `RuntimeError` in turn is just a wrapper of the C++ `std::runtime_error` which comes from this function in the RDF machinery. That error suggests a failure in `TInterpreter::Calc`, which should be visible with some kind of output on stderr showing the compilation error. Unfortunately, stderr is completely empty: no error is shown by `TInterpreter`.
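To make that error path concrete, here is a small standalone sketch; the task body is a hypothetical stand-in, not the DistRDF mapper itself. It illustrates that the client only receives the exception serialized by the Dask worker, while anything the worker wrote to its own stderr never reaches the user's terminal:

```python
# Hypothetical sketch, not the failing test: an exception raised inside a Dask
# task is serialized on the worker and re-raised on the client, which is why
# only the Python RuntimeError (and not the interpreter's stderr output) is
# visible from the user's side.
from dask.distributed import Client, LocalCluster


def task():
    # Stand-in for the DistRDF mapper task in which TInterpreter::Calc fails
    # and the resulting C++ std::runtime_error surfaces as a RuntimeError.
    raise RuntimeError("An error occurred during just-in-time compilation")


if __name__ == "__main__":
    with LocalCluster(n_workers=1, processes=True) as cluster, Client(cluster) as client:
        future = client.submit(task)
        try:
            future.result()
        except RuntimeError as err:
            print("Client only sees the serialized exception:", err)
```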
By adding a bunch of print statements here and there, we can see some more details of this issue (the patch will be attached). Adding these two lines in `cling::IncrementalExecutor::jitInitOrWrapper` provides a way to stop the execution when the offending code is triggered, so that we can step in with `gdb -p PID_OF_FAILING_DISTRDF_TASK`:

It shows that there are many threads in flight (21!), even though the distributed task, from the RDF point of view, runs sequentially. The two most interesting ones are threads 1 and 21:

Thread 1:

Thread 21:

Which, as a first instinct, hints at some possible contention between the different things cling is doing in the two threads. For the moment I cannot come up with an easier reproducer; see the next section for more details.
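A note on the pause that makes the `gdb` attach above possible: in this issue it is added directly inside cling, in C++ (see the attached patch). A rough Python-side equivalent, with a hypothetical helper name, could look like this:

```python
# Hypothetical Python-side equivalent of the pause added in cling: print the
# PID of the task process and block, so that `gdb -p <PID>` can be attached
# before the process exits. This helper is illustrative, not part of the patch.
import os
import time


def pause_for_debugger(reason, seconds=600):
    # Print enough information to locate and attach to this process.
    print(f"[debug] {reason}: attach with 'gdb -p {os.getpid()}'", flush=True)
    time.sleep(seconds)


# Example usage inside a task wrapper (hypothetical function name):
# try:
#     run_distrdf_task()
# except RuntimeError:
#     pause_for_debugger("TInterpreter::Calc failed")
#     raise
```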
How to reproduce
Here is the patch adding the print statements that show the PID (as .txt so that I can attach it to this issue):
0001-Print-statements-for-DistRDF-Cling-failure.txt
The following is a Python script with the test
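A minimal sketch of such a test might look like the following; the empty-source dataset, the number of graphs, and the local-cluster parameters are assumptions rather than the actual script, and the `daskclient`/`npartitions` keywords are those of the DistRDF API at the time of this issue:

```python
# Minimal sketch of a distributed RunGraphs test over Dask; the dataset, number
# of graphs and cluster parameters are assumptions, not the original test script.
import ROOT
from dask.distributed import Client, LocalCluster

RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
RunGraphs = ROOT.RDF.Experimental.Distributed.RunGraphs


def main():
    with LocalCluster(n_workers=1, threads_per_worker=1, processes=True) as cluster, \
            Client(cluster) as client:
        # Build several distributed graphs and trigger them all at once, which
        # is the pattern exercised by the failing test.
        histos = []
        for _ in range(4):
            df = RDataFrame(100, daskclient=client, npartitions=2)
            histos.append(df.Define("x", "gRandom->Rndm()").Histo1D("x"))
        RunGraphs(histos)
        print([h.GetEntries() for h in histos])


if __name__ == "__main__":
    main()
```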
And here is a bash script that runs the previous Python script in a loop until it fails
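To keep all examples here in Python, an equivalent retry loop can be sketched with `subprocess`; the test script file name is hypothetical:

```python
# Hypothetical Python equivalent of the bash retry loop: re-run the test
# script until it exits with a non-zero status, then show its output.
# The script name is an assumption for illustration.
import subprocess
import sys


def run_until_failure(script="distrdf_rungraphs_test.py"):
    attempt = 0
    while True:
        attempt += 1
        result = subprocess.run([sys.executable, script],
                                capture_output=True, text=True)
        print(f"attempt {attempt}: exit code {result.returncode}")
        if result.returncode != 0:
            print(result.stdout)
            print(result.stderr, file=sys.stderr)
            return


if __name__ == "__main__":
    run_until_failure()
```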
Setup
ROOT master@b13756d
Python 3.10.7
Java 19
pyspark 3.3.0
dask/distributed 2022.7.1
ROOT built with:
Additional context
I am tentatively assigning this also to @Axel-Naumann, @vgvassilev and @jalopezg-r00t, who may have an easier time understanding the output from gdb and could help in creating a simpler reproducer.