Describe the bug
I've started testing with large scale factor data sets and I am seeing TPC-H query 3 hang, possibly during execution of query stage 3.
Here are the data sizes of the shuffle output directories for the different query stages at the time the query appears to have stopped executing.
Query stages 1, 2, and 4 have 48 shuffle files for each output partition, as expected. Query stage 3 only has 3 shuffle output files for each output partition, which doesn't seem right.
The last output I see in the scheduler process is:
INFO ballista_scheduler] Sending new task to 3965aec5-ca89-4853-90ee-91f56e23a979: RpXfVVN/3/12
Here is some output from one partition from query stage 3 that did complete (output partitions 2, 14, and 22 completed).
Additional context
Running on a 24-core Threadripper with 64 GB RAM.
Before the hang, things were looking good - cores were being kept relatively busy and overall system memory use was only 12 GB and stayed pretty flat throughout.
This feels like a deadlock somewhere. I wonder if the shuffle reader is unable to read partitions because the executor has run out of threads to handle incoming Flight requests. I will add some debug logging and explore that next.
To Reproduce
Generate data set using tpctools crate.
Run a scheduler:
Run an executor:
Run the benchmark:
Expected behavior
Query should complete.