
Ballista: TPC-H q3 @ SF=1000 never completes #835

Closed
andygrove opened this issue Aug 7, 2021 · 2 comments
Labels
bug Something isn't working

@andygrove (Member)

Describe the bug
I've started testing with large scale factor data sets and I am seeing TPC-H query 3 hang, possibly during execution of query stage 3.

Here are the data sizes of the shuffle output directories for the different query stages at the time the query appears to have stopped executing.

$ du -h -d 1 /mnt/bigdata/temp/RpXfVVN/
616M	/mnt/bigdata/temp/RpXfVVN/1
890M	/mnt/bigdata/temp/RpXfVVN/3
21G	/mnt/bigdata/temp/RpXfVVN/2
92G	/mnt/bigdata/temp/RpXfVVN/4
113G	/mnt/bigdata/temp/RpXfVVN/

Query stages 1, 2, and 4 have 48 shuffle files for each output partition, as expected. Query stage 3 only has 3 shuffle output files for each output partition, which doesn't seem right.

The last output I see in the scheduler process is:

INFO  ballista_scheduler] Sending new task to 3965aec5-ca89-4853-90ee-91f56e23a979: RpXfVVN/3/12

Here is some output from one partition of query stage 3 that did complete (output partitions 2, 14, and 22 completed).

=== [RpXfVVN/3/14] Physical plan with metrics ===
ShuffleWriterExec: Some(Hash([Column { name: "o_orderkey", index: 2 }], 24)), metrics=[outputRows=6072170, writeTime=1085450524, inputRows=6072170]
  CoalesceBatchesExec: target_batch_size=4096, metrics=[]
    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "c_custkey", index: 0 }, Column { name: "o_custkey", index: 1 })], metrics=[outputBatches=7073, inputRows=30396981, inputBatches=7073, outputRows=6072170, joinTime=1458]
      CoalesceBatchesExec: target_batch_size=4096, metrics=[]
        ShuffleReaderExec: partition_locations(24)=...
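
The Hash([Column { name: "o_orderkey", ... }], 24) partitioning in the plan above means the shuffle writer routes each row to one of 24 output partitions by hashing the join key, so matching keys always land in the same downstream partition. A minimal sketch of the idea (illustrative only; Ballista's actual implementation partitions whole record batches and uses its own hash function):

```rust
// Sketch (not Ballista's actual code): hash-based assignment of a row
// to one of N shuffle output partitions, keyed on o_orderkey.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn output_partition(o_orderkey: i64, num_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    o_orderkey.hash(&mut hasher);
    hasher.finish() % num_partitions
}

fn main() {
    // The same key always maps to the same partition, which is what lets
    // the downstream partitioned HashJoinExec see matching keys together.
    let p1 = output_partition(42, 24);
    let p2 = output_partition(42, 24);
    assert_eq!(p1, p2);
    assert!(p1 < 24);
    println!("key 42 -> partition {p1}");
}
```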

To Reproduce

Generate the data set using the tpctools crate.

cargo install tpctools
tpctools generate --benchmark tpch --scale 1000 --partitions 48 --generator-path /mnt/bigdata/tpch-dbgen --output /mnt/bigdata/tpch-sf1000/

Run a scheduler:

RUST_LOG=info ./target/release/ballista-scheduler

Run an executor:

RUST_LOG=info ./target/release/ballista-executor -c 24 --work-dir /mnt/bigdata/temp

Run the benchmark:

../target/release/tpch benchmark ballista --path /mnt/bigdata/tpch-sf1000/ --format tbl --iterations 1 --query 3 --debug --host localhost --port 50050 --shuffle-partitions 24

Expected behavior
Query should complete.

Additional context
Running on a 24-core Threadripper with 64 GB RAM.

Before the hang, things were looking good: cores were being kept relatively busy, and overall system memory use was only 12 GB and stayed fairly flat throughout.

(screenshot: ballista-tpch-sf1000)

@andygrove andygrove added bug Something isn't working ballista labels Aug 7, 2021
@andygrove andygrove self-assigned this Aug 7, 2021
@andygrove (Member, Author)

This feels like a deadlock somewhere. I wonder if the shuffle reader is unable to read partitions because the executor has run out of threads to handle incoming Flight requests. I will add some debug logging and explore that next.
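
The suspected starvation can be stated as a simple invariant: if every executor thread is occupied by a task that is itself blocked waiting on a Flight response, no thread remains to serve that response. A toy model of the condition (hypothetical, not Ballista code):

```rust
// Toy model of thread-pool starvation (hypothetical, not Ballista code):
// progress requires at least one worker thread that is not blocked
// waiting on a response served by the same pool.
fn can_make_progress(worker_threads: usize, blocked_tasks: usize) -> bool {
    blocked_tasks < worker_threads
}

fn main() {
    // 24 executor threads (-c 24), all busy on tasks blocked on shuffle reads:
    assert!(!can_make_progress(24, 24)); // nothing left to serve Flight requests
    assert!(can_make_progress(24, 23));  // one free thread keeps the system live
    println!("ok");
}
```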

@andygrove (Member, Author)

I ran this again and it succeeded, so maybe I was just not being patient enough.
