
[FEA] semaphore optimization in shuffled hash join #4539

Closed
abellina opened this issue Jan 14, 2022 · 0 comments · Fixed by #4588
Labels: performance (A performance related task/issue)

This is an optimization we identified while looking into q23a/b, q24a/b, and several other TPC-DS queries. Currently, the shuffled hash join code follows these steps to materialize the build and stream sides when it first starts working on a partition (a sketch follows the list):

  1. It fetches the build side, concatenates it on the host, grabs the semaphore and puts it on the GPU
  2. It fetches the first stream side batch, concatenates it on the host, and puts it on the GPU (while continuing to hold on to the semaphore)
  3. It performs the join
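
For illustration, here is a minimal Scala sketch of that ordering. All of the names (`HostBatch`, `GpuBatch`, `GpuSemaphoreSketch`, `fetchAndConcatOnHost`, `toGpu`, `doJoin`) are hypothetical stand-ins rather than the actual spark-rapids API; the only point is where the semaphore is acquired relative to the stream-side IO.

```scala
// Hypothetical stand-ins for host/device batches and the GPU semaphore; the
// real spark-rapids types and helpers differ.
case class HostBatch(sizeBytes: Long)
case class GpuBatch(sizeBytes: Long)

object GpuSemaphoreSketch {
  private val sem = new java.util.concurrent.Semaphore(1)
  def acquire(): Unit = sem.acquire()
  def release(): Unit = sem.release()
}

object CurrentOrdering {
  // Stand-in for "shuffle fetch + concatenate on the host" (CPU/IO work).
  def fetchAndConcatOnHost(batches: Iterator[HostBatch]): HostBatch =
    HostBatch(batches.map(_.sizeBytes).sum)

  // Stand-ins for the host-to-device copy and for the join itself.
  def toGpu(hb: HostBatch): GpuBatch = GpuBatch(hb.sizeBytes)
  def doJoin(build: GpuBatch, stream: GpuBatch): GpuBatch =
    GpuBatch(build.sizeBytes + stream.sizeBytes)

  def joinFirstBatch(buildSide: Iterator[HostBatch],
                     streamSide: Iterator[HostBatch]): GpuBatch = {
    val buildHost = fetchAndConcatOnHost(buildSide)    // (1) build side fetched + concatenated on host
    GpuSemaphoreSketch.acquire()                       // (1) semaphore grabbed here...
    val buildGpu = toGpu(buildHost)                    // (1) ...and build side copied to the GPU
    val streamHost = fetchAndConcatOnHost(streamSide)  // (2) stream-side IO happens while the
                                                       //     semaphore is still held
    val streamGpu = toGpu(streamHost)                  // (2) stream batch copied to the GPU
    doJoin(buildGpu, streamGpu)                        // (3) join on the GPU
  }
}
```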

The observation from traces is that grabbing the semaphore in (1) means we are holding the semaphore while the IO for the first stream batch takes place in (2). That is CPU-side work that we should be able to do outside of the semaphore, and it can take a non-trivial amount of time: in q23a/b several seconds are spent in this mode.

A proof of concept was coded that does this instead (sketched after the list):

  1. It fetches the build side and concatenates it on the host.
  2. The stream side is allowed to fetch its first batch, concatenate it on the host, acquire the semaphore, and put the stream batch on the GPU.
  3. The build side is allowed to go to the GPU.
  4. It performs the join.
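
Reusing the stand-ins from the sketch above, the proof-of-concept reordering looks roughly like this; again, the names are illustrative and only the position of the semaphore acquisition relative to the stream-side IO is the point.

```scala
object ReorderedPoC {
  import CurrentOrdering.{fetchAndConcatOnHost, toGpu, doJoin}

  def joinFirstBatch(buildSide: Iterator[HostBatch],
                     streamSide: Iterator[HostBatch]): GpuBatch = {
    val buildHost  = fetchAndConcatOnHost(buildSide)   // (1) build side stays on the host for now
    val streamHost = fetchAndConcatOnHost(streamSide)  // (2) stream-side IO happens before the
                                                       //     semaphore is acquired
    GpuSemaphoreSketch.acquire()                       // (2) semaphore grabbed only now
    val streamGpu = toGpu(streamHost)                  // (2) stream batch copied to the GPU
    val buildGpu  = toGpu(buildHost)                   // (3) build side allowed onto the GPU
    doJoin(buildGpu, streamGpu)                        // (4) join on the GPU
  }
}
```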

This has led to savings of ~10s in q23a/b, q24a/b, and others. Overall we see close to 2 minutes' worth of time spent in this mode.

The complicated part about this change is that it adds more host memory pressure, since many tasks would be doing their IO while each keeps a host-side batch around. One approach to keep this in check is to use the batch size goal as a limit: if we reach the limit, we would start grabbing the semaphore and copying batch-sized batches to the GPU (see the sketch below). I am still working through this part of it.
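
A hedged sketch of that guard, still using the stand-ins from above. The `batchSizeGoalBytes` parameter and the early copy-to-GPU fallback are assumptions about how the limit could be wired in, not the final design.

```scala
object HostPressureGuard {
  import CurrentOrdering.toGpu

  // Buffer stream batches on the host only up to the batch size goal. Once the
  // goal would be exceeded, acquire the semaphore and start copying what we have
  // to the GPU instead of accumulating more host memory.
  def bufferStreamSide(streamSide: Iterator[HostBatch],
                       batchSizeGoalBytes: Long): (Seq[HostBatch], Seq[GpuBatch]) = {
    val onHost = scala.collection.mutable.ArrayBuffer.empty[HostBatch]
    val onGpu  = scala.collection.mutable.ArrayBuffer.empty[GpuBatch]
    var hostBytes = 0L
    var semaphoreHeld = false

    streamSide.foreach { hb =>
      if (!semaphoreHeld && hostBytes + hb.sizeBytes <= batchSizeGoalBytes) {
        onHost += hb                     // still under the goal: keep the batch on the host
        hostBytes += hb.sizeBytes
      } else {
        if (!semaphoreHeld) {
          GpuSemaphoreSketch.acquire()   // goal reached: grab the semaphore now
          semaphoreHeld = true
        }
        onGpu += toGpu(hb)               // ...and copy batch-sized batches to the GPU
      }
    }
    (onHost.toSeq, onGpu.toSeq)
  }
}
```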
