
[FEA] semaphore optimization in shuffled hash join #4539

Closed
abellina opened this issue Jan 14, 2022 · 0 comments · Fixed by #4588
Labels: performance (A performance related task/issue)

This is an optimization we identified while looking into q23a/b, q24a/b, and several other TPC-DS queries. Currently, the shuffled hash join code follows these steps to materialize the build and stream sides when it first starts working on a partition (a sketch follows the list):

  1. It fetches the build side, concatenates it on the host, grabs the semaphore and puts it on the GPU
  2. It fetches the first stream side batch, concatenates it on the host, and puts it on the GPU (while continuing to hold on to the semaphore)
  3. It performs the join
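
For illustration, here is a minimal Scala sketch of that ordering. All of the names (`HostBatch`, `GpuBatch`, `GpuSemaphoreSketch`, `fetchAndConcatOnHost`, `toGpu`, `doJoin`) are hypothetical stand-ins rather than the actual spark-rapids API; the only point is where the semaphore is acquired relative to the stream-side IO.

```scala
// Hypothetical stand-ins for host/device batches and the GPU semaphore; the
// real spark-rapids types and helpers differ.
case class HostBatch(sizeBytes: Long)
case class GpuBatch(sizeBytes: Long)

object GpuSemaphoreSketch {
  private val sem = new java.util.concurrent.Semaphore(1)
  def acquire(): Unit = sem.acquire()
  def release(): Unit = sem.release()
}

object CurrentOrdering {
  // Stand-in for "shuffle fetch + concatenate on the host" (CPU/IO work).
  def fetchAndConcatOnHost(batches: Iterator[HostBatch]): HostBatch =
    HostBatch(batches.map(_.sizeBytes).sum)

  // Stand-ins for the host-to-device copy and for the join itself.
  def toGpu(hb: HostBatch): GpuBatch = GpuBatch(hb.sizeBytes)
  def doJoin(build: GpuBatch, stream: GpuBatch): GpuBatch =
    GpuBatch(build.sizeBytes + stream.sizeBytes)

  def joinFirstBatch(buildSide: Iterator[HostBatch],
                     streamSide: Iterator[HostBatch]): GpuBatch = {
    val buildHost = fetchAndConcatOnHost(buildSide)    // (1) build side fetched + concatenated on host
    GpuSemaphoreSketch.acquire()                       // (1) semaphore grabbed here...
    val buildGpu = toGpu(buildHost)                    // (1) ...and build side copied to the GPU
    val streamHost = fetchAndConcatOnHost(streamSide)  // (2) stream-side IO happens while the
                                                       //     semaphore is still held
    val streamGpu = toGpu(streamHost)                  // (2) stream batch copied to the GPU
    doJoin(buildGpu, streamGpu)                        // (3) join on the GPU
  }
}
```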

The observation from traces is that grabbing the semaphore in (1) means we are holding the semaphore while the IO for the first stream batch takes place in (2). That is CPU-side work that we should be able to do outside of the semaphore, and it can take a non-trivial amount of time: in q23a/b several seconds are spent in this mode.

A proof of concept was coded that does this instead (sketched after the list):

  1. It fetches the build side and concatenates it on the host.
  2. The stream side is allowed to fetch its first batch, concatenate it on the host, acquire the semaphore, and put the stream batch on the GPU.
  3. The build side is allowed to go to the GPU.
  4. It performs the join.
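
Reusing the stand-ins from the sketch above, the proof-of-concept reordering looks roughly like this; again, the names are illustrative and only the position of the semaphore acquisition relative to the stream-side IO is the point.

```scala
object ReorderedPoC {
  import CurrentOrdering.{fetchAndConcatOnHost, toGpu, doJoin}

  def joinFirstBatch(buildSide: Iterator[HostBatch],
                     streamSide: Iterator[HostBatch]): GpuBatch = {
    val buildHost  = fetchAndConcatOnHost(buildSide)   // (1) build side stays on the host for now
    val streamHost = fetchAndConcatOnHost(streamSide)  // (2) stream-side IO happens before the
                                                       //     semaphore is acquired
    GpuSemaphoreSketch.acquire()                       // (2) semaphore grabbed only now
    val streamGpu = toGpu(streamHost)                  // (2) stream batch copied to the GPU
    val buildGpu  = toGpu(buildHost)                   // (3) build side allowed onto the GPU
    doJoin(buildGpu, streamGpu)                        // (4) join on the GPU
  }
}
```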

This has led to savings of ~10s in q23a/b, q24a/b, and others. Overall we see close to 2 minutes' worth of time spent in this mode.

The complicated part about this change is that it adds more host memory pressure, since many tasks would be doing their IO while each keeps a host-side batch around. One approach to keep this in check is to use the batch size goal as a limit: if we reach the limit, we would start grabbing the semaphore and copying batch-sized batches to the GPU (see the sketch below). I am still working through this part of it.
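
A hedged sketch of that guard, still using the stand-ins from above. The `batchSizeGoalBytes` parameter and the early copy-to-GPU fallback are assumptions about how the limit could be wired in, not the final design.

```scala
object HostPressureGuard {
  import CurrentOrdering.toGpu

  // Buffer stream batches on the host only up to the batch size goal. Once the
  // goal would be exceeded, acquire the semaphore and start copying what we have
  // to the GPU instead of accumulating more host memory.
  def bufferStreamSide(streamSide: Iterator[HostBatch],
                       batchSizeGoalBytes: Long): (Seq[HostBatch], Seq[GpuBatch]) = {
    val onHost = scala.collection.mutable.ArrayBuffer.empty[HostBatch]
    val onGpu  = scala.collection.mutable.ArrayBuffer.empty[GpuBatch]
    var hostBytes = 0L
    var semaphoreHeld = false

    streamSide.foreach { hb =>
      if (!semaphoreHeld && hostBytes + hb.sizeBytes <= batchSizeGoalBytes) {
        onHost += hb                     // still under the goal: keep the batch on the host
        hostBytes += hb.sizeBytes
      } else {
        if (!semaphoreHeld) {
          GpuSemaphoreSketch.acquire()   // goal reached: grab the semaphore now
          semaphoreHeld = true
        }
        onGpu += toGpu(hb)               // ...and copy batch-sized batches to the GPU
      }
    }
    (onHost.toSeq, onGpu.toSeq)
  }
}
```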
