This is an optimization we identified while looking into q23a/b, q24a/b, and several other TPCDS queries. Currently, the shuffled hash join code follows these steps to materialize the build and stream sides when it first starts working on a partition:
1. It fetches the build side, concatenates it on the host, grabs the semaphore, and puts it on the GPU.
2. It fetches the first stream-side batch, concatenates it on the host, and puts it on the GPU (while continuing to hold the semaphore).
3. It performs the join.
The observation from traces is that grabbing the semaphore in (1) means we are holding it while the IO for the first stream batch in (2) takes place. That IO is CPU-side work we should be able to do outside of the semaphore, and it can take a non-trivial amount of time: in q23a/b there are several seconds spent in this mode.
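To make the ordering concrete, here is a minimal, runnable Scala sketch of the current flow. `gpuSemaphore`, `fetchAndConcatOnHost`, and `copyToDevice` are illustrative stand-ins, not the actual plugin APIs:

```scala
import java.util.concurrent.Semaphore

object CurrentOrdering {
  case class HostBatch(rows: Int)
  case class DeviceBatch(rows: Int)

  // Stands in for the GPU semaphore that limits concurrent tasks on the device.
  val gpuSemaphore = new Semaphore(1)

  // Placeholder for shuffle fetch + host-side concatenation (CPU/IO work).
  def fetchAndConcatOnHost(): HostBatch = {
    Thread.sleep(100)
    HostBatch(rows = 1024)
  }

  // Placeholder for a host-to-device copy.
  def copyToDevice(hb: HostBatch): DeviceBatch = DeviceBatch(hb.rows)

  def currentJoinSetup(): Unit = {
    val buildHost = fetchAndConcatOnHost()    // (1) build side fetched + concatenated on host
    gpuSemaphore.acquire()                    // (1) semaphore taken here
    try {
      val buildDev = copyToDevice(buildHost)  // (1) build side goes to the GPU
      // (2) first stream batch IO runs while the semaphore is still held
      val streamHost = fetchAndConcatOnHost()
      val streamDev = copyToDevice(streamHost)
      // (3) perform the join
      println(s"join: build=${buildDev.rows} rows, stream=${streamDev.rows} rows")
    } finally {
      gpuSemaphore.release()
    }
  }
}
```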
A proof of concept was coded that does this instead:
1. It fetches the build side and concatenates it on the host.
2. The stream side is allowed to fetch its first batch and concatenate it on the host; the semaphore is then acquired and the stream batch is put on the GPU.
3. The build side is then allowed to go to the GPU.
4. It performs the join.
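A sketch of the reordered flow, reusing the placeholder helpers from the sketch above (again, these names are stand-ins for the real code paths). The key difference is that the semaphore is acquired only after the host-side IO for the first stream batch has completed:

```scala
import CurrentOrdering._

object ProposedOrdering {
  def proposedJoinSetup(): Unit = {
    val buildHost  = fetchAndConcatOnHost()    // 1. build side fetched + concatenated, kept on host
    val streamHost = fetchAndConcatOnHost()    // 2. first stream batch IO, semaphore NOT held yet
    gpuSemaphore.acquire()                     //    semaphore taken only once host-side IO is done
    try {
      val streamDev = copyToDevice(streamHost) // 2. stream batch goes to the GPU
      val buildDev  = copyToDevice(buildHost)  // 3. build side is now allowed onto the GPU
      // 4. perform the join
      println(s"join: build=${buildDev.rows} rows, stream=${streamDev.rows} rows")
    } finally {
      gpuSemaphore.release()
    }
  }
}
```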
This has led to savings of ~10s in q23a/b, q24a/b, and others; overall we see close to 2 minutes' worth of time spent this way.
The complicated part about this change is that it adds more host memory pressure, since many tasks would be doing their IO concurrently while each keeps a host-side batch. One approach to keep this in check is to use the batch size goal as a limit: once a task reaches the limit, it would start grabbing the semaphore and copying batch-sized batches to the GPU. I am still working through this part of it.
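A rough sketch of what limiting by the batch size goal could look like. This is one possible interpretation, not the actual implementation; `targetBatchBytes` and the helpers below are hypothetical:

```scala
import java.util.concurrent.Semaphore
import scala.collection.mutable.ArrayBuffer

object BoundedHostBuffering {
  case class HostBatch(bytes: Long)

  val gpuSemaphore = new Semaphore(1)
  // Stands in for the configured batch size goal; the real value would come from config.
  val targetBatchBytes: Long = 128L * 1024 * 1024

  // Placeholder for copying a group of host batches to the device as one chunk.
  def copyChunkToDevice(batches: Seq[HostBatch]): Unit =
    println(s"copied ${batches.map(_.bytes).sum} bytes to device")

  def materializeStreamSide(incoming: Iterator[HostBatch]): Unit = {
    val pending = ArrayBuffer.empty[HostBatch]
    var pendingBytes = 0L
    var semaphoreHeld = false

    def flushToGpu(): Unit = {
      if (!semaphoreHeld) { gpuSemaphore.acquire(); semaphoreHeld = true }
      copyChunkToDevice(pending.toSeq)
      pending.clear()
      pendingBytes = 0L
    }

    incoming.foreach { hb =>
      pending += hb
      pendingBytes += hb.bytes
      // Once the host-side buffer reaches the batch size goal, stop deferring:
      // take the semaphore and move a batch-sized chunk to the GPU.
      if (pendingBytes >= targetBatchBytes) flushToGpu()
    }
    // Whatever is left on the host (or everything, if the goal was never reached)
    // goes to the GPU at the end; the semaphore stays held so the join can run next.
    if (pending.nonEmpty || !semaphoreHeld) flushToGpu()
  }
}
```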