Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize semaphore acquisition in GpuShuffledHashJoinExec #4588

Merged
merged 14 commits into from
Feb 8, 2022

Conversation

abellina
Copy link
Collaborator

Closes #4539.

This PR adds an optimization for the shuffled hash join where the semaphore is not taken after the build side is fetched to the host (since we attempt to keep the build side on the host while we load the first batch from the stream side).

After the first stream batch is loaded, the semaphore has been acquired. We then proceed to bring the build batch to the GPU.

This code falls back to the old behavior in cases where batches are not serialized (our input is not a shuffle) and when the streams are empty.

It is in draft mode because I need to run more testing in NDS. I don't think the case where I need to loop over the build batch multiple times has been hit with my runs so far (other than unit tests), so I need to figure out a way to trigger that case.

@abellina abellina added the performance A performance related task/issue label Jan 20, 2022
@abellina abellina added this to the Jan 10 - Jan 28 milestone Jan 20, 2022
@revans2
Copy link
Collaborator

revans2 commented Jan 21, 2022

So just to be clear this optimization can never work with UCX right?

@abellina
Copy link
Collaborator Author

So just to be clear this optimization can never work with UCX right?

Great point. I have missed a case where we need to ignore the config when UCX is turned on and we have compression on. If we don't do this, the fallback case right now doesn't handle compressed batches from the shuffle.

/**
* Removes `GpuCoalesceBatches(GpuShuffleCoalesceExec(build side))` for the build side
* for the shuffled hash join. The coalesce logic has been moved to the
* `GpuShuffleCoalesceExec` class, and is handled differently to prevent holding onto the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, but I really would prefer to have us just build the plan the right way from the beginning. Perhaps we can have a flag in GpuExec that says for this input I know how to handle raw shuffle data so then the rules that insert the host side coalesce can deal with it there.

@abellina
Copy link
Collaborator Author

I added a commit that I think addresses many of the points, enough has changed that I think this needs another pass.

One comment from Jason wasn't addressed for sure:

Do we really need a buffered iterator on the stream-side batch? We just need to call hasNext which will grab the semaphore but I don't see the need for us to call next on the stream-side iterator. That also side-steps another bug here where we create a closeable buffered iterator but pass it to code that will never call close on it since it's just a regular iterator to that code.

We need something to hold on to that first stream batch. That doJoin pays no attention to this, so if there is an exception before that the first batch is popped, then the batch will be leaked.

I think one ugly option is to attach the closeable iterator to the task completion logic. I think we discussed doing this, but I am confirming that's the approach or if there's something better.

@abellina abellina marked this pull request as ready for review February 2, 2022 16:22
jlowe
jlowe previously approved these changes Feb 2, 2022
@abellina
Copy link
Collaborator Author

abellina commented Feb 2, 2022

build

@abellina
Copy link
Collaborator Author

abellina commented Feb 3, 2022

Looking into the test failures.

@abellina
Copy link
Collaborator Author

abellina commented Feb 3, 2022

build

@abellina
Copy link
Collaborator Author

abellina commented Feb 4, 2022

@jlowe @revans2 could you take another look when you have a chance?

jlowe
jlowe previously approved these changes Feb 4, 2022
@@ -448,9 +458,84 @@ class GpuCoalesceIterator(iter: Iterator[ColumnarBatch],
batches.append(SpillableColumnarBatch(batch, SpillPriorities.ACTIVE_BATCHING_PRIORITY,
spillCallback))

protected def popAll(): Array[ColumnarBatch] = {
closeOnExcept(batches.map(_.getColumnarBatch())) { wip =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does this need to be a safeMap?

@abellina
Copy link
Collaborator Author

abellina commented Feb 4, 2022

@revans2 I replied to your comment just now in the original request, but adding a link here so we can continue if there's more I need to look into:

Still would like to know the answer to this. The things in the iterator need to be closed and can potentially be very large.

#4588 (comment)

@abellina
Copy link
Collaborator Author

abellina commented Feb 5, 2022

build

revans2
revans2 previously approved these changes Feb 7, 2022
@abellina
Copy link
Collaborator Author

abellina commented Feb 7, 2022

Ok I have a bug with safeMap, I thought I had built this before I pushed it last, but obviously I didn't. I am going to retest the patch against real queries to be sure.

@abellina
Copy link
Collaborator Author

abellina commented Feb 7, 2022

build

@abellina
Copy link
Collaborator Author

abellina commented Feb 7, 2022

@jlowe @revans2 ok apologies. Now this should be ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] semaphore optimization in shuffled hash join
4 participants