Kernel copy for pinned memory #15934
Changes from 66 commits
I assume we don't care about anything here since I expect this will stay internal, but for user-facing enums we usually provide a storage class.
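For reference, a minimal sketch of what "providing a storage class" means here: fixing the enum's underlying type. The enum name and values below are illustrative, not the actual cudf declaration.

```cpp
#include <cstdint>

// Specifying a fixed underlying type ("storage class") on a scoped enum
// makes its size and ABI explicit, which matters for user-facing enums.
enum class copy_kind : std::int32_t { PINNED_TO_DEVICE, DEVICE_TO_PINNED };

static_assert(sizeof(copy_kind) == sizeof(std::int32_t),
              "enum size is pinned down by the underlying type");
```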
`copy_kind` seems somewhat generic, like something that could be in `cudf/copying.hpp`. Should we be more explicit with something like `memcopy_kind`?
Sure. It's equivalent to `cudaMemcpyKind`, so this naming matches better.
Renamed to reflect that only the host memory type is specified now.
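A minimal sketch of what the renamed interface might look like after this change (names are hypothetical, not the actual cudf API): only the host side's memory type is passed, and the transfer direction is implied by the src/dst argument order.

```cpp
#include <cstddef>

// Hypothetical names for illustration; the real cudf API may differ.
enum class host_memory_kind { PINNED, PAGEABLE };

// Direction (H2D vs. D2H) is implied by which pointer is host memory;
// only the host buffer's kind needs to be specified. The stream parameter
// is omitted here to keep the sketch self-contained.
void cuda_memcpy_async(void* dst, void const* src, std::size_t size,
                       host_memory_kind kind);
```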
Why is `copy_kind` needed at all? There is exactly one case (pinned memory, size below the threshold) where you do anything other than pass through to `cudaMemcpyAsync`. You can detect that case with `cudaPointerGetAttributes` and call Thrust for that one case, and just call `cudaMemcpyAsync(cudaMemcpyDefault)` for everything else.
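The suggested shape, sketched as host-side C++ with the attribute query and copy paths stubbed out so the dispatch logic is visible (real code would use `cudaPointerGetAttributes`, `thrust::copy`, and `cudaMemcpyAsync`; the threshold value and all names here are illustrative):

```cpp
#include <cstddef>

// Stand-ins so the sketch compiles without CUDA headers.
enum class memory_type { DEVICE, PINNED, PAGEABLE };
enum class copy_path { KERNEL, COPY_ENGINE };

// Illustrative threshold; the right value is an open question (see below).
constexpr std::size_t kernel_copy_threshold = 64 * 1024;

// Use a kernel copy (thrust::copy) only for small pinned buffers;
// everything else passes through to cudaMemcpyAsync(cudaMemcpyDefault),
// which infers the direction from the pointers.
copy_path choose_copy_path(memory_type host_side, std::size_t size)
{
  if (host_side == memory_type::PINNED && size < kernel_copy_threshold) {
    return copy_path::KERNEL;
  }
  return copy_path::COPY_ENGINE;
}
```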
It's possible that we'll eventually have a separate threshold for pageable copies, where we copy to a pinned buffer and then `thrust::copy`. @abellina had this in the POC implementation, and IIRC it was helpful even with the extra copy.

I understand the current implementation is just a wrapper; I just wanted to leave room for more complex behavior without future changes to the API.
OK, I see. Does the direction affect the choice at all? Could that reduce the four cases to two?
Reduced to two cases; only the host memory type is specified now.

I can also add an AUTO/DEFAULT option that would call `cudaPointerGetAttributes`. Let me know what you think.
Do we want another name for this, given that it does not always call `cudaMemcpyAsync`? Proposing `cudf_memcpy_async`.

(Happy to go either way on this; the status quo is fine.)
I don't like including `cudf` in the name when it's already in the `cudf` namespace. I named it this way to make it obvious that it replaces the use of `cudaMemcpyAsync`. That said, I could probably be convinced to rename it; I'm not tied to any specific name.
I'm inclined to agree; I don't like duplicating the namespace name in objects already within the namespace. That only encourages bad practices like `using` declarations that import the namespace members.
Are there any "magic" sizes where we expect one strategy to outperform the other (a page size, a multiple of 1 KiB, or similar)? Or is this purely empirical?
Fair to say that we don't know what the right value is for this (yet?). It's likely to be empirical, since the only goal is to avoid too many copies going through the copy engine.
Let’s do a sweep over threshold values for the next steps where we enable this more broadly. I would like something closer to a microbenchmark (copy back and forth for different sizes with different thresholds?) than the multithreaded Parquet benchmark.
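One measurement of such a sweep could look like the sketch below, with `std::memcpy` standing in for the copy under test (a real microbenchmark would time `cudaMemcpyAsync` vs. the kernel path on a stream, per size and per threshold; all names here are illustrative):

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Time `repetitions` copies of `size` bytes and return elapsed microseconds.
// std::memcpy is a stand-in; the real benchmark would compare the copy-engine
// and kernel paths across a grid of sizes and threshold values.
double time_copies_us(std::size_t size, int repetitions)
{
  std::vector<char> src(size, 'x'), dst(size);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repetitions; ++i) {
    std::memcpy(dst.data(), src.data(), size);
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(end - start).count();
}
```

Sweeping `time_copies_us` over powers-of-two sizes for each candidate threshold would give the copy-back-and-forth microbenchmark described above.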
Currently the code is pretty repetitive/pointless. The implementation is meant to leave room for more complex behavior without changes to the API in `cuda_memcpy.hpp`.
Is the purpose of this transparent passthrough just to have a function name that clearly indicates the direction of the transfer? You still have to get the src/dst order correct, though, so does that really help much? It seems duplicative, especially for something in an anonymous namespace inside detail that you're only using internally.
Same for pageable below.
The reason was that I wanted to allow different behavior for H2D and D2H without changing the header. But now that the entire implementation is in the source file, we can simplify this and separate the implementations only when we actually need to.
Agree. I really think you only need one function, no dispatch.
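The collapsed, dispatch-free shape could look like this sketch, with `std::memcpy` standing in for `cudaMemcpyAsync(..., cudaMemcpyDefault, stream)` (names illustrative, not the actual cudf code):

```cpp
#include <cstddef>
#include <cstring>

// One direction-agnostic function, no h2d/d2h split. With
// cudaMemcpyDefault the driver infers the direction from the pointers,
// so the wrapper itself needs no dispatch. std::memcpy is a stand-in.
void copy_bytes(void* dst, void const* src, std::size_t size)
{
  std::memcpy(dst, src, size);
}
```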