[FEA] explore faster data transitions #507

Open
revans2 opened this issue Aug 4, 2020 · 1 comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf epic Issue that encompasses a significant feature or body of work performance A performance related task/issue

Comments

revans2 (Collaborator) commented Aug 4, 2020

When working with some of the cache/persist operations, it has become very clear that moving data from CPU to GPU and back is a real performance problem. From past experience trying to optimize shuffle, part of the problem comes down to the number of buffers that need to be moved, and this is going to become more and more of a problem with nested data types. The rest of the problem has a lot to do with the actual data access pattern. Going from row to column and column to row inherently forces one of the operations to stride through memory, which is really bad for the CPU cache.
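To make the striding point concrete, here is a minimal sketch of a row-to-column conversion over a flat, row-major buffer of fixed-width values. The function names and the tiny 3x3 data set are made up for illustration and are not part of the plugin or cudf. It only shows which offsets get touched: gathering one column reads slots that are `num_cols` apart, so either the read side or the write side of the conversion ends up jumping through memory.

```python
def rows_to_columns(row_major, num_rows, num_cols):
    """Transpose a flat row-major buffer into one list per column."""
    columns = []
    for c in range(num_cols):
        # Consecutive values of one column sit num_cols slots apart in the
        # row-major buffer, so this read strides through memory.
        columns.append([row_major[r * num_cols + c] for r in range(num_rows)])
    return columns


def read_offsets(num_rows, num_cols, col):
    """Offsets touched in the row-major buffer when gathering one column."""
    return [r * num_cols + col for r in range(num_rows)]


# Three rows of three values each, stored row after row (illustrative data).
rows = [1, 10, 100,
        2, 20, 200,
        3, 30, 300]
cols = rows_to_columns(rows, num_rows=3, num_cols=3)
middle_column_offsets = read_offsets(3, 3, col=1)
```

With wide rows the gap between consecutive reads grows with the row width, so most of each fetched cache line goes unused on the CPU.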

We have tried in the past to write a custom kernel to translate GPU columnar data into Spark's unsafe row format. It did help some, but the memory format is really wasteful, and performance was bad because we could not allocate enough GPU memory to make it worthwhile.

I personally would like to see us work with the cudf team to develop a packed row-based format that we could translate to/from on the GPU. Doing a row-based to row-based translation is not that expensive for the CPU.
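As a rough sketch of what "packed" could mean here, the example below packs each row as one validity byte followed by the fields at their natural widths with no padding. The schema (int32, int8, float64), the layout, and the function names are all assumptions for illustration; this is not a format defined by cudf or the plugin. For this schema a row takes 14 bytes, whereas a layout that gives every field its own 8-byte word plus an 8-byte null bitset, which is how Spark's unsafe row is generally described, would take 32.

```python
import struct

# Hypothetical schema for illustration: (int32, int8, float64).
# "<" selects little-endian with no padding between fields.
ROW_FORMAT = "<Bibd"  # validity byte, int32, int8, float64
ROW_SIZE = struct.calcsize(ROW_FORMAT)
NUM_COLS = 3


def pack_rows(columns):
    """Pack columnar data (None marks a null) into one contiguous buffer."""
    num_rows = len(columns[0])
    out = bytearray(num_rows * ROW_SIZE)
    for r in range(num_rows):
        mask = 0
        values = []
        for c, col in enumerate(columns):
            v = col[r]
            if v is None:
                values.append(0)  # placeholder bytes for a null field
            else:
                mask |= 1 << c    # bit c set means column c is valid
                values.append(v)
        struct.pack_into(ROW_FORMAT, out, r * ROW_SIZE, mask, *values)
    return bytes(out)


def unpack_rows(buf):
    """Rebuild the columns from the packed row buffer."""
    columns = [[] for _ in range(NUM_COLS)]
    for mask, *values in struct.iter_unpack(ROW_FORMAT, buf):
        for c, v in enumerate(values):
            columns[c].append(v if mask & (1 << c) else None)
    return columns


columns = [[1, None, 3], [7, 8, None], [1.5, 2.5, 3.5]]
packed = pack_rows(columns)
restored = unpack_rows(packed)
```

The whole batch is a single buffer, which also addresses the buffer-count problem above: one transfer instead of one data buffer and one validity buffer per column. Variable-width and nested types are left out of this sketch.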

Tasks:

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Aug 4, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Aug 4, 2020
@sameerz sameerz added the performance A performance related task/issue label Dec 1, 2020
@sameerz sameerz added this to the Jan 4 - Jan 15 milestone Dec 18, 2020
@sameerz sameerz added epic Issue that encompasses a significant feature or body of work and removed feature request New feature or request labels Dec 18, 2020
@sameerz sameerz removed this from the Feb 1 - Feb 12 milestone Feb 12, 2021
@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Feb 18, 2021
@Salonijain27 Salonijain27 added this to the Nov 1 - Nov 12 milestone Oct 31, 2021
@hyperbolic2346 hyperbolic2346 removed this from the Nov 15 - Nov 26 milestone Nov 12, 2021
@sameerz sameerz added this to the Nov 15 - Nov 26 milestone Nov 14, 2021
@sameerz sameerz removed this from the Jan 10 - Jan 28 milestone Jan 30, 2022
sameerz (Collaborator) commented Jan 30, 2022

Removing this from the sprint milestones as this is an overarching feature with sub-tasks for each sprint.

@revans2 revans2 removed their assignment Feb 1, 2022
@mattahrens mattahrens added the P0 Must have for release label Apr 27, 2022
pxLi added a commit to pxLi/spark-rapids that referenced this issue May 12, 2022
@mattahrens mattahrens removed the P0 Must have for release label Aug 7, 2023
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <[email protected]>

5 participants