
[QST] There are two questions about TPCxBB Like query results in README.md #1212

Closed
YeahNew opened this issue Nov 27, 2020 · 2 comments
Labels: question (Further information is requested)

Comments

@YeahNew

YeahNew commented Nov 27, 2020

What is your question?
I successfully used the GPU through Spark 3.0 with the spark-rapids plugin, and re-ran the TPCxBB Like queries shown in README.md. We used a 200 GB dataset (scale factor 200) stored as CSV. The processing ran on a three-node cluster; each node has 64 CPU cores, 365 GB host memory, and 6 Tesla T4 GPUs with 16 GB of GPU memory each.
I kept total-executors=6 fixed and varied total-executor-cores (6, 12, 24, 48, 72, 96, 128). This way 6 GPUs are used (because each executor can only be bound to one GPU), and these queries were indeed accelerated in GPU mode. But I noticed some problems when monitoring with `iostat -x 1 > io.log` and `nload eno4`.
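For reference, the configuration described above roughly corresponds to a GPU-enabled session along these lines. This is only a sketch: the app name, discovery script path, and the specific core counts are illustrative assumptions, and the rapids-4-spark and cudf jars must also be on the classpath.

```python
# Minimal PySpark sketch of the GPU setup described above (values are illustrative).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tpcxbb-like-gpu")
    # Total cores across the cluster ("total-executor-cores" in the report).
    .config("spark.cores.max", "48")
    .config("spark.executor.cores", "8")
    # One GPU per executor, shared by that executor's concurrent tasks (1/8 each).
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.125")
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/sparkRapidsPlugin/getGpusResources.sh")   # assumed path
    # Enable the spark-rapids plugin.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)
```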
For Query #5 in GPU mode, IO and network bandwidth reach their maximums (the so-called bottleneck) and stay there for a long time (maximum IO read/write speed = 500 MB/s, maximum network bandwidth = 1000 MB/s). The more CPU cores, the longer the bottleneck lasts; above 48 cores the duration stays almost unchanged. In pure CPU mode, by contrast, IO and network bandwidth only become bottlenecks above 96 cores.
Then I reduced the executors to 3 with total-executor-cores=6, to eliminate the impact of network and IO bottlenecks and simply compare CPU and GPU computing power. In pure CPU mode, all queries (5, 16, 21, 22) show almost no bottlenecks. In GPU mode, Query #16 and Query #21 still show short-lived bottlenecks.
In our experiments memory was sufficient, so I have two questions:

  1. Do your results exclude IO and network bottlenecks and simply compare the computational efficiency of CPU and GPU modes?
  2. TPCx-BB has 30 queries in total; why choose queries #5, #16, #21, and #22? Is there anything special about these four queries? Is there any difference between them and TPC-DS?
@YeahNew added the "? - Needs Triage" and "question" labels Nov 27, 2020
@abellina
Collaborator

  1. Do your results exclude IO and network bottlenecks and simply compare the computational efficiency of CPU and GPU modes?

Our results in the README include IO and network bottlenecks; they are wall-clock times. The runs used 2 DGX-2 machines, so they have very fast NVLink, and each GPU shares a 100 Gb/s RoCE connection with one other GPU. Additionally, the files are stored locally as Parquet on the DGX-2 RAID. The shuffle was done using UCX in this scenario, so we used both NVLink and RoCE to move shuffle data; to some degree we are minimizing the IO/network bottleneck and trying to keep the GPUs fed. We ran at 10 TB with generously sized partitions (32 shuffle partitions most of the time), not the 200-partition default in Spark.
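As a point of reference, the partitioning difference mentioned above comes down to one setting, shown here as a sketch using the `spark` session from the earlier sketch:

```python
# 32 is the value quoted above; Spark's out-of-the-box default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "32")
```

The UCX-based shuffle additionally requires configuring the plugin's RapidsShuffleManager; the exact class name depends on the Spark version, so see the plugin's shuffle documentation for the right value.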

  2. TPCx-BB has 30 queries in total; why choose queries #5, #16, #21, and #22? Is there anything special about these four queries? Is there any difference between them and TPC-DS?

We chose the queries due to shuffle size and GPU support (at the time). By choosing queries that do a lot of shuffle we can stress the IO components we are developing, and with most of each query covered on the GPU we can show what the plugin can do as we add more coverage. Our coverage has improved significantly since then, and we are continuously looking at other benchmarks (like TPC-DS), so I don't think these 4 queries are the end game in any way; they were just queries we could use to showcase the plugin.

The more CPU cores, the longer the bottleneck lasts; above 48 cores the duration stays almost unchanged. In pure CPU mode, by contrast, IO and network bandwidth only become bottlenecks above 96 cores.

I am a bit surprised by this. More cores should mean more scheduling opportunities, so your tasks should get CPU time sooner instead of being queued in waves as they would be in a more restrictive environment. That said, you could be over-committing the CPU and now seeing negative effects. Some information on what got slower would help. Did you see the shuffle reads taking longer, for example?
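One way to dig into that is to compare per-stage shuffle metrics between the CPU and GPU runs. Below is a hedged sketch against Spark's monitoring REST API; the history-server host/port and application ID are placeholders, and exact field availability varies somewhat across Spark versions.

```python
import requests

# Placeholders: point these at your history server and application.
BASE = "http://historyserver:18080/api/v1"
APP_ID = "app-20201127000000-0000"

for stage in requests.get(f"{BASE}/applications/{APP_ID}/stages").json():
    # Compare executor run time against shuffle read volume and fetch wait
    # time to see whether the extra cores are mostly stalled on shuffle data.
    print(
        stage["stageId"],
        stage["name"][:40],
        "runTimeMs=", stage.get("executorRunTime"),
        "shuffleReadBytes=", stage.get("shuffleReadBytes"),
        "fetchWaitMs=", stage.get("shuffleFetchWaitTime"),
    )
```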

Then I reduced the executors to 3 with total-executor-cores=6, to eliminate the impact of network and IO bottlenecks and simply compare CPU and GPU computing power. In pure CPU mode, all queries (5, 16, 21, 22) show almost no bottlenecks. In GPU mode, Query #16 and Query #21 still show short-lived bottlenecks.

Does this say that with fewer executor cores, IO stopped being as much of an overhead? One thing that could be happening here is more pipelining of IO and compute, since there are fewer shuffle iterators trying to fetch (this goes along with the previous comment about core counts above 96 having negative effects). The Spark stage page has a "Timeline" view that may help show what tasks were doing over time. But I have the same question as before: what part of the query got faster?
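If event logging is enabled, that timeline (and the stage metrics above) remain available through the history server after the run. A sketch of the relevant settings, with an assumed log directory:

```python
from pyspark.sql import SparkSession

# These must be set when the session is created, before the queries run.
spark = (
    SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-event-logs")  # placeholder path
    .getOrCreate()
)
```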

@sameerz removed the "? - Needs Triage" label Dec 1, 2020
@sameerz
Collaborator

sameerz commented Feb 9, 2021

Closing, please reopen if you still have questions.

@sameerz closed this as completed Feb 9, 2021
@NVIDIA locked and limited conversation to collaborators Apr 28, 2022
@sameerz converted this issue into discussion #5374 Apr 28, 2022

