[QST] There are two questions about TPCxBB Like query results in README.md #1212
Comments
Our results in the README include IO and network bottlenecks; they are wall-clock times. These are 2 DGX-2 machines, so they have very fast NVLink, and each GPU shares a 100Gb/sec RoCE connection with one other GPU. Additionally, the files are stored locally as Parquet on the DGX-2 RAID. The shuffle was done using UCX in this scenario, so we used both NVLink and RoCE to send the shuffle data; to some degree we are minimizing the IO/network bottleneck and trying to feed the GPUs as much as we can. We ran at 10TB with generously sized partitions (32 shuffle partitions most of the time), not the 200-partition default in Spark.
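For reference, a minimal sketch of the kind of settings described above (the shuffle-manager class name is shimmed per Spark version, and the values here are illustrative rather than the exact command we ran):

```shell
# Illustrative settings only; the RapidsShuffleManager class is specific to
# the Spark version the plugin was built against, and the UCX transport has
# additional plugin configs not shown here.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark300.RapidsShuffleManager \
  --conf spark.sql.shuffle.partitions=32 \
  ...
```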
We chose the queries due to shuffle size and support on the GPU (at the time). We could show a more accurate picture of where we are heading by choosing queries that do a lot of shuffle, to stress the IO components we are developing, and with most of the query covered on the GPU we can show what the plugin can do as we add more coverage. Our coverage has improved significantly since then, and we are continuously looking at other benchmarks (like TPC-DS), so I don't think these 4 queries are the end game in any way, just queries we could use to showcase our plugin.
I am a bit surprised by this. More cores should mean more scheduling opportunities, so your tasks should get CPU time sooner rather than being queued in waves in a restrictive environment. That said, it could be that you are over-committing the CPU and are now seeing negative effects. Some information on what got slower would help. Did you see the shuffle reads taking longer, for example?
Does this mean that with fewer executor cores, IO stopped being as much of an overhead? One thing that could be happening here is more pipelining of IO and compute, since there are fewer shuffle iterators trying to fetch (this goes along with the previous comment on cores above 96 having negative effects). The Spark stage page has a "Timeline" view that may help show what tasks were doing over time. But I have the same question as before: what part of the query got faster?
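If it helps when comparing runs, the driver UI also exposes stage-level metrics (including shuffle read/write sizes and times) through the monitoring REST API; a rough example, with placeholder host, port, and application ID:

```shell
# Placeholder host/port/app-id; adjust for your cluster.
# Dumps per-stage metrics so two runs can be compared stage by stage.
curl http://driver-host:4040/api/v1/applications/<app-id>/stages
```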
Closing, please reopen if you still have questions.
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
What is your question?
I used the spark-rapids plugin to successfully run on the GPU through Spark 3.0, and re-ran the TPCxBB Like queries shown in the README.md. We use a 200GB dataset (scale factor 200), stored as CSV. The processing happened on a three-node cluster. Each node has 64 CPU cores, 365GB host memory, and 6 Tesla T4 GPUs with 16GB of GPU memory each.
I kept total-executors=6 fixed and varied total-executor-cores (6, 12, 24, 48, 72, 96, 128). In this way 6 GPUs are used (because each executor can only be bound to one GPU), and I found that these queries were indeed accelerated in GPU mode. But I noticed some problems while monitoring with iostat -x 1 > io.log and nload eno4.
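For context, a rough sketch of the setup and monitoring described above (flag values are illustrative, not the exact command line):

```shell
# 6 executors, one GPU each, with the core count varied across runs
# (6, 12, 24, 48, 72, 96, 128); other options omitted.
spark-submit \
  --total-executor-cores 96 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  ...

# Host-side monitoring while a query runs:
iostat -x 1 > io.log   # per-device IO utilization, sampled every second
nload eno4             # live throughput on network interface eno4
```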
For Query#5 in GPU mode, the IO and network bandwidth reach their maximum (the so-called bottleneck) and stay there for a long time (maximum IO read/write speed = 500MB/s; maximum network bandwidth = 1000MB/s). The more CPU cores, the longer the bottleneck lasts; when CPU cores > 48, the duration remains almost unchanged. However, in pure CPU mode, IO and network bandwidth only become bottlenecks when CPU cores > 96.
Then I reduced the executors to 3 with total-executor-cores=6. I wanted to eliminate the impact of network and IO bottlenecks and simply compare CPU and GPU computing power. In pure CPU mode, all queries (5, 16, 21, 22) have almost no bottlenecks. In GPU mode, Query#16 and Query#21 still have short-lived bottlenecks.
Memory was sufficient in our experiments, so I have two questions: