[QST] There are two questions about TPCxBB Like query results in README.md #1212
Comments
Our results in the README include IO and network bottlenecks; they are wall-clock times. These are 2 DGX-2 machines, so they have very fast NVLink, and each GPU shares a 100Gb/sec RoCE connection with one other GPU. Additionally, the files are stored locally as Parquet on the DGX-2 RAID. The shuffle was done using UCX in this scenario, so we used both NVLink and RoCE to send the shuffle data; to some degree we are minimizing the IO/network bottleneck and trying to feed the GPUs as much as we can. We ran at 10TB with generously sized partitions (32 shuffle partitions most of the time), not the 200-partition default in Spark.
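For reference, a minimal sketch of the kind of settings described above (the shuffle-manager class name is shimmed per Spark version, and the values here are illustrative rather than the exact command we ran):

```shell
# Illustrative settings only; the RapidsShuffleManager class is specific to
# the Spark version the plugin was built against, and the UCX transport has
# additional plugin configs not shown here.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark300.RapidsShuffleManager \
  --conf spark.sql.shuffle.partitions=32 \
  ...
```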
We chose the queries due to shuffle size and support on the GPU (at the time). We could show a more accurate picture of where we are heading by choosing queries that do a lot of shuffle, to stress the IO components we are developing, and with most of the query covered on the GPU we can show what the plugin can do as we add more coverage. Our coverage has improved significantly since then, and we are continuously looking at other benchmarks (like TPC-DS), so I don't think these 4 queries are the end game in any way, just queries we could use to showcase our plugin.
I am a bit surprised by this. More cores should mean more scheduling opportunities, so your tasks should get CPU time sooner rather than being queued in waves in a restrictive environment. That said, it could be that you are over-committing the CPU and are now seeing negative effects. Some information on what got slower would help. Did you see the shuffle reads taking longer, for example?
Does this mean that with fewer executor cores, IO stopped being as much of an overhead? One thing that could be happening here is more pipelining of IO and compute, since there are fewer shuffle iterators trying to fetch (this goes along with the previous comment on cores above 96 having negative effects). The Spark stage page has a "Timeline" view that may help show what tasks were doing over time. But I have the same question as before: what part of the query got faster?
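If it helps when comparing runs, the driver UI also exposes stage-level metrics (including shuffle read/write sizes and times) through the monitoring REST API; a rough example, with placeholder host, port, and application ID:

```shell
# Placeholder host/port/app-id; adjust for your cluster.
# Dumps per-stage metrics so two runs can be compared stage by stage.
curl http://driver-host:4040/api/v1/applications/<app-id>/stages
```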
Closing, please reopen if you still have questions.
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
What is your question?
I used the spark-rapids plugin to successfully run on the GPU through Spark 3.0, and re-ran the TPCxBB Like queries shown in the README.md. We use a 200GB dataset (scale factor 200), stored as CSV. The processing happened on a three-node cluster. Each node has 64 CPU cores, 365GB host memory, and 6 Tesla T4 GPUs with 16GB of GPU memory each.
I kept total-executors=6 fixed and varied total-executor-cores (6, 12, 24, 48, 72, 96, 128). In this way 6 GPUs are used (because each executor can only be bound to one GPU), and I found that these queries were indeed accelerated in GPU mode. But I noticed some problems while monitoring with iostat -x 1 > io.log and nload eno4.
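For context, a rough sketch of the setup and monitoring described above (flag values are illustrative, not the exact command line):

```shell
# 6 executors, one GPU each, with the core count varied across runs
# (6, 12, 24, 48, 72, 96, 128); other options omitted.
spark-submit \
  --total-executor-cores 96 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  ...

# Host-side monitoring while a query runs:
iostat -x 1 > io.log   # per-device IO utilization, sampled every second
nload eno4             # live throughput on network interface eno4
```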
For Query#5 in GPU mode, the IO and network bandwidth reach their maximum (the so-called bottleneck) and stay there for a long time (maximum IO read/write speed = 500MB/s; maximum network bandwidth = 1000MB/s). The more CPU cores, the longer the bottleneck lasts; when CPU cores > 48, the duration remains almost unchanged. However, in pure CPU mode, IO and network bandwidth only become bottlenecks when CPU cores > 96.
Then I reduced the executors to 3 with total-executor-cores=6. I wanted to eliminate the impact of network and IO bottlenecks and simply compare CPU and GPU computing power. In pure CPU mode, all queries (5, 16, 21, 22) have almost no bottlenecks. In GPU mode, Query#16 and Query#21 still have short-lived bottlenecks.
Memory was sufficient in our experiments, so I have two questions: