This issue was moved to a discussion.
[QST] The TPCx-BB queries #5, #16, #21, #22 on GPU are slower than CPU #697
Comments
There are a lot of possibilities as to why it is running slower, and we'll need more information to help. Some initial questions:
The results shown in that image are from running the TpcxbbLikeSpark queries on two DGX-2 machines with the input and intermediate storage systems on fast NVMe drives. Without sufficiently fast I/O the query will become I/O bound before the GPU is fully utilized. Some of the queries are only the ETL portions of the original TPCx-BB query (e.g.: query 5 also includes logistical regression which is not included in The Tuning Guide has tips on tuning the RAPIDS Accelerator. One item notably missing from the set of configs above is pinned memory. Having at least some pinned memory (e.g.: between 2g to 8g) will significantly increase the performance. You can also try reducing the shuffle partitions and other tips discussed in the tuning guide. |
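As a sketch, the pinned-memory and shuffle-partition advice above maps to spark-submit flags like the following; the sizes shown are illustrative, not values recommended in this thread:

```shell
# Give the RAPIDS Accelerator a pinned (page-locked) host memory pool
# and reduce shuffle partitions; tune both for your cluster.
--conf spark.rapids.memory.pinnedPool.size=2g \
--conf spark.sql.shuffle.partitions=48 \
```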
I have a question: do different data types have a big impact on performance, like double or decimal? In order to test TPC-DS, when I generated the TPC-DS data set I set useDoubleForDecimal=true. If decimal types are supported in the future, will performance be improved?
Yes, it can have a very significant impact on performance. The RAPIDS Accelerator currently does not support Spark's decimal type.
Yes, operations that need to deal directly with decimal values will be affected. Note if you are already removing all decimals from the inputs (e.g.: via useDoubleForDecimal=true), adding decimal support would not change those results.
Hi, we have a Spark cluster composed of three nodes. 36 concurrent CPU cores were utilized in the CPU-only run (by setting --total-executor-cores=36 and --conf spark.task.cpus=2).
That implies the concurrency of your cluster is actually only 18 tasks at a time instead of 36 since you're specifying each task requires 2 CPU cores.
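The arithmetic behind that observation can be sketched as follows; the variable names are illustrative, and the values are the ones from this thread's CPU run:

```shell
#!/bin/sh
# Spark schedules floor(total cores / cores per task) tasks at a time,
# so task.cpus=2 halves the effective concurrency of the CPU run.
total_executor_cores=36   # --total-executor-cores
cpus_per_task=2           # spark.task.cpus
echo $((total_executor_cores / cpus_per_task))   # prints 18
```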
This is a particularly small dataset, probably too small to be effective on GPUs. GPUs are not well suited for very small amounts of data. Note that the scale factor refers to the approximate size of the entire data set, not the amount of data that will be processed by any one query against that dataset. Often queries will hit only a small fraction of that dataset, and the first thing they'll do from there is filter the data down even further before it gets to significant processing like groupby aggregates or joins. I would recommend trying this with a 100G dataset or larger.
One of the first things the TPCx-BB benchmark does is perform a database load of the CSV data into Parquet, ORC, or some other columnar format that the queries are then run against. The problem with using CSV for your main dataset to query is that you'll likely be mostly I/O bound, because CSV forces the entire table data to be loaded even if the query only wants to see a few columns from the table. Columnar formats such as Parquet or ORC enable loading only the data for the columns being accessed by the query, drastically lowering the I/O requirements for a typical query. That places more of the performance of the query in the computation rather than I/O, which is where the GPU can shine. I recommend transcoding the data from CSV to Parquet before running the query. Note that the GPU can often write Parquet data much faster than the CPU, so I wouldn't be surprised if you see a nice speedup relative to the CPU just during the transcoding from CSV to Parquet (given a non-trivial amount of data to transcode). If you're already using the
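A one-off transcode like the one suggested above could be sketched as below; the table paths and schema-inference options are assumptions for illustration, not taken from this thread:

```shell
# Read the CSV table and rewrite it as Parquet using spark-shell.
# Replace the paths with your dataset's actual locations.
spark-shell <<'EOF'
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/tpcxbb/csv/store_sales")
df.write.mode("overwrite").parquet("/data/tpcxbb/parquet/store_sales")
EOF
```

The queries would then point at the Parquet copy, so only the columns they touch are read from disk.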
When using the Spark built-in shuffle, shuffle compression will still be performed by the CPU. Having fewer cores available to the GPU query than the CPU query can hurt performance as a result. Given a sufficient speedup in a query you can get away with running significantly fewer total cores in the system than the CPU version, but when using Spark's built-in shuffle it can be harmful if the query has a significant amount of data to shuffle (and thus process through the shuffle compression codec).
There could be a number of reasons. Is your driver running with sufficient resources (e.g.: has at least a couple of free CPU cores dedicated to it, is not garbage collecting due to insufficient heap size, etc.)? It may also be related to the relative speed at which stages are being executed. Also make sure you enable a pinned memory pool as I mentioned earlier; it can have a significant effect on performance.
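To rule out a resource-starved driver, the driver's memory and cores can be pinned explicitly at submit time; this is a sketch with illustrative values, not settings recommended in this thread:

```shell
# Reserve dedicated driver resources so GC pressure or core
# contention on the driver doesn't skew the measurements.
--driver-memory 10g \
--driver-cores 4 \
```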
@jlowe Yes, I have enabled the pinned memory pool. I followed your instructions to convert CSV to Parquet, but the running time was longer than before. The resources in the cluster are sufficient.
Just to be clear, the timings you are seeing in your queries @YeahNew start from Parquet now, right? E.g. the csvToParquet function isn't being included in the timings.

I see Jason was wondering about the 100GB dataset. Are you still using 2GB in your case? It would be helpful to know what was tested since last time.

Interesting, the broadcasts you mention may be showing up at odd places with smaller datasets. This seems like something we need to investigate on our end (e.g. run with the same settings you did and see if we can reproduce).
@abellina, thank you for your help. Yes, the csvToParquet function isn't being included in the timings. I did not use the 100GB dataset, but I used a 20GB dataset, and the GPU is still slower than the CPU.
Thank you very much for your help; the problem has been resolved. The test results show that the acceleration from the GPU is very significant.

Can you show me some test results compared to the CPU? Thanks a lot.
What is your question?
After generating the data set using tpcx-bb-tools-1.3.1, I use TpcxbbLikeBench.scala (located in spark-rapids/tree/branch-0.2/integration_tests/src/main/scala/com/nvidia/spark/rapids/tests/tpcxbb) to execute queries #5, #16, #21, and #22. But strangely, it takes longer to execute these queries on the GPU (the queries are faster on the CPU).
As shown in the picture below (I'm very sorry, but the computer in my lab cannot take a screenshot).
Through the web UI, you can see that when the queries were executed on the GPU, these applications were indeed using GPU resources.
The test result obviously does not match tpcxbb-like-results.png located in spark-rapids/docs/img, so I want to know how you got the result in that picture. I don't know if any parameters are missing. Can you show the script you submitted?
Below is my submission script:
#!/bin/bash
/home/yexin/bigData/spark-3.0.0-bin-hadoop3.2/bin/spark-submit \
  --class org.hik.TPCXBB \
  --master spark://10.3.68.116:7077 \
  --executor-memory 40g \
  --total-executor-cores 36 \
  --executor-cores 4 \
  --jars '/opt/sparkRapidsPlugin/cudf-0.14-cuda10-2.jar,/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar' \
  --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/cudf-0.14-cuda10-2.jar:/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar \
  --conf spark.driver.extraClassPath=/opt/sparkRapidsPlugin/cudf-0.14-cuda10-2.jar:/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.incompatibleOps.enabled=true \
  --conf spark.driver.memory=10G \
  --conf spark.executor.memory=15G \
  --conf spark.rapids.memory.gpu.pooling.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.rapids.memory.gpu.allocFraction=0.8 \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.rapids.sql.concurrentGpuTasks=8 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.task.resource.gpu.amount=0.25 \
  --conf spark.shuffle.reduceLocality.enabled=false \
  --conf spark.rapids.sql.batchSizeBytes=2147483647 \
  --conf spark.locality.wait=0s \
  --conf spark.executor.extraJavaOptions="-Dai.rapids.cudf.prefer-pinned=true" \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
  /home/yexin/bigdata/runjars/Query5.jar
After the script is submitted, there is no error in the output of the command console.
Through the command nvidia-smi, I can see that the Memory-Usage of the GPU is clearly allocated, but the Volatile GPU-Util is always 0.
These questions confuse me; could you help me solve them?
If you are willing to send me some documents, please send them to the mailbox: [email protected]
Thanks~