
[QST]The Tpcx_bb query#5,#16,#21,#22 on GPU are slower than CPU #697

Closed
YeahNew opened this issue Sep 9, 2020 · 10 comments
Labels
question Further information is requested

Comments

@YeahNew

YeahNew commented Sep 9, 2020

What is your question?
After generating the dataset with tpcx-bb-tools-1.3.1, I used TpcxbbLikeBench.scala [located in spark-rapids/tree/branch-0.2/integration_tests/src/main/scala/com/nvidia/spark/rapids/tests/tpcxbb] to execute queries #5, #16, #21, and #22. Strangely, it takes longer to execute these queries on the GPU [the queries are faster on the CPU].
As shown in the pictures below (I'm very sorry that the computer in my lab cannot take a screenshot):
picture1
Through the web UI, you can see that when the queries were executed on the GPU, these applications were indeed using GPU resources.
picture2
picture3
picture4
The test results obviously do not match tpcxbb-like-results.png located in "spark-rapids/docs/img", so I want to know how you got the results in that picture (tpcxbb-like-results.png). I don't know whether some parameters are missing. Can you share the script you submitted?
Below is my submission script:
#!/bin/bash
/home/yexin/bigData/spark-3.0.0-bin-hadoop3.2/bin/spark-submit \
--class org.hik.TPCXBB \
--master spark://10.3.68.116:7077 \
--executor-memory 40g \
--total-executor-cores 36 \
--executor-cores 4 \
--jars '/opt/sparkRapidsPlugin/cudf-0.14-cuda10-2.jar,/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar' \
--conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/cudf-0.14-cuda10-2.jar:/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar \
--conf spark.driver.extraClassPath=/opt/sparkRapidsPlugin/cudf-0.14-cuda10-2.jar:/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.incompatibleOps.enabled=true \
--conf spark.driver.memory=10G \
--conf spark.executor.memory=15G \
--conf spark.rapids.memory.gpu.pooling.enabled=true \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.rapids.memory.gpu.allocFraction=0.8 \
--conf spark.rapids.sql.explain=ALL \
--conf spark.rapids.sql.enabled=true \
--conf spark.rapids.sql.concurrentGpuTasks=8 \
--conf spark.sql.shuffle.partitions=200 \
--conf spark.task.resource.gpu.amount=0.25 \
--conf spark.shuffle.reduceLocality.enabled=false \
--conf spark.rapids.sql.batchSizeBytes=2147483647 \
--conf spark.locality.wait=0s \
--conf spark.executor.extraJavaOptions="-Dai.rapids.cudf.prefer-pinned=true" \
--conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
/home/yexin/bigdata/runjars/Query5.jar

After the script is submitted, there are no errors in the console output.
Through the command nvidia-smi, I can see that the GPU Memory-Usage is clearly non-zero, but the Volatile GPU-Util is always 0.
These questions confuse me; could you help me solve them?
If you are willing to send me some documents, please send them to this mailbox: [email protected]
Thanks~

@YeahNew YeahNew added ? - Needs Triage Need team to review and classify question Further information is requested labels Sep 9, 2020
@jlowe
Contributor

jlowe commented Sep 9, 2020

There are a lot of possibilities as to why it is running slower, and we'll need more information to help. Some initial questions:

  • How many concurrent CPU cores were utilized in the CPU-only run?
  • How many concurrent CPU cores and GPUs were utilized in the GPU run?
  • What TPCx-BB scale factor is the dataset, and what format is the data in (e.g.: Parquet/ORC, were decimals converted to doubles)?
  • Does the SQL UI show the queries being all on the GPU or are some operations being performed on the CPU with ColumnarToRow and RowToColumnar operations as transitions? The RAPIDS Accelerator explain output log (which is enabled via the spark.rapids.sql.explain=ALL setting already shown above) will detail what operations were not placed on the GPU and the reason(s) why.

I want to know how you got the results in that picture (tpcxbb-like-results.png)

The results shown in that image are from running the TpcxbbLikeSpark queries on two DGX-2 machines with the input and intermediate storage systems on fast NVMe drives. Without sufficiently fast I/O the query will become I/O bound before the GPU is fully utilized. Some of the queries are only the ETL portions of the original TPCx-BB query (e.g.: query 5 also includes logistic regression which is not included in TpcxbbLikeSpark), and the data setup in TpcxbbLikeSpark translates decimals to doubles when converting the CSV to Parquet or ORC.

The Tuning Guide has tips on tuning the RAPIDS Accelerator. One item notably missing from the set of configs above is pinned memory. Having at least some pinned memory (e.g.: between 2g and 8g) will significantly increase performance. You can also try reducing the shuffle partitions and other tips discussed in the tuning guide.
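For illustration, a minimal Scala sketch of what enabling a pinned memory pool could look like when building the session; the app name and the 4g pool size are assumptions for this sketch (pick a size in the 2g to 8g range mentioned above), and the same setting can equally be passed as another --conf line on spark-submit:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: reserve a pool of pinned (page-locked) host memory so
// host<->GPU transfers avoid pageable-memory copies. The 4g value is
// illustrative, not a recommendation for this specific cluster.
val spark = SparkSession.builder()
  .appName("tpcxbb-gpu") // hypothetical application name
  .config("spark.rapids.memory.pinnedPool.size", "4g")
  .getOrCreate()
```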

@chenrui17

  • were decimals converted to doubles)?

I have a question: do different data types have a big impact on performance, like double or decimal? To test TPC-DS, when I generated the TPC-DS dataset I set useDoubleForDecimal=true. If decimal types are supported in the future, will performance be improved?

@jlowe
Contributor

jlowe commented Sep 10, 2020

do different data types have a big impact on performance, like double or decimal?

Yes, it can have a very significant impact on performance. The RAPIDS Accelerator currently does not support Spark's DecimalType, so every operation in the query that needs to deal with that type must run on the CPU instead of the GPU. (Other parts of the query that do not have decimals can still run on the GPU, of course.)

if decimal types are supported in the future, will performance be improved?

Yes, operations that need to deal directly with DecimalType may be eligible for GPU acceleration whereas they cannot today. The libcudf team is actively working on adding support for decimals that can fit in 64-bits, and we plan on adding DecimalType support in the plugin shortly after that functionality is completed.

Note if you are already removing all decimals from the inputs (e.g.: via useDoubleForDecimal in TPC-DS data generation) then there will be no performance change for queries with no decimals anywhere within them. However, I believe some of the TPC-DS queries cast to DecimalType during the query. Those queries will likely perform better when the RAPIDS Accelerator plugin can support those operations on the GPU directly.
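To make the data-preparation workaround concrete, here is a hedged sketch in plain Spark (not the plugin's API) that casts every DecimalType column to DoubleType, which is effectively what useDoubleForDecimal achieves at generation time; the helper name is made up for this example:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, DoubleType}

// Hypothetical helper: replace each decimal column with a double column
// of the same name so those operations can stay on the GPU.
def decimalsToDoubles(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (cur, field) =>
    field.dataType match {
      case _: DecimalType => cur.withColumn(field.name, col(field.name).cast(DoubleType))
      case _              => cur
    }
  }
```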

@YeahNew
Author

YeahNew commented Sep 11, 2020

There are a lot of possibilities as to why it is running slower, and we'll need more information to help. Some initial questions:

* How many concurrent CPU cores were utilized in the CPU-only run?

* How many concurrent CPU cores and GPUs were utilized in the GPU run?

* What TPCx-BB scale factor is the dataset, and what format is the data in (e.g.: Parquet/ORC, were decimals converted to doubles)?

* Does the SQL UI show the queries being all on the GPU or are some operations being performed on the CPU with `ColumnarToRow` and `RowToColumnar` operations as transitions?  The RAPIDS Accelerator explain output log (which is enabled via the `spark.rapids.sql.explain=ALL` setting already shown above) will detail what operations were not placed on the GPU and the reason(s) why.


Hi, we have a Spark cluster composed of three nodes. 36 concurrent CPU cores were utilized in the CPU-only run (by setting --total-executor-cores=36 and --conf spark.task.cpus=2).
36 concurrent CPU cores and 9 GPUs across the three nodes (each node has 6 GPUs) were utilized in the GPU run.
This can be seen in picture1 above.
The TPCx-BB scale factor is 2 (it generates a 2GB dataset). The input file type is CSV, and the data contains both decimals and integers, so decimals may be converted to doubles.
Yes, both the SQL UI and the explain output log show that the queries run entirely on the GPU.
log1
log2
log3
Then I tried setting total-executor-cores to 12, 16, and 20. I found that the total execution time dropped to 23s-26s (previously it was 45s-52s), but it is still longer than the 21s on the CPU. Why?
The command console also shows that "Adding task set XX with XX tasks" takes a long time each time. Why is that?
You are so nice, thank you for your attention. Will you continue to help me?

@jlowe
Contributor

jlowe commented Sep 12, 2020

36 concurrent CPU cores were utilized in the CPU-only run(by setting: --total-executor-cores=36, --conf spark.task.cpus=2).

That implies the concurrency of your cluster is actually only 18 tasks at a time instead of 36 since you're specifying each task requires 2 CPU cores.

The TPCx-BB scale factor is 2 (it generates a 2GB dataset).

This is a particularly small dataset, probably too small to be effective on GPUs. GPUs are not well suited for very small amounts of data. Note that the scale factor refers to the approximate size of the entire data set, not the amount of data that will be processed by any one query against that dataset. Often queries will hit only a small fraction of that dataset, and the first thing they'll do from there is filter the data down even further before it gets to significant processing like groupby aggregates or joins. I would recommend trying this with a 100G dataset or larger.

The input file type is CSV

One of the first things the TPCx-BB benchmark does is perform a database load of the CSV data into Parquet, ORC or some other columnar format that the queries are then run against. The problem with using CSV for your main dataset to query is that you'll likely be mostly I/O bound because CSV forces the entire table data to be loaded even if the query only wants to see a few columns from the table. Columnar formats such as Parquet or ORC enable loading only the data associated with the columns being accessed by the query, drastically lowering the I/O requirements for a typical query. That places more of the performance of the query in the computation rather than I/O, which is where the GPU can shine.

I recommend transcoding the data from CSV to Parquet before running the query. Note that the GPU can write Parquet data often much faster than the CPU, so I wouldn't be surprised if you see a nice speedup relative to the CPU just during the transcoding from CSV to Parquet (given a non-trivial amount of data to transcode). If you're already using the TpcxbbLikeSpark class, see the csvToParquet conversion function which will be helpful here.
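As a rough illustration of that transcoding step (the paths and the separator below are assumptions for this sketch; the actual csvToParquet helper in TpcxbbLikeSpark applies the proper TPCx-BB table schemas for you):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Illustrative only: read one delimited-text table and rewrite it as
// Parquet so later queries load just the columns they touch.
val df = spark.read
  .option("sep", "|")            // assumed delimiter for generated data
  .option("inferSchema", "true")
  .csv("hdfs:///tpcxbb/csv/store_sales") // hypothetical input path

df.write.mode("overwrite").parquet("hdfs:///tpcxbb/parquet/store_sales")
```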

Then I tried to modify total-executor-cores=12,16,20. I found that the total execution time was reduced to 23s-26s (previously it was 45s-52s). But it is still longer than 21s under CPU, why?

When using the Spark built-in shuffle, shuffle compression will still be performed by the CPU. Having fewer cores available to the GPU query than the CPU query can hurt performance as a result. Given a sufficient speedup in a query you can get away with running significantly fewer total cores in the system than the CPU version, but when using Spark's built-in shuffle it can be harmful if the query has a significant amount of data to shuffle (and thus process through the shuffle compression codec).

The command console shows that it will take a long time to execute “Adding task set XX with XX tasks” each time, why?

There could be a number of reasons. Is your driver running with sufficient resources (e.g.: has at least a couple of free CPU cores dedicated to it, is not garbage collecting due to insufficient heap size, etc.)? It may be related to the relative speed at which stages are being executed as well.

Also make sure you enable a pinned memory pool as I mentioned earlier. It can have a significant effect on performance.

@YeahNew
Author

YeahNew commented Sep 15, 2020

@jlowe Yes, I have enabled the pinned memory pool. I followed your instructions to convert CSV to Parquet, but the running time was longer than before. The resources in the cluster are sufficient.
I later compared the execution times of jobs and stages on the CPU and GPU in the history web UI and found that GPU execution is much slower in the broadcast exchange jobs: the task deserialization time is longer than on the CPU. From which angle should I analyze this?
Thanks~

@abellina
Collaborator

I followed your instructions to convert CSV to Parquet, but the running time was longer than before. The resources in the cluster are sufficient.

Just to be clear, @YeahNew: the timings you are seeing in your queries now start from Parquet, right? I.e., the csvToParquet function isn't included in the timings.

The TPCx-BB scale factor is 2 (it generates a 2GB dataset).

I see Jason was wondering about a 100GB dataset. Are you still using 2GB in your case? It would be helpful to know what was tested since last time.

I later compared the execution times of jobs and stages on the CPU and GPU in the history web UI and found that GPU execution is much slower in the broadcast exchange jobs: the task deserialization time is longer than on the CPU. From which angle should I analyze this?

Interesting. The broadcasts you mention may be showing up in odd places with smaller datasets; this seems like something we need to investigate on our end (e.g. run with the same settings you did and see if we can reproduce).

@YeahNew
Author

YeahNew commented Sep 15, 2020

@abellina, thank you for your help. Yes, the csvToParquet function isn't included in the timings. I did not use the 100GB dataset, but even with a 20GB dataset the GPU is still slower than the CPU.
I noticed that the DAGs on the GPU are more complicated than on the CPU. Is it possible that the serialized task submitted to the executor is deserialized by the CPU and then passed to the GPU for execution? Perhaps these two reasons make the task deserialization time longer than on the CPU?

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Sep 15, 2020
@YeahNew
Author

YeahNew commented Sep 28, 2020

Thank you very much for your help; the problem has been resolved. The test results show that the GPU acceleration effect is very significant.

@YeahNew YeahNew closed this as completed Sep 28, 2020
@chenrui17

Thank you very much for your help; the problem has been resolved. The test results show that the GPU acceleration effect is very significant.

Can you show me some test results compared to the CPU? Thanks a lot.

@NVIDIA NVIDIA locked and limited conversation to collaborators Apr 28, 2022
@sameerz sameerz converted this issue into discussion #5390 Apr 28, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
