This issue was moved to a discussion.

[QST] - Spark3 question #4828

Closed
eyalhir74 opened this issue Feb 19, 2022 · 7 comments

Comments

@eyalhir74

eyalhir74 commented Feb 19, 2022

I'm trying to run some queries on big data. I took a portion of our data (only 43GB) and tested a query with 15 fields in two scenarios:

  • 24 CPU cores with 200 files, up to 400MB per file
  • X CPU cores with one V100 GPU and 10 files, each about 4+GB, as per the tuning guide suggestions

The GPU is mostly idle, and the GPU run is much slower than the CPU run. Running Spark on the GPU with the 400MB files is slow as well.

I'm using the following command to run the GPU code:
$SPARK_HOME/bin/spark-shell --master "local[10]" --driver-memory 50g --conf spark.locality.wait=0s --conf spark.rapids.memory.pinnedPool.size=30G --conf spark.sql.files.maxPartitionBytes=256m --conf spark.rapids.sql.concurrentGpuTasks=2 --conf spark.plugins=com.nvidia.spark.SQLPlugin --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}

Changing maxPartitionBytes, concurrentGpuTasks, or any other parameter doesn't seem to have any effect.
As far as I can see, most of the time neither the network I/O nor the GPU is busy.
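
As a rough illustration of why the file layout alone may not change the amount of work per task: Spark's file source caps each input split at spark.sql.files.maxPartitionBytes, so with 256m both the 400MB files and the 4+GB files are cut into splits of at most 256 MiB. The sketch below mirrors that sizing rule in plain Python. It assumes a splittable format such as Parquet (the thread does not state the format), the default 4 MiB spark.sql.files.openCostInBytes, and a hypothetical layout of 10 equally sized 4.3 GiB files. The function names are illustrative and are not part of any Spark API.

```python
# Sketch of Spark's input split sizing for file sources, under stated assumptions.
MIB = 1024 * 1024
GIB = 1024 * MIB


def max_split_bytes(file_sizes, max_partition_bytes, default_parallelism,
                    open_cost_bytes=4 * MIB):
    """Upper bound on a single split, following Spark's sizing rule:
    min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))."""
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


def estimate_split_count(file_sizes, split_bytes):
    """Number of splits if every file is splittable (ceiling division per file).
    Spark may later pack small leftover splits into one partition."""
    return sum(-(-size // split_bytes) for size in file_sizes)


# Hypothetical layout approximating the GPU scenario: 10 files of about 4.3 GiB.
gpu_files = [int(4.3 * GIB)] * 10

# local[10] gives a default parallelism of 10.
split_256m = max_split_bytes(gpu_files, 256 * MIB, default_parallelism=10)
count_256m = estimate_split_count(gpu_files, split_256m)

split_1g = max_split_bytes(gpu_files, 1 * GIB, default_parallelism=10)
count_1g = estimate_split_count(gpu_files, split_1g)

print(split_256m // MIB, count_256m)
print(split_1g // MIB, count_1g)
```

Under these assumptions the split size is bounded by the configured maximum rather than by the file size, which is one reason merging files may show little effect while maxPartitionBytes stays at 256m.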

Any ideas would be highly appreciated.

@eyalhir74 eyalhir74 added ? - Needs Triage Need team to review and classify question Further information is requested labels Feb 19, 2022
@jlowe
Contributor

jlowe commented Feb 22, 2022

With the GPU being mostly idle, I'm wondering about two possibilities:

  • is the entire query eligible to run on the GPU? There are costs to transitioning between CPU and GPU, and this could potentially cause some of the slowdown
  • is the query mostly bound by the filesystem read?

To answer the first question, you could run with the config spark.rapids.sql.explain set to true, and then you should see log messages for any portions of queries that are not on the GPU (and why they're not on the GPU). Depending on how many rows are being processed by nodes not on the GPU, it could contribute substantially to the slowdown you're seeing. Also if there are portions of the query not running on the GPU then the reduced parallelism of the GPU cluster (10 cores vs. 24) will impact the query performance.

If the query is dominated by filesystem access, then running the query with less than half of the CPU cores (10 vs. 24) could significantly slow down the GPU run. Fetching the raw data (as opposed to decoding the data) is still processed by the CPU, so this could be a significant contributor to the slowdown in comparison. To help answer this question, you could try running with more CPU cores for your GPU-configured setup and see how it impacts the query. Separately, you could use the Spark SQL web UI to examine the graphical query plan and see if the bufferTime metric for the GpuFileSourceScanExec or BatchScanExec is significantly higher than the gpuDecodeTime. The former metric is how long the tasks spent reading the raw data from the filesystem, while the latter reflects how much time the tasks spent waiting for the GPU to decode the raw data after it was fetched from the filesystem.
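
The comparison of the two scan metrics can be sketched as a small helper. This is only an illustration: the metric values are read by hand from the Spark SQL web UI, the numbers below are made up, and the 2x threshold is an arbitrary assumption rather than anything defined by the RAPIDS Accelerator.

```python
def classify_scan(buffer_time_s, gpu_decode_time_s, ratio=2.0):
    """Label a scan node by which metric dominates.

    buffer_time_s: bufferTime (time spent reading raw data from the filesystem).
    gpu_decode_time_s: gpuDecodeTime (time spent waiting for the GPU decode).
    ratio: illustrative threshold only, not a RAPIDS setting.
    """
    if buffer_time_s < 0 or gpu_decode_time_s < 0:
        raise ValueError("metric values must be non-negative")
    if buffer_time_s > ratio * gpu_decode_time_s:
        return "filesystem-bound"
    if gpu_decode_time_s > ratio * buffer_time_s:
        return "decode-bound"
    return "balanced"


# Hypothetical values copied from the web UI, in seconds.
example = classify_scan(buffer_time_s=540.0, gpu_decode_time_s=35.0)
print(example)
```

A "filesystem-bound" result would point toward the CPU core count and storage throughput rather than GPU settings such as concurrentGpuTasks.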

@viadea
Collaborator

viadea commented Feb 22, 2022

To help with the first possibility mentioned by @jlowe, we have a workload qualification doc here:
https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-workload-qualification.html
Since you already have a GPU Spark environment, you can refer to option #3 in that doc.

Set spark.rapids.sql.explain=all, then check the Spark driver log for any CPU fallback messages.

@jlowe jlowe self-assigned this Feb 22, 2022
@eyalhir74
Author

Wow, that's a ton of information. Thank you both!
I am first trying to create bigger input files (as suggested in the tuning guide). Currently I have files of 300-500MB and am trying to merge them into bigger files.
Once this is done, I'll explore all the tips you've mentioned and report back.

Thanks!

@eyalhir74
Author

eyalhir74 commented Mar 1, 2022

@viadea Thanks for the input, very helpful :)
I have a huge dataset and huge amounts of data to process; it seems this is still a bit challenging with RAPIDS.

I've added the following as per the comments in the explain output:
--conf spark.rapids.sql.explain=all --conf spark.rapids.sql.variableFloatAgg.enabled=true --conf spark.rapids.sql.castDecimalToFloat.enabled=true --conf spark.rapids.sql.incompatibleOps.enabled=true

As far as I can tell, these are the remaining issues preventing the query from running entirely on the GPU:
`!Exec cannot run on GPU because ArrayTypes or MapTypes in grouping expressions are not supported

        !Exec <ShuffleExchangeExec> cannot run on GPU because not all partitioning can be replaced; Columnar exchange without columnar children is inefficient

          !Partitioning <HashPartitioning> cannot run on GPU because hash_key expression AttributeReference sort_array(InfoList#1123, true)#2078 (ArrayType(StructType(StructField(experimentId,LongType,true), StructField(experimentLayerTemplateId,LongType,true), StructField(experimentVariantId,LongType,true)),true) is not supported); hash_key expression AttributeReference CASE WHEN isnull(map_keys(pv_supplyFeaturesStat#1134)) THEN null ELSE array_intersect(sort_array(map_keys(pv_supplyFeaturesStat#1134), true), [READ_MORE,EXPLORE_MORE,TABOOLA_REMINDER,NEXT_UP]) END#2080 (ArrayType(StringType,false) is not supported)

          !Exec <HashAggregateExec> cannot run on GPU because not all expressions can be replaced; ArrayTypes or MapTypes in grouping expressions are not supported

              !Expression <SortArray> sort_array(InfoList#1123, true) cannot run on GPU because expression SortArray sort_array(InfoList#1123, true) produces an unsupported type ArrayType(StructType(StructField(experimentId,LongType,true), StructField(experimentLayerTemplateId,LongType,true), StructField(experimentVariantId,LongType,true)),true); array expression AttributeReference InfoList#1123 (child StructType(StructField(experimentId,LongType,true), StructField(experimentLayerTemplateId,LongType,true), StructField(experimentVariantId,LongType,true)) is not supported)
                @Expression <AttributeReference> InfoList#1123 could run on GPU
                @Expression <Literal> true could run on GPU

                !NOT_FOUND <ArrayIntersect> array_intersect(sort_array(map_keys(pv_supplyFeaturesStat#1134), true), [READ_MORE,EXPLORE_MORE,TABOOLA_REMINDER,NEXT_UP]) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.ArrayIntersect could be found

`
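
Explain output like the above gets long quickly, so it can help to pull out only the fallback lines. The following is a minimal sketch that assumes the line shape seen in this thread (a leading "!", a node kind, an optional <ClassName>, and the phrase "cannot run on GPU because"). The sample lines are abbreviated from the output above, and the function name is illustrative rather than part of the RAPIDS Accelerator.

```python
import re
from collections import Counter

# Matches lines such as:
#   !Exec <ShuffleExchangeExec> cannot run on GPU because <reason>
FALLBACK = re.compile(
    r"^\s*!(?P<kind>\w+)\s+(?:<(?P<name>\w+)>\s+)?.*?"
    r"cannot run on GPU because\s+(?P<reason>.*)$"
)


def summarize_fallbacks(log_text):
    """Return (kind, name, reason) for every fallback line in the log text."""
    found = []
    for line in log_text.splitlines():
        match = FALLBACK.match(line)
        if match:
            name = match.group("name") or "?"
            found.append((match.group("kind"), name, match.group("reason").strip()))
    return found


# Sample lines abbreviated from the explain output in this thread.
sample = """\
!Exec <ShuffleExchangeExec> cannot run on GPU because not all partitioning can be replaced
  !Exec <HashAggregateExec> cannot run on GPU because not all expressions can be replaced
    @Expression <Literal> true could run on GPU
    !NOT_FOUND <ArrayIntersect> array_intersect(...) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.ArrayIntersect could be found
"""

fallbacks = summarize_fallbacks(sample)
by_kind = Counter(kind for kind, _, _ in fallbacks)

for kind, name, reason in fallbacks:
    print(f"{kind:10s} {name:22s} {reason[:60]}")
print(dict(by_kind))
```

Lines that start with "@" (nodes that could run on the GPU) are skipped, so the summary lists only the operators and expressions that force a CPU fallback.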

Is there anything further I can try to make it run on the GPU?
The Spark query also gets killed; I'll have a look at this as well.

thanks
Eyal

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Mar 1, 2022
@jlowe
Contributor

jlowe commented Mar 1, 2022

The RAPIDS Accelerator does not currently support hash partitioning on ArrayType, nor does it support sort_array on ArrayType. #3715 tracks sort_array, and I've filed #4887 to track adding support for GPU hashing of ArrayType.

@eyalhir74
Author

@jlowe I've updated #4900 with all the missing ops I've encountered so far.

@jlowe
Contributor

jlowe commented Mar 8, 2022

Are there further questions for this issue, or is it covered by the other issues?

@NVIDIA NVIDIA locked and limited conversation to collaborators Apr 27, 2022
@sameerz sameerz converted this issue into discussion #5335 Apr 27, 2022
