[BUG] Slow/no progress with cascaded pandas udfs/mapInPandas in Databricks #10770
Could you first try increasing the value of "concurrentGpuTask" to see if we get any better perf?
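For reference, a minimal sketch of how that setting could be raised from a notebook; the config key `spark.rapids.sql.concurrentGpuTasks` and the value used here are assumptions, not confirmed in this thread:

```python
# Minimal sketch (assumption): raise the RAPIDS concurrent GPU task limit.
# The key spark.rapids.sql.concurrentGpuTasks and the value 4 are assumptions
# based on the "concurrentGpuTask" suggestion above; adjust for your setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.rapids.sql.concurrentGpuTasks", "4")
```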
The computation gets pretty much stuck with essentially no progress, so I don't think that will make a difference. Partial stack trace after reaching this point (might be from a similar but not identical example to this repro):
Hi @eordentlich, where can I get the file
Another thing to try is to set the
It's a public S3 bucket/file. Can you access it via the Spark parquet reader or the S3 CLI?
I tried to reproduce this locally, but I always get the error below; it seems there is something I missed.
Set the anonymous credentials provider, e.g.:

hdfs dfs \
  -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider -ls \
  s3a://spark-rapids-ml-bm-datasets-public/pca/1m_3k_singlecol_float32_50_files.parquet/

or via the aws cli:

aws s3 --no-sign-request ls \
  s3://spark-rapids-ml-bm-datasets-public/pca/1m_3k_singlecol_float32_50_files.parquet/
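A sketch of the Spark parquet-reader route mentioned above, assuming the same anonymous S3A credentials provider as the commands; the Hadoop-conf call is a common pattern and is an assumption here, not something confirmed in this thread:

```python
# Sketch: read the public dataset with the Spark parquet reader, using the
# same anonymous S3A credentials provider as in the hdfs/aws commands above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Note: _jsc is internal API; setting the provider on the Hadoop conf is one
# common way to enable anonymous S3A access from an existing session.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
)
df = spark.read.parquet(
    "s3a://spark-rapids-ml-bm-datasets-public/pca/1m_3k_singlecol_float32_50_files.parquet/"
)
df.printSchema()
```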
I cannot reproduce this locally.
Describe the bug
Successively applied Pandas UDFs and MapInPandas make no progress in Databricks.
Steps/Code to reproduce bug
The following then is problematic.
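As an illustration of the pattern described, here is a minimal hedged sketch (not the original repro code) of a pandas UDF cascaded with mapInPandas over the dataset referenced in the comments; the column handling, print statements, and final write action are assumptions:

```python
# Minimal hedged sketch (not the original repro): a pandas UDF whose output
# is fed to a mapInPandas stage. Column handling, prints, and the final
# write action are assumptions for illustration.
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet(
    "s3a://spark-rapids-ml-bm-datasets-public/pca/1m_3k_singlecol_float32_50_files.parquet/"
)
col_name = df.columns[0]  # assumed: a single array<float> feature column

@pandas_udf("array<float>")
def first_stage(s: pd.Series) -> pd.Series:
    # First cascaded stage: placeholder per-batch transform (identity here).
    print(f"first stage batch of {len(s)} rows")
    return s

def second_stage(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Second cascaded stage: batches are expected to stream through
    # incrementally rather than waiting for the first stage to finish.
    for pdf in batches:
        print(f"second stage batch of {len(pdf)} rows")
        yield pdf

out = df.select(first_stage(df[col_name]).alias(col_name))
out = out.mapInPandas(second_stage, schema=out.schema)
out.write.mode("overwrite").parquet("/tmp/cascaded_udf_out")  # output path is an assumption
```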
In this case, at least on Databricks Runtime 13.3 ML, the computation slows dramatically and may be deadlocked.
Expected behavior
No slowdowns, as with baseline Spark without the plugin.
Environment details (please complete the following information)
Cluster shape: 2 g5.2xlarge workers and a g4dn.xlarge driver
Additional context
Also, based on print-statement output in the logs, the first UDF appears to complete fully before the second one starts. The batches should flow through both Python UDFs incrementally, as is the case with baseline Spark.
Might be related to: #10751