[BUG]Rapids Accelerator (0.2) failing to read csv file on Databricks 7.0 ML GPU Runtime #1322
Comments
This looks like spark.io.compression.codec is set to zstd. If Databricks sets this, I would expect it to just work; if the setup scripts you are using set it, we should not be doing that. The code that is erroring isn't even in the plugin.
The issue is that the underlying library/jar is either missing or incompatible with what is expected:
When I start a Databricks 7.0 ML cluster, I don't see spark.io.compression.codec being set, so are you setting it?
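To answer the question above from a notebook, one option is to read the codec from the cluster's Spark configuration. This is a minimal sketch: the helper name effective_codec is hypothetical, and it assumes the configuration is passed in as the (key, value) pairs returned by SparkConf.getAll(). The fallback of "lz4" is Spark's documented default for this setting.

```python
def effective_codec(conf_pairs, default="lz4"):
    """Return the value of spark.io.compression.codec from (key, value) pairs.

    conf_pairs has the shape returned by SparkConf.getAll(). When the key is
    absent, fall back to `default` (Spark's documented default is "lz4").
    """
    return dict(conf_pairs).get("spark.io.compression.codec", default)


# In a notebook attached to the cluster (assumes `spark` is the active session):
#   print(effective_codec(spark.sparkContext.getConf().getAll()))
```

If this prints zstd on a freshly started cluster, the value comes from the cluster configuration or an init script rather than from the notebook itself.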
Do we still need to fix this?
I agree that it is probably not an issue any more. We support zstd, and we no longer support Databricks 7.0 ML. I mostly want to be sure that we are testing zstd on Databricks with CSV. Even if we just manually verify it works once, that is good enough.
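A one-off manual check along those lines could look roughly like the sketch below. The helper names (zstd_check_settings, missing_settings, csv_row_count) are hypothetical, the exact-match comparison assumes spark.plugins lists only the RAPIDS plugin, and the codec is assumed to be set in the cluster's Spark config before startup, since it cannot be changed on a running session.

```python
def zstd_check_settings():
    """Settings assumed to be present on the cluster for this check."""
    return {
        "spark.io.compression.codec": "zstd",
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
    }


def missing_settings(conf_pairs, expected=None):
    """Return the expected settings that are absent or differ on the cluster.

    conf_pairs has the shape returned by SparkConf.getAll().
    """
    expected = zstd_check_settings() if expected is None else expected
    actual = dict(conf_pairs)
    return {k: v for k, v in expected.items() if actual.get(k) != v}


def csv_row_count(spark, path, schema):
    """Read a pipe-delimited CSV the same way the issue does and count rows."""
    return (spark.read.format("csv")
            .option("nullValue", "")
            .option("header", "false")
            .option("delimiter", "|")
            .schema(schema)
            .load(path)
            .count())


# In a notebook (assumes `spark`, a CSV path, and a matching schema exist):
#   assert not missing_settings(spark.sparkContext.getConf().getAll())
#   print(csv_row_count(spark, orig_acq_path, _csv_acq_schema))
```

Comparing the count against the same read on a CPU cluster would confirm that the GPU scan returns the full data set.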
…IDIA#1322) Signed-off-by: spark-rapids automation <[email protected]>
Describe the bug
Customer is replicating the Mortgage ETL query in a Databricks environment. Data is read from S3. The same S3 data is read using a CPU cluster and it works. GPU scan fails.
It is failing on the first read:
acq = read_acq_csv(spark, orig_acq_path)

def read_acq_csv(spark, path):
    # Parentheses let the chained reader calls span multiple lines.
    return (spark.read.format('csv')
            .option('nullValue', '')
            .option('header', 'false')
            .option('delimiter', '|')
            .schema(_csv_acq_schema)
            .load(path)
            .withColumn('quarter', _get_quarter_from_csv_file_name()))
Steps/Code to reproduce bug
The cluster uses p3.2xlarge (v100) for driver and executor.
Rapids Accelerator (0.2), Databricks 7.0ML GPU Runtime
Expected behavior
Both logs are attached: the CPU cluster log (working-log4j_cpu.log) and the GPU cluster log (log4j_gpu_databricks7.0.log).
Environment details (please complete the following information)
Additional context
log4j_gpu_databricks7.0.log
working-log4j_cpu.log