[FEA] Support outputTimestampType being INT96 #8625
Comments
The only way I can think of to make this work is to have CUDF support us passing it a DECIMAL128 with a scale of 0 that holds the properly scaled value.
There might be another way to support this that doesn't involve sending the full 96 bits or more to libcudf. Spark isn't tracking timestamps as anything more than 64 bits in microseconds, so if we could send libcudf a TIMESTAMP_MICROS column and specify in the write call that we want that column encoded as INT96, maybe libcudf could perform the micros-to-nanos conversion without first overflowing the 64-bit value before writing it as an INT96. We would need to discuss with the libcudf team whether this is an option.
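To make the overflow concrete, here is a minimal pure-Python sketch (an illustration only, not libcudf code) of why scaling microseconds to nanoseconds first can exceed int64, while splitting into days plus nanoseconds-of-day, as the INT96 layout does, keeps every intermediate value in range:

```python
# Illustration of the int64 overflow (Python ints are unbounded, so the
# "overflow" is simulated by comparing against the int64 maximum).
INT64_MAX = 2**63 - 1
MICROS_PER_DAY = 86_400 * 1_000_000

# A far-future timestamp fits comfortably in 64-bit microseconds...
micros = 253_402_300_799_999_999  # ~9999-12-31 23:59:59.999999 UTC
assert micros <= INT64_MAX

# ...but naively scaling the whole value to nanoseconds first overflows int64.
assert micros * 1000 > INT64_MAX

# INT96 stores a day number (int32) plus nanoseconds within that day (int64).
# Splitting before scaling avoids the overflow: nanoseconds-of-day is always
# below 86_400 * 10**9, far under INT64_MAX.
days, micros_of_day = divmod(micros, MICROS_PER_DAY)
nanos_of_day = micros_of_day * 1000
assert nanos_of_day <= INT64_MAX
```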
Repro notebook for the issue: https://github.com/gerashegalov/rapids-shell/blob/master/src/jupyter/int96.ipynb
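For reference, a minimal PySpark sketch of that kind of repro might look like the following (hypothetical app name and output path; the linked notebook is the authoritative version). Any timestamp past roughly 2262-04-11 exceeds what int64 nanoseconds since the epoch can represent, which is the kind of value that used to trip the overflow check:

```python
# Hypothetical minimal repro (a sketch; see the linked notebook for the real one).
import datetime
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("int96-overflow-repro")          # hypothetical name
         .config("spark.sql.parquet.outputTimestampType", "INT96")
         .getOrCreate())

# 9999-12-31 is far beyond ~2262-04-11, the largest instant representable
# as int64 nanoseconds since the epoch.
df = spark.createDataFrame([(datetime.datetime(9999, 12, 31, 23, 59, 59),)],
                           ["ts"])
df.write.mode("overwrite").parquet("/tmp/int96_repro")  # hypothetical path
```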
This should work now out of the box. The code in cudf that does this conversion is here.
Contributes to NVIDIA#8625, depends on rapidsai/cudf#13776

Signed-off-by: Gera Shegalov <[email protected]>
…write (#13776)

Rework extraction of nanoseconds of the last day in the INT96 write call path to avoid overflow.

Contributes to NVIDIA/spark-rapids#8625
Fixes #8070

Authors:
- Gera Shegalov (https://github.com/gerashegalov)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- MithunR (https://github.com/mythrocks)
- Karthikeyan (https://github.com/karthikeyann)
- Nghia Truong (https://github.com/ttnghia)

URL: #13776
- Remove INT96 check for being in the Long.Min:Max range.
- Update docs and tests

Fixes #8625
Depends on rapidsai/cudf#8070, rapidsai/cudf#13776

Verified with the [notebook](https://github.com/gerashegalov/rapids-shell/blob/9cb4598b0feba7b71eb91f396f4b577bbb0dec00/src/jupyter/int96.ipynb) and

```bash
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 TEST_PARALLEL=0 SPARK_HOME=~/dist/spark-3.3.2-bin-hadoop3 ./integration_tests/run_pyspark_from_build.sh -k test_fix_for_int96_overflow
```

Signed-off-by: Gera Shegalov <[email protected]>
Fixed by #8824
Is your feature request related to a problem? Please describe.
A customer reported jobs failing with the following error:
```
Caused by: java.lang.IllegalArgumentException: requirement failed: INT96 column contains one or more values that can overflow and will result in data corruption. Please set spark.rapids.sql.format.parquet.writer.int96.enabled to false so we can fallback on CPU for writing parquet but still take advantage of parquet read on the GPU.
```
Spark currently defaults spark.sql.parquet.outputTimestampType to INT96, so I think many users just end up here. My guess is the timestamp in this case is bogus, but it's still an interruption to their current pipeline.
Investigate if we can add support for this in the plugin and cudf.
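Until a fix lands, the two mitigations implied by the error message and the default-config discussion above are to move away from INT96 output or to let the write fall back to the CPU. A sketch in PySpark, using the config keys quoted earlier (assumes an active `spark` session):

```python
# Option 1: write TIMESTAMP_MICROS instead of INT96 (write stays on the GPU).
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

# Option 2: keep INT96 output but fall back to the CPU for the parquet write,
# as the error message suggests; parquet reads remain GPU-accelerated.
spark.conf.set("spark.rapids.sql.format.parquet.writer.int96.enabled", "false")
```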