[FEA] Validate nvcomp-3.0 with spark rapids plugin #9461
Comments
Initial testing on desktop.
After converting the NDS raw data to parquet/snappy on both CPU and GPU and comparing the results, I found differences in one of the tables (customer). This was with a 23.12 snapshot build using nvcomp-3.0.3. I am going to see if I can reproduce it with the same build using nvcomp-2.6.1, to indicate whether this is an issue in cudf or in nvcomp.
This turned out to be caused by a bug in nds_transcode.py, which was reading ISO-8859 encoded files as UTF-8. As a result, the international characters were coming through as invalid UTF-8 characters, and the GPU was handling the writing of these invalid characters differently than the CPU (pass-through vs. converting to an unknown-character code).
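To illustrate the kind of mismatch described above, here is a minimal sketch in plain Python (not the actual nds_transcode.py code). The sample string, the function names, and the choice of ISO-8859-1 specifically are assumptions made for illustration; `errors="replace"` stands in for the "convert to an unknown-character code" behavior.

```python
# Hypothetical sample value; not taken from the NDS data set.
sample = "CÔTE D'IVOIRE"

# Assumption: the raw file is ISO-8859-1 (the thread only says "ISO-8859").
raw = sample.encode("iso-8859-1")


def decode_as_latin1(data: bytes) -> str:
    """Decode with the encoding the file was actually written in."""
    return data.decode("iso-8859-1")


def decode_as_utf8_lossy(data: bytes) -> str:
    """Decode as UTF-8, replacing invalid byte sequences with U+FFFD.

    This mimics a writer that converts invalid input to an
    unknown-character code instead of passing the bytes through.
    """
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(repr(decode_as_latin1(raw)))      # round-trips to the original text
    print(repr(decode_as_utf8_lossy(raw)))  # contains a replacement character
    print(raw)                              # the untouched bytes ("pass-through")
```

A writer that passes the raw bytes through and one that substitutes U+FFFD will produce different output for the same input, which would show up as a data difference even though neither compressor is at fault.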
During initial testing on desktop, I found that the output produced for query98 using parquet/zstd was unreadable on the CPU in Spark. In spark-3.2.1 it was reporting a corrupted page, and in spark-3.4.1 it was reading a bogus length, which led it to read beyond the end of the file. I was able to isolate the bad page and share it with Eric Schmidt, who found the bug in nvcomp. I have verified that his fix resolves the problem. He is going to include it in a 3.0.4 release.
nvcomp-3.0.4 has been pulled into the cudf/spark-rapids builds, and additional validation work is being done by another team, so I am going to close this.
In 23.10, cuDF is using nvcomp-2.6.1. In 23.12, we would like to move to nvcomp-3.0.x.
We need to run tests with the spark rapids plugin to ensure that the updated snappy/zstd compressors and decompressors still produce correct data and that compression is equal to or better than with 2.6.1. We also need to measure any performance impact when running the NDS benchmarks.
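As a rough sketch of the "compression is equal to or better" check, the snippet below compares the total on-disk size of two output directories, one written with each nvcomp version. The directory layout, the `.parquet` suffix filter, and the function names are assumptions for illustration; this is not part of the plugin or of any existing validation script.

```python
import os


def total_bytes(root: str, suffix: str = ".parquet") -> int:
    """Sum the sizes of all files under `root` whose names end with `suffix`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


def compression_not_worse(baseline_dir: str, candidate_dir: str,
                          tolerance: float = 0.0) -> bool:
    """Return True if the candidate output is no larger than the baseline.

    `tolerance` is a fractional allowance, e.g. 0.01 accepts up to 1% growth.
    """
    baseline = total_bytes(baseline_dir)
    candidate = total_bytes(candidate_dir)
    return candidate <= baseline * (1.0 + tolerance)
```

Calling `compression_not_worse("out_nvcomp_2.6.1", "out_nvcomp_3.0.x")` with two hypothetical output paths gives a simple pass/fail signal. It only compares sizes and says nothing about data correctness.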
A sample validation plan is in issue #3037.
PR in cuDF for testing with nvcomp-3.0.x: rapidsai/cudf#13815
Rapids-CMake PR: rapidsai/rapids-cmake#451