
[FEA] Validate nvcomp-3.0 with spark rapids plugin #9461

Closed
jbrennan333 opened this issue Oct 17, 2023 · 5 comments

Comments

@jbrennan333
Contributor

In 23.10, cuDF is using nvcomp-2.6.1. In 23.12, we would like to move to nvcomp-3.0.x.
We need to run tests with the spark rapids plugin to ensure that the updated snappy/zstd compressors and decompressors still produce correct data, to confirm that compression is equal to or better than with 2.6.1, and to measure any performance impact when running NDS benchmarks.

A sample validation plan is in issue #3037.

PR in cuDF for testing with nvcomp-3.0.x: rapidsai/cudf#13815
Rapids-CMake PR: rapidsai/rapids-cmake#451
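
For the "compression is equal to or better than with 2.6.1" check, one simple approach is to compare the total on-disk size of two output directories. The following is a minimal sketch, not the script used for this validation: the function names, the directory layout, and the `tolerance` parameter are all assumptions for illustration. It counts every file under each directory, so marker and checksum files that Spark writes are included in both totals.

```python
import os


def total_bytes(root):
    """Sum the sizes of all regular files under root, recursively."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def size_regression(baseline_dir, candidate_dir, tolerance=0.0):
    """Compare two output directories (hypothetical helper).

    Returns (baseline_bytes, candidate_bytes, regressed), where regressed
    is True when the candidate is larger than the baseline by more than
    `tolerance` (a fraction, e.g. 0.01 for 1%).
    """
    baseline = total_bytes(baseline_dir)
    candidate = total_bytes(candidate_dir)
    regressed = candidate > baseline * (1.0 + tolerance)
    return baseline, candidate, regressed
```

Here the baseline would be the data written with nvcomp-2.6.1 and the candidate the data written with nvcomp-3.0.x, for the same table and codec.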

@jbrennan333 jbrennan333 added the feature request New feature or request label Oct 17, 2023
@jbrennan333 jbrennan333 self-assigned this Oct 17, 2023
@jbrennan333
Contributor Author

jbrennan333 commented Oct 24, 2023

Initial testing on desktop.

  • Run compression/decompression integration tests

  • NDS2.0 Data Conversion SNAPPY - scale 100 (desktop)

    • convert from raw data to parquet with no compression
    • convert from raw data to parquet/snappy using CPU
    • convert from raw data to parquet/snappy using GPU
    • compare sizes of all three
    • verify data matches between CPU parquet/snappy and GPU parquet/snappy
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using parquet/snappy data generated by CPU.

  • NDS2.0 Power Run - scale 100 (desktop) on CPU using parquet/snappy data generated by GPU.

  • NDS2.0 Power Run - scale 100 (desktop) on GPU using parquet/snappy data generated by CPU.

  • NDS2.0 Power Run - scale 100 (desktop) on GPU using parquet/snappy data generated by GPU.

    • Compare results from these four runs.
  • NDS2.0 Data Conversion ORC/SNAPPY - scale 100 (desktop)

    • convert from raw data to orc with no compression
    • convert from raw data to orc/snappy using CPU
    • convert from raw data to orc/snappy using GPU
    • compare sizes of all three
    • verify data matches between CPU snappy and GPU snappy
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using orc/snappy data generated by CPU.

  • NDS2.0 Power Run - scale 100 (desktop) on CPU using orc/snappy data generated by GPU.

  • NDS2.0 Power Run - scale 100 (desktop) on GPU using orc/snappy data generated by CPU.

  • NDS2.0 Power Run - scale 100 (desktop) on GPU using orc/snappy data generated by GPU.

    • Compare results from these four runs
  • NDS2.0 Data Conversion ZSTD - scale 100 (desktop)

    • convert from raw data to parquet with no compression
    • convert from raw data to parquet/zstd using CPU
    • convert from raw data to parquet/zstd using GPU
    • compare sizes of all three
    • verify data matches between CPU parquet/zstd and GPU parquet/zstd

  • NDS2.0 Power Run - scale 100 (desktop) on CPU using parquet/zstd data generated by CPU.

  • NDS2.0 Power Run - scale 100 (desktop) on CPU using parquet/zstd data generated by GPU.

  • NDS2.0 Power Run - scale 100 (desktop) on GPU using parquet/zstd data generated by CPU.

  • NDS2.0 Power Run - scale 100 (desktop) on GPU using parquet/zstd data generated by GPU.

    • Compare results from these four runs.
  • NDS2.0 Data Conversion ORC/ZSTD - scale 100 (desktop)

    • convert from raw data to orc with no compression
    • convert from raw data to orc/zstd using CPU
    • convert from raw data to orc/zstd using GPU
    • compare sizes of all three
    • verify data matches between CPU zstd and GPU zstd
  • NDS2.0 Power Run - scale 100 (desktop) on CPU using orc/zstd data generated by CPU.

  • NDS2.0 Power Run - scale 100 (desktop) on CPU using orc/zstd data generated by GPU.

  • NDS2.0 Power Run - scale 100 (desktop) on GPU using orc/zstd data generated by CPU.

  • NDS2.0 Power Run - scale 100 (desktop) on GPU using orc/zstd data generated by GPU.

    • Compare results from these four runs
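
The "verify data matches" steps above need a comparison that ignores row order, since CPU and GPU writers may emit rows in different orders. A minimal sketch of that idea, assuming each table has already been loaded into a list of hashable row tuples (the `diff_rows` helper is hypothetical, not part of the NDS tooling):

```python
from collections import Counter


def diff_rows(cpu_rows, gpu_rows):
    """Order-insensitive comparison of two row collections.

    Each row must be hashable (e.g. a tuple). Duplicated rows are counted,
    so a row that appears twice on one side and once on the other is
    reported. Returns (only_in_cpu, only_in_gpu) as Counters; both are
    empty when the data matches.
    """
    cpu_counts = Counter(cpu_rows)
    gpu_counts = Counter(gpu_rows)
    return cpu_counts - gpu_counts, gpu_counts - cpu_counts


cpu = [(1, "alice"), (2, "bob"), (2, "bob")]
gpu = [(2, "bob"), (1, "alice"), (2, "bob")]
only_cpu, only_gpu = diff_rows(cpu, gpu)
print("match" if not only_cpu and not only_gpu else "mismatch")
```

At scale 100 the same comparison would normally be done inside Spark rather than in local memory; this only illustrates the multiset semantics.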


@sameerz sameerz added the task Work required that improves the product but is not user facing label Oct 24, 2023
@jbrennan333
Contributor Author

After converting NDS raw data to parquet/snappy with both CPU and GPU and comparing the resulting data, I found differences in one of the tables (customer). This was using a 23.12 snapshot build with nvcomp-3.0.3. I am going to see if I can reproduce it using the same build with nvcomp-2.6.1, to indicate whether this might be an issue in cudf rather than nvcomp.

@jbrennan333
Contributor Author

This turned out to be caused by a bug in nds_transcode.py, which was reading ISO-8859 encoded files as UTF-8. As a result, the international characters were coming through as invalid UTF-8 sequences, and the GPU was writing these invalid characters differently than the CPU (passing them through vs. converting them to an unknown-character code).
NVIDIA/spark-rapids-benchmarks#170
#9560
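
The encoding mismatch can be reproduced in isolation. This is a minimal sketch, assuming the ISO-8859-1 (Latin-1) variant and a made-up sample string; it does not use nds_transcode.py or Spark, and only shows why the same bytes yield different results depending on how invalid UTF-8 is handled.

```python
# Bytes as they would appear in a Latin-1 encoded file (sample value is made up).
raw = "Côte d'Ivoire".encode("iso-8859-1")

# Strict UTF-8 decoding rejects the byte 0xF4, which is not valid here.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding substitutes U+FFFD, the "unknown character" replacement.
lossy = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the file was actually written in recovers the text.
correct = raw.decode("iso-8859-1")

print(strict_ok, repr(lossy), repr(correct))
```

A writer that substitutes a replacement character and one that passes the raw bytes through will produce different output for such a value, which would be consistent with the CPU/GPU difference seen in the customer table.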

@jbrennan333
Contributor Author

jbrennan333 commented Oct 27, 2023

During initial testing on desktop, I found that the output produced for query98 using parquet/zstd was unreadable with CPU in spark. In spark-3.2.1 it was reporting a corrupted page, and in spark-3.4.1 it was reading a bogus length, leading it to read beyond the limits of the file. I was able to isolate the bad page and share it with Eric Schmidt, who was able to find the bug in nvcomp. I have verified that his fix resolves the problem. He is going to include it in a 3.0.4 release.
Note that this was appearing as a compatibility issue, because newer versions of zstd (command line utility) were decompressing the bad page successfully.

https://gitlab-master.nvidia.com/GPUDB/nvcomp/-/issues/541

@jbrennan333
Contributor Author

nvcomp-3.0.4 has been pulled into cudf/spark-rapids builds, and additional work to validate is being done by another team, so I am going to close this.
