
[FEA] We should have regular validation jobs for parquet with zstd compression #7658

Open · jbrennan333 opened this issue on Feb 3, 2023 · 0 comments
Labels: test (Only impacts tests)

jbrennan333 commented Feb 3, 2023

We currently run benchmarks and validation for parquet/snappy decompression. I don't think we are currently doing regular validation of snappy compression, beyond the small amount exercised when writing out the results of queries.

As the devtech team continues to improve compression/decompression in nvcomp, and the cudf team continues to improve parquet/orc reading and writing, we should have validation jobs that run regularly against the spark-rapids plugin to validate reading and writing zstd data.
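For reference, here is a minimal PySpark sketch of the kind of write configuration such a job would exercise: the spark-rapids plugin enabled and the Parquet codec set to zstd. The config keys below are the standard Spark / spark-rapids settings; the input and output paths are placeholders, and this assumes the plugin build in use supports zstd writes on the GPU.

```python
from pyspark.sql import SparkSession

# Placeholder paths; a real job would point at the NDS 2.0 dataset locations.
INPUT_PATH = "/data/nds2.0/parquet_snappy/store_sales"
OUTPUT_PATH = "/data/nds2.0/parquet_zstd/store_sales"

spark = (
    SparkSession.builder
    .appName("parquet-zstd-write-check")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # enable spark-rapids
    .config("spark.rapids.sql.enabled", "true")               # run SQL on the GPU where supported
    .config("spark.sql.parquet.compression.codec", "zstd")    # write zstd-compressed parquet
    .getOrCreate()
)

# Read existing data and rewrite it with zstd compression.
df = spark.read.parquet(INPUT_PATH)
df.write.mode("overwrite").parquet(OUTPUT_PATH)
```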

devtech and nvcomp have their own test pipelines, but we have often found that running large query sets like NDS 2.0 at scale 3000 in Spark can shake out bugs that escape earlier testing. We also want to catch any data inconsistencies as soon as possible so they can be fixed earlier in the release cycle.

We have done zstd compression/decompression validation by hand for the last couple of releases. The plan used for this type of testing in 22.12 can be found in issue #3037; we followed the same set of steps for 23.02.

I don't think we need to do all of what is in that plan, but it gives a good outline of things we may want to test for.
Some things we might want are:

  • Run NDS 2.0 benchmarks on GPU at scale 3000 using zstd data (produced by the CPU) on a regular basis (maybe weekly) so we can track performance changes.
  • Run a job to convert raw NDS 2.0 data to parquet with zstd compression, and validate that the data matches the CPU-generated data; we don't need to regenerate the CPU data every time. (We might also want to do this for snappy, as it is more heavily used.) A rough comparison sketch is shown after this list.
  • Run the NDS 2.0 power run on GPU at scale 3000 using the GPU-generated parquet-zstd data and validate the results (this particular case found a decompression bug in 23.02).
  • Run the NDS 2.0 power run on CPU at scale using GPU-generated data (this ensures our GPU-generated data is still readable on the CPU).
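
As a rough sketch of the comparison step mentioned above (not an existing tool; the paths are placeholders), the GPU-generated zstd data could be checked against the CPU-generated data per table with something like:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zstd-data-validation").getOrCreate()

# Placeholder paths for the CPU- and GPU-generated copies of one NDS 2.0 table.
cpu_df = spark.read.parquet("/data/nds2.0/cpu_parquet_zstd/store_sales")
gpu_df = spark.read.parquet("/data/nds2.0/gpu_parquet_zstd/store_sales")

# Cheap first check: row counts must match.
assert cpu_df.count() == gpu_df.count(), "row count mismatch"

# Stronger check: the symmetric difference of the two datasets should be empty.
# exceptAll keeps duplicates, so dropped or duplicated rows are caught too.
missing_on_gpu = cpu_df.exceptAll(gpu_df).count()
extra_on_gpu = gpu_df.exceptAll(cpu_df).count()
assert missing_on_gpu == 0 and extra_on_gpu == 0, (
    f"data mismatch: {missing_on_gpu} rows missing on GPU, {extra_on_gpu} rows extra"
)
```

A real job would loop over all of the tables, and might need looser comparisons for column types (e.g. floating point) where exact equality is too strict.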

I'm assuming we would want to do these for parquet data. Ideally we would do them for ORC as well if resources allow.
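If we do extend this to ORC, the write side should only need the analogous codec setting. This reuses the `spark` session and `df` from the first sketch above, uses a placeholder output path, and assumes a Spark version whose ORC writer accepts zstd:

```python
# Analogous codec setting for ORC output; "zstd" is accepted by newer Spark versions.
spark.conf.set("spark.sql.orc.compression.codec", "zstd")
df.write.mode("overwrite").orc("/data/nds2.0/orc_zstd/store_sales")  # placeholder path
```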

jbrennan333 added the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels on Feb 3, 2023
sameerz added the "test" (Only impacts tests) label and removed the "feature request" and "? - Needs Triage" labels on Feb 7, 2023