
[FEA] We should have regular validation jobs for parquet with zstd compression #7658

Open · jbrennan333 opened this issue on Feb 3, 2023 · 0 comments
Labels: test (Only impacts tests)

jbrennan333 commented Feb 3, 2023

We currently run benchmarks and validation for parquet/snappy decompression. I don't think we are currently doing regular validation of snappy compression, beyond the small amount exercised when writing out the results of queries.

As the devtech team continues to improve compression/decompression in nvcomp, and the cudf team continues to improve parquet/orc reading and writing, we should have validation jobs that run regularly against the spark-rapids plugin to validate reading and writing zstd data.
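For reference, here is a minimal PySpark sketch of the kind of write configuration such a job would exercise: the spark-rapids plugin enabled and the Parquet codec set to zstd. The config keys below are the standard Spark / spark-rapids settings; the input and output paths are placeholders, and this assumes the plugin build in use supports zstd writes on the GPU.

```python
from pyspark.sql import SparkSession

# Placeholder paths; a real job would point at the NDS 2.0 dataset locations.
INPUT_PATH = "/data/nds2.0/parquet_snappy/store_sales"
OUTPUT_PATH = "/data/nds2.0/parquet_zstd/store_sales"

spark = (
    SparkSession.builder
    .appName("parquet-zstd-write-check")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # enable spark-rapids
    .config("spark.rapids.sql.enabled", "true")               # run SQL on the GPU where supported
    .config("spark.sql.parquet.compression.codec", "zstd")    # write zstd-compressed parquet
    .getOrCreate()
)

# Read existing data and rewrite it with zstd compression.
df = spark.read.parquet(INPUT_PATH)
df.write.mode("overwrite").parquet(OUTPUT_PATH)
```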

devtech and nvcomp have their own test pipelines, but we have often found that running large query sets like NDS 2.0 at scale 3000 in Spark can shake out bugs that escape earlier testing. We also want to catch any data inconsistencies as soon as possible so they can be fixed earlier in the release cycle.

We have done zstd compression/decompression validation by hand for the last couple of releases. The plan used for this type of testing in 22.12 can be found in issue #3037; we followed the same set of steps for 23.02.

I don't think we need to do all of what is in that plan, but it gives a good outline of things we may want to test for.
Some things we might want are:

  • Run NDS 2.0 benchmarks on GPU at scale 3000 using zstd data (produced by the CPU) on a regular basis (maybe weekly) so we can track performance changes.
  • Run a job to convert raw NDS 2.0 data to parquet with zstd compression, and validate that the data matches the CPU-generated data; we don't need to regenerate the CPU data every time. (We might also want to do this for snappy, as it is more heavily used.) A rough comparison sketch is shown after this list.
  • Run the NDS 2.0 power run on GPU at scale 3000 using the GPU-generated parquet-zstd data and validate the results (this particular case found a decompression bug in 23.02).
  • Run the NDS 2.0 power run on CPU at scale using GPU-generated data (this ensures our GPU-generated data is still readable on the CPU).
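
As a rough sketch of the comparison step mentioned above (not an existing tool; the paths are placeholders), the GPU-generated zstd data could be checked against the CPU-generated data per table with something like:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zstd-data-validation").getOrCreate()

# Placeholder paths for the CPU- and GPU-generated copies of one NDS 2.0 table.
cpu_df = spark.read.parquet("/data/nds2.0/cpu_parquet_zstd/store_sales")
gpu_df = spark.read.parquet("/data/nds2.0/gpu_parquet_zstd/store_sales")

# Cheap first check: row counts must match.
assert cpu_df.count() == gpu_df.count(), "row count mismatch"

# Stronger check: the symmetric difference of the two datasets should be empty.
# exceptAll keeps duplicates, so dropped or duplicated rows are caught too.
missing_on_gpu = cpu_df.exceptAll(gpu_df).count()
extra_on_gpu = gpu_df.exceptAll(cpu_df).count()
assert missing_on_gpu == 0 and extra_on_gpu == 0, (
    f"data mismatch: {missing_on_gpu} rows missing on GPU, {extra_on_gpu} rows extra"
)
```

A real job would loop over all of the tables, and might need looser comparisons for column types (e.g. floating point) where exact equality is too strict.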

I'm assuming we would want to do these for parquet data. Ideally we would do them for ORC as well if resources allow.
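If we do extend this to ORC, the write side should only need the analogous codec setting. This reuses the `spark` session and `df` from the first sketch above, uses a placeholder output path, and assumes a Spark version whose ORC writer accepts zstd:

```python
# Analogous codec setting for ORC output; "zstd" is accepted by newer Spark versions.
spark.conf.set("spark.sql.orc.compression.codec", "zstd")
df.write.mode("overwrite").orc("/data/nds2.0/orc_zstd/store_sales")  # placeholder path
```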

jbrennan333 added the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels on Feb 3, 2023
sameerz added the "test" (Only impacts tests) label and removed the "feature request" and "? - Needs Triage" labels on Feb 7, 2023