
[BUG] Resolve parquet reader performance regression on V100 from #14167 #14415

Closed
GregoryKimball opened this issue Nov 15, 2023 · 4 comments · Fixed by #14706
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS

Comments


GregoryKimball commented Nov 15, 2023

Describe the bug
As a side effect of #14167 (see the 23.10 release), we observed parquet reader benchmarks running about 10-15% slower on DGX V100. The effect was not observed on DGX A100. However, Spark-RAPIDS reported a 4-5% slowdown in the NDS benchmarking suite, driven by changes in IO-bound benchmarks.

The changes in #14167 were not expected to impact performance at all. The difference in libcudf nvbench results on V100 could come from a change in compiler code generation, and the difference in Spark-RAPIDS NDS on A100 could relate to the multi-threaded PTDS (per-thread default stream) workflow in NDS.

This issue documents the performance data and results of investigations into the root cause.

Steps/Code to reproduce bug

On a DGX V100, you can see the difference using this benchmark command:

./PARQUET_READER_NVBENCH --devices 0 --profile --benchmark parquet_read_io_compression --axis io=DEVICE_BUFFER --axis compression=NONE --axis cardinality=1000 --axis run_length=1

On commit b789d4ce3c090a3f25a8657d9a8582a1edb54f12 the benchmark takes 1.376 s.
On commit 2c19bf328ffefb97d17e5ae600197a4ea9ca4445 the benchmark takes 1.572 s.
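For scale, these two wall times correspond to the roughly 10-15% regression described above:

```python
# Slowdown between the two commits, from the nvbench wall times above.
before_s = 1.376  # first commit
after_s = 1.572   # second commit
slowdown_pct = (after_s - before_s) / before_s * 100
print(f"{slowdown_pct:.1f}% slower")  # about 14.2%
```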

The difference is driven by longer execution time of the gpuDecodePageKernel, as observed in nsys profiling.
(screenshot: nsys timeline showing the longer gpuDecodePageKernel execution)

Possibly unrelated background includes the V100-only performance issue in #12577

Nsys profiles:
nsys profiles before and after.zip

@GregoryKimball GregoryKimball added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Performance Performance related issue Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels Nov 15, 2023
@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Nov 15, 2023

GregoryKimball commented Nov 15, 2023

I collected some PTX from page_data.cu. The PTX goes from 16199 lines to 16456 lines and the raw diff is very noisy (22K lines).
page_data POST.ptx.txt
page_data PRE.ptx.txt

I wrote a script to reduce nuisance diffs from register numbering and code-block IDs, and the following diff remained:
diff_less_noise.txt
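The script itself is not attached; a minimal sketch of that kind of normalization (the exact regex patterns here are assumptions) could look like:

```python
import re

def normalize_ptx(text: str) -> str:
    """Canonicalize PTX so a diff shows structural changes, not renumbering."""
    # Collapse virtual register numbers: %rd7, %r12, %f3 -> %rd_, %r_, %f_
    text = re.sub(r"%(rd|fd|r|f|p)\d+", r"%\1_", text)
    # Collapse basic-block labels: $L__BB0_42 -> $L__BB_
    text = re.sub(r"\$L__BB\d+_\d+", "$L__BB_", text)
    return text

print(normalize_ptx("add.s32 %r12, %r3, 1; bra $L__BB0_42;"))
```

Running the PRE and POST PTX files through a pass like this before diffing leaves only the structural differences.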

Possibly relevant? NVIDIA/cccl#1001


GregoryKimball commented Nov 20, 2023

I did some quick testing on a V100 and the performance hotspot appears to be set_error_code:

  inline __device__ void set_error_code(decode_error err) volatile
  {
    cuda::atomic_ref<int32_t, cuda::thread_scope_block> ref{const_cast<int&>(error)};
    ref.fetch_or(static_cast<int32_t>(err), cuda::std::memory_order_relaxed);
  }
  • commenting out ref.fetch_or(...) recovers the performance, but disables error reporting and so is not an acceptable solution
  • dropping the inline hint did not recover performance
  • using the __noinline__ hint did not recover performance
  • removing volatile did not recover performance
  • changing the cuda::thread_scope did not affect performance
  • changing the cuda::std::memory_order did not affect performance


GregoryKimball commented Nov 21, 2023

I continued testing the diff in #14167 and found that commenting out these two calls to set_error_code in gpuDecodeStream recovers the performance. If either one is present, we see the regression.

        if (cur > end) {
          s->set_error_code(decode_error::LEVEL_STREAM_OVERRUN);
          break;
        }
        if (level_run <= 1) {
          s->set_error_code(decode_error::INVALID_LEVEL_RUN);
          break;
        }

Since the effect only appears in the gpuDecodeStream kernel, it makes sense that this impacts list types:
(screenshot: benchmark results showing the regression only for list-type benchmarks)

However, this observation also suggests that the libcudf benchmark regressions on V100 may NOT have the same root cause as Spark-RAPIDS NDS regressions on A100. (because NDS does not have list types!!)

@mattahrens the observation of performance issues on V100 only for list types makes getting an A100 libcudf repro even more important!


GregoryKimball commented Nov 23, 2023

Expanding on my last comment about set_error_code in gpuDecodeStream: commenting out just these two lines recovers the lost performance.

        if (cur > end) {
          //s->set_error_code(decode_error::LEVEL_STREAM_OVERRUN);
          break;
        }
        if (level_run <= 1) {
          //s->set_error_code(decode_error::INVALID_LEVEL_RUN);
          break;
        }

Please note that this code path is never reached (verified with printf to be certain). However, the absence of these lines causes the following PTX diff:
EDITB.txt
