[BUG] Compute Sanitizer reports errors for the TableTest#testORCReadAndWriteForDecimal128 test #1338

Closed
firestarman opened this issue Aug 14, 2023 · 8 comments · Fixed by rapidsai/cudf#14897
Assignees
Labels
bug Something isn't working

Comments

@firestarman
Collaborator

firestarman commented Aug 14, 2023

This test is temporarily excluded in PR rapidsai/cudf#13872.
Error stack from the Compute Sanitizer:

========= COMPUTE-SANITIZER
========= Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaLaunchKernel_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4545f6]
=========                in /usr/lib64/libcuda.so.1
=========     Host Frame: [0x3199c48]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:__device_stub__ZN4cudf2io3orc3gpu27rowgroup_char_counts_kernelENS_6detail11base_2dspanIiNS_11device_spanEEENS5_IKNS1_22orc_column_device_viewELm18446744073709551615EEENS4_IKNS1_13rowgroup_rowsES5_EENS5_IKjLm18446744073709551615EEE(cudf::detail::base_2dspan<int, cudf::device_span>&, cudf::device_span<cudf::io::orc::orc_column_device_view const, 18446744073709551615ul>&, cudf::detail::base_2dspan<cudf::io::orc::rowgroup_rows const, cudf::device_span>&, cudf::device_span<unsigned int const, 18446744073709551615ul>&) [0x1a94474]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::orc::gpu::rowgroup_char_counts_kernel(cudf::detail::base_2dspan<int, cudf::device_span>, cudf::device_span<cudf::io::orc::orc_column_device_view const, 18446744073709551615ul>, cudf::detail::base_2dspan<cudf::io::orc::rowgroup_rows const, cudf::device_span>, cudf::device_span<unsigned int const, 18446744073709551615ul>) [0x1a944ae]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::(anonymous namespace)::set_rowgroup_char_counts(cudf::io::detail::orc::orc_table_view&, cudf::detail::base_2dspan<cudf::io::orc::rowgroup_rows const, cudf::device_span>, rmm::cuda_stream_view) [0x1ac2c21]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::(anonymous namespace)::convert_table_to_orc_data(cudf::table_view const&, cudf::io::table_input_metadata const&, cudf::io::detail::orc::stripe_size_limits, int, bool, cudf::io::orc::CompressionKind, unsigned long, cudf::io::statistics_freq, bool, cudf::io::detail::single_write_mode, cudf::io::data_sink const&, rmm::cuda_stream_view) [0x1ad1657]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::writer::impl::write(cudf::table_view const&) [0x1ad2b30]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::orc_chunked_writer::write(cudf::table_view const&) [0x1a0adc9]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:Java_ai_rapids_cudf_Table_writeORCChunk [0x120e73e]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame: [0xffffffffe6e20b26]
=========                in 
========= 
========= Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4545f6]
=========                in /usr/lib64/libcuda.so.1
=========     Host Frame: [0x3193c14]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cub::CUB_101702_600_700_750_800_860_900_NS::PtxVersion(int&) [0x112da6d]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::(anonymous namespace)::decimal_chunk_sizes(cudf::io::detail::orc::orc_table_view&, cudf::io::detail::orc::file_segmentation const&, rmm::cuda_stream_view) [0x1ac601a]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::(anonymous namespace)::convert_table_to_orc_data(cudf::table_view const&, cudf::io::table_input_metadata const&, cudf::io::detail::orc::stripe_size_limits, int, bool, cudf::io::orc::CompressionKind, unsigned long, cudf::io::statistics_freq, bool, cudf::io::detail::single_write_mode, cudf::io::data_sink const&, rmm::cuda_stream_view) [0x1ad16d7]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::writer::impl::write(cudf::table_view const&) [0x1ad2b30]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::orc_chunked_writer::write(cudf::table_view const&) [0x1a0adc9]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:Java_ai_rapids_cudf_Table_writeORCChunk [0x120e73e]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame: [0xffffffffe6e20b26]
=========                in 
========= 
========= ERROR SUMMARY: 2 errors

Repro steps:
1. Apply the two PRs, rapidsai/cudf#13872 and #1321.
2. Remove the noSanitizer tag from this test, then run ./build/build-in-docker test -Dtest=TableTest -DUSE_SANITIZER=ON.

@firestarman firestarman added the "bug" and "? - Needs Triage" labels Aug 14, 2023
@firestarman
Collaborator Author

firestarman commented Aug 14, 2023

I will double-check this once the lengthy cudf build finishes.
Confirmed: the Compute Sanitizer does report errors.

@firestarman firestarman changed the title from "[BUG] TableTest#testORCReadAndWriteForDecimal128 fails when running with Compute Sanitizer" to "[BUG] Compute Sanitizer reports errors for the TableTest#testORCReadAndWriteForDecimal128 test" Aug 14, 2023
@res-life res-life self-assigned this Aug 15, 2023
@res-life
Collaborator

I reproduced the issue.
The stack shows that the cudaErrorInvalidConfiguration originates on the cuDF side.
The y dimension of the dim3 grid size is zero. Details:

https://github.com/rapidsai/cudf/blob/v23.08.00/cpp/src/io/orc/dict_enc.cu#L68-L70

  auto const grid_size =
    dim3(cudf::util::div_rounding_up_unsafe<unsigned int>(num_rowgroups, block_size),
         static_cast<unsigned int>(num_str_cols));

grid_size.y is 0 when I print it out, which means num_str_cols is zero.
From ChatGPT: each dimension (x, y, and z) of a dim3 must be greater than zero.

@jlowe, please help confirm. This is a cuDF bug; I'll file an issue.
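
The zero-sized grid described above can be reproduced on the host without a GPU. The following is a minimal C++ sketch under stated assumptions, not the cuDF source: `Dim3` stands in for CUDA's `dim3`, `div_rounding_up` stands in for `cudf::util::div_rounding_up_unsafe`, and `make_grid` and `is_launchable` are hypothetical helper names introduced only for illustration. It shows that a table without string columns yields a y dimension of 0, and one possible guard a caller could apply before launching.

```cpp
#include <cassert>

// Host-side stand-in for CUDA's dim3; as with dim3, unspecified dimensions default to 1.
struct Dim3 {
  unsigned int x, y, z;
  Dim3(unsigned int x_ = 1, unsigned int y_ = 1, unsigned int z_ = 1) : x(x_), y(y_), z(z_) {}
};

// Ceiling division; illustrative stand-in for cudf::util::div_rounding_up_unsafe.
constexpr unsigned int div_rounding_up(unsigned int dividend, unsigned int divisor)
{
  return (dividend + divisor - 1) / divisor;
}

// Mirrors the grid computation quoted above (hypothetical helper, not a cuDF function).
Dim3 make_grid(unsigned int num_rowgroups, unsigned int num_str_cols, unsigned int block_size)
{
  assert(block_size > 0);
  return Dim3(div_rounding_up(num_rowgroups, block_size), num_str_cols);
}

// Hypothetical guard: CUDA rejects a launch when any grid dimension is zero,
// so a caller could skip the launch entirely when there is no work to do.
bool is_launchable(Dim3 const& grid) { return grid.x > 0 && grid.y > 0 && grid.z > 0; }
```

With an assumed block size of 512, `make_grid(4, 0, 512)` gives a grid of (1, 0, 1), which `is_launchable` rejects, while `make_grid(4, 1, 512)` gives (1, 1, 1), which is valid. This matches the observation that adding a string column makes the error go away.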

@res-life
Collaborator

res-life commented Aug 15, 2023

If I update TableTest#testORCReadAndWriteForDecimal128 to write a String column, num_str_cols is non-zero and the error disappears.

    String[] colNames = new String[]{Columns.DECIMAL64.name,
        Columns.DECIMAL128.name, Columns.STRUCT_DEC128.name, Columns.LIST_DEC128.name};

==>>

    String[] colNames = new String[]{Columns.STRING.name,
        Columns.DECIMAL128.name, Columns.STRUCT_DEC128.name, Columns.LIST_DEC128.name};

@jlowe
Member

jlowe commented Aug 15, 2023

Yes, I believe the dimensions are supposed to be greater than zero. If this is an illegal launch configuration, why do we only see a CUDA error when running under the sanitizer? I would expect the driver to return an error when trying to launch a kernel with an invalid launch config. Or does it treat it as a no-op instead?
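
One possible explanation, which this thread does not confirm: a `<<<...>>>` kernel launch returns nothing to the caller, so a rejected configuration is only recorded as the runtime's last error for that host thread. It goes unnoticed until some later code calls `cudaGetLastError`, which returns the error and resets it to `cudaSuccess`. The second sanitizer report above, where `cub::PtxVersion` hits error 9 on `cudaGetLastError`, is consistent with that. The sketch below models this behaviour in plain C++ with no CUDA dependency; `FakeRuntime`, `launch`, `get_last_error`, and `Status` are hypothetical stand-ins, not CUDA APIs.

```cpp
// Hypothetical stand-ins for two CUDA runtime error codes (9 is the code shown in the log).
enum class Status { Success = 0, InvalidConfiguration = 9 };

// Toy model of the runtime's last-error slot; not a CUDA API.
class FakeRuntime {
 public:
  // Models a <<<...>>> launch: nothing is returned to the caller, and a zero-sized
  // grid dimension only records an error instead of running the kernel.
  void launch(unsigned int grid_x, unsigned int grid_y, bool& kernel_ran)
  {
    kernel_ran = grid_x > 0 && grid_y > 0;
    if (!kernel_ran) { last_error_ = Status::InvalidConfiguration; }
  }

  // Models cudaGetLastError: returns the recorded error and resets it to success.
  Status get_last_error()
  {
    Status const err = last_error_;
    last_error_      = Status::Success;
    return err;
  }

 private:
  Status last_error_ = Status::Success;
};
```

In this model, a caller that never queries the slot after `launch(1, 0, ran)` sees nothing wrong apart from the kernel not running. The first later call to `get_last_error()` receives the stale `InvalidConfiguration`, and the call after that sees `Success`.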

@res-life
Collaborator

res-life commented Dec 4, 2023

Depends on cuDF issue: rapidsai/cudf#13887

@GaryShen2008
Collaborator

It seems cuDF has fixed the issue. @res-life, can you verify again?

@res-life
Collaborator

cuDF PR: rapidsai/cudf#14897

@res-life
Collaborator

Retargeting this issue to 24.04 because the cuDF 24.02 branch is locked.
