[BUG] explode leaks a host memory #10244

revans2 · 2022-02-08T13:39:32Z

Describe the bug
Spark has noticed host memory leaks when running large numbers of tests over a long period of time. I have been tracking them down and found that explode is leaking 32 bytes of memory in some cases when it is called.

Steps/Code to reproduce bug
I wrote a custom memory leak detector because java and valgrind are not happy together. valgrind does not really like CUDA all that much either, but if there are better tools I am happy to use them.

valgrind --leak-check=full ./gtests/LISTS_TEST ./gtests/LISTS_TEST --gtest_filter='ExplodeTest.Empty*'

Then look for the definitely leaked part (ignoring all of the warnings about IOCTLs and memory mappings...

==130808== 32 bytes in 1 blocks are definitely lost in loss record 227 of 1,274
==130808==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==130808==    by 0x59B3108: std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<int const*>(cudf::table_view const&, int const*, int const*, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604D444: cudf::detail::(anonymous namespace)::build_table(cudf::table_view const&, int, cudf::column_view const&, cudf::device_span<int const, 18446744073709551615ul>, thrust::optional<cudf::device_span<int const, 18446744073709551615ul> >, thrust::optional<rmm::device_uvector<int> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604DD4D: cudf::detail::explode(cudf::table_view const&, int, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604DFA9: cudf::explode(cudf::table_view const&, int, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x555F1D: ExplodeTest_Empty_Test::TestBody() (in .../gtests/LISTS_TEST)
==130808==    by 0x13B67150: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in .../lib/libgtest.so)
==130808==    by 0x13B5B2C5: testing::Test::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5B424: testing::TestInfo::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5B54C: testing::TestSuite::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5BB02: testing::internal::UnitTestImpl::RunAllTests() (in .../lib/libgtest.so)
==130808==    by 0x13B5BD67: testing::UnitTest::Run() (in .../lib/libgtest.so)
==130808== 
==130808== 32 bytes in 1 blocks are definitely lost in loss record 228 of 1,274
==130808==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==130808==    by 0x59B3108: std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<int const*>(cudf::table_view const&, int const*, int const*, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604D444: cudf::detail::(anonymous namespace)::build_table(cudf::table_view const&, int, cudf::column_view const&, cudf::device_span<int const, 18446744073709551615ul>, thrust::optional<cudf::device_span<int const, 18446744073709551615ul> >, thrust::optional<rmm::device_uvector<int> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604E495: cudf::detail::explode_position(cudf::table_view const&, int, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604E7B9: cudf::explode_position(cudf::table_view const&, int, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x55603D: ExplodeTest_Empty_Test::TestBody() (in .../gtests/LISTS_TEST)
==130808==    by 0x13B67150: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in .../lib/libgtest.so)
==130808==    by 0x13B5B2C5: testing::Test::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5B424: testing::TestInfo::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5B54C: testing::TestSuite::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5BB02: testing::internal::UnitTestImpl::RunAllTests() (in .../lib/libgtest.so)
==130808==    by 0x13B5BD67: testing::UnitTest::Run() (in .../lib/libgtest.so)

I have not seen leaks on join or other operations that use gather, so I suspect it is related explode itself.

Expected behavior
no leaks

The text was updated successfully, but these errors were encountered:

revans2 · 2022-02-08T13:43:35Z

As a side note I think I have tracked it down to the release being done.

cudf/cpp/src/lists/explode.cu

Line 68 in 8af4e84

    
           std::vector<std::unique_ptr<column>> columns = gathered_table.release()->release();

When I comment out lines 70 to 99 in the file I still see leaks. When I then just return gathered_table there are no leaks.

revans2 · 2022-02-08T13:46:51Z

Okay I figured it out. we are leaking a pointer to the table. gather_table.release() is releasing the pointer that is never freed. I'll put up a PR with a fix shortly.

This fixes #10244 you can look at that for the analysis of the bug. Authors: - Robert (Bobby) Evans (https://github.com/revans2) Approvers: - David Wendt (https://github.com/davidwendt) - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu) URL: #10245

revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Feb 8, 2022

revans2 self-assigned this Feb 8, 2022

revans2 mentioned this issue Feb 8, 2022

Fix small leak in explode #10245

Merged

rapids-bot bot closed this as completed in #10245 Feb 8, 2022

bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] explode leaks a host memory #10244

[BUG] explode leaks a host memory #10244

revans2 commented Feb 8, 2022

revans2 commented Feb 8, 2022

revans2 commented Feb 8, 2022

[BUG] explode leaks a host memory #10244

[BUG] explode leaks a host memory #10244

Comments

revans2 commented Feb 8, 2022

revans2 commented Feb 8, 2022

revans2 commented Feb 8, 2022