Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] explode leaks a host memory #10244

Closed
revans2 opened this issue Feb 8, 2022 · 2 comments · Fixed by #10245
Closed

[BUG] explode leaks a host memory #10244

revans2 opened this issue Feb 8, 2022 · 2 comments · Fixed by #10245
Assignees
Labels
bug Something isn't working Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Feb 8, 2022

Describe the bug
Spark has noticed host memory leaks when running large numbers of tests over a long period of time. I have been tracking them down and found that explode is leaking 32 bytes of memory in some cases when it is called.

Steps/Code to reproduce bug
I wrote a custom memory leak detector because java and valgrind are not happy together. valgrind does not really like CUDA all that much either, but if there are better tools I am happy to use them.

valgrind --leak-check=full ./gtests/LISTS_TEST ./gtests/LISTS_TEST --gtest_filter='ExplodeTest.Empty*'

Then look for the definitely leaked part (ignoring all of the warnings about IOCTLs and memory mappings...

==130808== 32 bytes in 1 blocks are definitely lost in loss record 227 of 1,274
==130808==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==130808==    by 0x59B3108: std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<int const*>(cudf::table_view const&, int const*, int const*, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604D444: cudf::detail::(anonymous namespace)::build_table(cudf::table_view const&, int, cudf::column_view const&, cudf::device_span<int const, 18446744073709551615ul>, thrust::optional<cudf::device_span<int const, 18446744073709551615ul> >, thrust::optional<rmm::device_uvector<int> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604DD4D: cudf::detail::explode(cudf::table_view const&, int, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604DFA9: cudf::explode(cudf::table_view const&, int, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x555F1D: ExplodeTest_Empty_Test::TestBody() (in .../gtests/LISTS_TEST)
==130808==    by 0x13B67150: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in .../lib/libgtest.so)
==130808==    by 0x13B5B2C5: testing::Test::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5B424: testing::TestInfo::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5B54C: testing::TestSuite::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5BB02: testing::internal::UnitTestImpl::RunAllTests() (in .../lib/libgtest.so)
==130808==    by 0x13B5BD67: testing::UnitTest::Run() (in .../lib/libgtest.so)
==130808== 
==130808== 32 bytes in 1 blocks are definitely lost in loss record 228 of 1,274
==130808==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==130808==    by 0x59B3108: std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<int const*>(cudf::table_view const&, int const*, int const*, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604D444: cudf::detail::(anonymous namespace)::build_table(cudf::table_view const&, int, cudf::column_view const&, cudf::device_span<int const, 18446744073709551615ul>, thrust::optional<cudf::device_span<int const, 18446744073709551615ul> >, thrust::optional<rmm::device_uvector<int> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604E495: cudf::detail::explode_position(cudf::table_view const&, int, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x604E7B9: cudf::explode_position(cudf::table_view const&, int, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808==    by 0x55603D: ExplodeTest_Empty_Test::TestBody() (in .../gtests/LISTS_TEST)
==130808==    by 0x13B67150: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in .../lib/libgtest.so)
==130808==    by 0x13B5B2C5: testing::Test::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5B424: testing::TestInfo::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5B54C: testing::TestSuite::Run() (in .../lib/libgtest.so)
==130808==    by 0x13B5BB02: testing::internal::UnitTestImpl::RunAllTests() (in .../lib/libgtest.so)
==130808==    by 0x13B5BD67: testing::UnitTest::Run() (in .../lib/libgtest.so)

I have not seen leaks on join or other operations that use gather, so I suspect it is related explode itself.

Expected behavior
no leaks

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Feb 8, 2022
@revans2
Copy link
Contributor Author

revans2 commented Feb 8, 2022

As a side note I think I have tracked it down to the release being done.

std::vector<std::unique_ptr<column>> columns = gathered_table.release()->release();

When I comment out lines 70 to 99 in the file I still see leaks. When I then just return gathered_table there are no leaks.

@revans2 revans2 self-assigned this Feb 8, 2022
@revans2
Copy link
Contributor Author

revans2 commented Feb 8, 2022

Okay I figured it out. we are leaking a pointer to the table. gather_table.release() is releasing the pointer that is never freed. I'll put up a PR with a fix shortly.

rapids-bot bot pushed a commit that referenced this issue Feb 8, 2022
This fixes #10244 you can look at that for the analysis of the bug.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)

URL: #10245
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants