Describe the bug
Spark has noticed host memory leaks when running large numbers of tests over a long period of time. I have been tracking them down and found that explode leaks 32 bytes of host memory in some cases when it is called.
Steps/Code to reproduce bug
I wrote a custom memory leak detector because Java and valgrind are not happy together. valgrind does not really like CUDA all that much either, but if there are better tools I am happy to use them. Run the lists gtest binary under valgrind with leak checking enabled (e.g. valgrind --leak-check=full ./gtests/LISTS_TEST), then look for the definitely leaked part (ignoring all of the warnings about IOCTLs and memory mappings):
==130808== 32 bytes in 1 blocks are definitely lost in loss record 227 of 1,274
==130808== at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==130808== by 0x59B3108: std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<int const*>(cudf::table_view const&, int const*, int const*, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808== by 0x604D444: cudf::detail::(anonymous namespace)::build_table(cudf::table_view const&, int, cudf::column_view const&, cudf::device_span<int const, 18446744073709551615ul>, thrust::optional<cudf::device_span<int const, 18446744073709551615ul> >, thrust::optional<rmm::device_uvector<int> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808== by 0x604DD4D: cudf::detail::explode(cudf::table_view const&, int, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808== by 0x604DFA9: cudf::explode(cudf::table_view const&, int, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808== by 0x555F1D: ExplodeTest_Empty_Test::TestBody() (in .../gtests/LISTS_TEST)
==130808== by 0x13B67150: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in .../lib/libgtest.so)
==130808== by 0x13B5B2C5: testing::Test::Run() (in .../lib/libgtest.so)
==130808== by 0x13B5B424: testing::TestInfo::Run() (in .../lib/libgtest.so)
==130808== by 0x13B5B54C: testing::TestSuite::Run() (in .../lib/libgtest.so)
==130808== by 0x13B5BB02: testing::internal::UnitTestImpl::RunAllTests() (in .../lib/libgtest.so)
==130808== by 0x13B5BD67: testing::UnitTest::Run() (in .../lib/libgtest.so)
==130808==
==130808== 32 bytes in 1 blocks are definitely lost in loss record 228 of 1,274
==130808== at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==130808== by 0x59B3108: std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<int const*>(cudf::table_view const&, int const*, int const*, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808== by 0x604D444: cudf::detail::(anonymous namespace)::build_table(cudf::table_view const&, int, cudf::column_view const&, cudf::device_span<int const, 18446744073709551615ul>, thrust::optional<cudf::device_span<int const, 18446744073709551615ul> >, thrust::optional<rmm::device_uvector<int> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808== by 0x604E495: cudf::detail::explode_position(cudf::table_view const&, int, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808== by 0x604E7B9: cudf::explode_position(cudf::table_view const&, int, rmm::mr::device_memory_resource*) (in .../libcudf.so)
==130808== by 0x55603D: ExplodeTest_Empty_Test::TestBody() (in .../gtests/LISTS_TEST)
==130808== by 0x13B67150: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in .../lib/libgtest.so)
==130808== by 0x13B5B2C5: testing::Test::Run() (in .../lib/libgtest.so)
==130808== by 0x13B5B424: testing::TestInfo::Run() (in .../lib/libgtest.so)
==130808== by 0x13B5B54C: testing::TestSuite::Run() (in .../lib/libgtest.so)
==130808== by 0x13B5BB02: testing::internal::UnitTestImpl::RunAllTests() (in .../lib/libgtest.so)
==130808== by 0x13B5BD67: testing::UnitTest::Run() (in .../lib/libgtest.so)
I have not seen leaks on join or other operations that use gather, so I suspect it is related to explode itself.
Expected behavior
No leaks.
Okay, I figured it out. We are leaking a pointer to the table: gather_table.release() hands back the raw pointer, which is never freed. I'll put up a PR with a fix shortly.
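For anyone following along, here is a minimal sketch of the pattern I believe is at fault. The struct names below are hypothetical stand-ins for cudf::table and cudf::column, not the actual explode source; the point is that calling release() on the std::unique_ptr<table> returned by gather yields a raw table* that is never deleted, even though the columns inside it are moved out.

```cpp
#include <memory>
#include <utility>
#include <vector>

struct column {};

struct table {
  std::vector<std::unique_ptr<column>> cols;
  // Mirrors cudf::table::release(): moves the columns out of the table.
  std::vector<std::unique_ptr<column>> release() { return std::move(cols); }
};

// BUG: unique_ptr::release() relinquishes ownership of the table object.
// The columns are recovered, but the small heap-allocated table itself is
// never deleted, matching the "32 bytes in 1 blocks are definitely lost"
// records above.
std::vector<std::unique_ptr<column>> leaky(std::unique_ptr<table> gather_table) {
  return gather_table.release()->release();
}

// FIX: keep the unique_ptr owning the table while moving the columns out;
// the table object is destroyed when gather_table goes out of scope.
std::vector<std::unique_ptr<column>> fixed(std::unique_ptr<table> gather_table) {
  return gather_table->release();
}

int main() {
  auto t = std::make_unique<table>();
  t->cols.push_back(std::make_unique<column>());
  auto cols = fixed(std::move(t));  // no leak: the table is freed here
  return cols.size() == 1 ? 0 : 1;
}
```

The actual fix in the PR may look different, but the shape of the problem should be the same.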