Skip to content

Commit

Permalink
Fix for inserting duplicates in groupby result cache (#9508)
Browse files Browse the repository at this point in the history
Fixes #9507

Prevents inserting to groupby result cache if the result for <column, aggregation> pair is already present in the cache.
Added unit test to test this. 

**Details:**
When `add_result(col1, agg1, column1); add_result(col1, agg1, column2);` is called (see twice), then _cache doesn't contain any value for {col1, agg1} anymore.
Issue is in `_cache` `std::unordered_map` with `std::reference_wrapper<aggregation const>` in the key.

When `_cache[{input, key}] = std::move(value);` executes 2nd time, old key is destroyed. But the key's reference never changes which points to the destroyed key.
So, when compared again, `pair_column_aggregation_equal_to` fails because we are comparing a destroyed object (whose memory may have been overwritten).
#9507 (comment)

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #9508
  • Loading branch information
karthikeyann authored Oct 26, 2021
1 parent fea54e6 commit d6b624c
Show file tree
Hide file tree
Showing 2 changed files with 65 additions and 4 deletions.
8 changes: 4 additions & 4 deletions cpp/src/aggregation/result_cache.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -30,10 +30,10 @@ void result_cache::add_result(column_view const& input,
{
// We can't guarantee that agg will outlive the cache, so we need to take ownership of a copy.
// To allow lookup by reference, make the key a reference and keep the owner in the value pair.
auto owned_agg = agg.clone();
auto const& key = *owned_agg;
auto value = std::make_pair(std::move(owned_agg), std::move(col));
_cache[{input, key}] = std::move(value);
auto owned_agg = agg.clone();
auto const& key = *owned_agg;
// try_emplace doesn't update/insert if already present
_cache.try_emplace({input, key}, std::move(owned_agg), std::move(col));
}

column_view result_cache::get_result(column_view const& input, aggregation const& agg) const
Expand Down
61 changes: 61 additions & 0 deletions cpp/tests/groupby/keys_tests.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -289,5 +289,66 @@ TEST_F(groupby_dictionary_keys_test, basic)
force_use_sort_impl::YES);
}

struct groupby_cache_test : public cudf::test::BaseFixture {
};

// To check if the cache doesn't insert multiple times to cache for same aggregation on a column in
// same request.
// If this test fails, then insert happened and key stored in cache map becomes dangling reference.
// Any comparison with same aggregation as key will fail.
TEST_F(groupby_cache_test, duplicate_agggregations)
{
using K = int32_t;
using V = int32_t;

fixed_width_column_wrapper<K> keys{1, 2, 3, 1, 2, 2, 1, 3, 3, 2};
fixed_width_column_wrapper<V> vals{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
groupby::groupby gb_obj(table_view({keys}));

std::vector<groupby::aggregation_request> requests;
requests.emplace_back(groupby::aggregation_request());
requests[0].values = vals;
requests[0].aggregations.push_back(cudf::make_sum_aggregation<groupby_aggregation>());
requests[0].aggregations.push_back(cudf::make_sum_aggregation<groupby_aggregation>());

// hash groupby
EXPECT_NO_THROW(gb_obj.aggregate(requests));

// sort groupby
// WAR to force groupby to use sort implementation
requests[0].aggregations.push_back(make_nth_element_aggregation<groupby_aggregation>(0));
EXPECT_NO_THROW(gb_obj.aggregate(requests));
}

// To check if the cache doesn't insert multiple times to cache for same aggregation on same column
// but in different requests.
// If this test fails, then insert happened and key stored in cache map becomes dangling reference.
// Any comparison with same aggregation as key will fail.
TEST_F(groupby_cache_test, duplicate_columns)
{
using K = int32_t;
using V = int32_t;

fixed_width_column_wrapper<K> keys{1, 2, 3, 1, 2, 2, 1, 3, 3, 2};
fixed_width_column_wrapper<V> vals{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
groupby::groupby gb_obj(table_view({keys}));

std::vector<groupby::aggregation_request> requests;
requests.emplace_back(groupby::aggregation_request());
requests[0].values = vals;
requests[0].aggregations.push_back(cudf::make_sum_aggregation<groupby_aggregation>());
requests.emplace_back(groupby::aggregation_request());
requests[1].values = vals;
requests[1].aggregations.push_back(cudf::make_sum_aggregation<groupby_aggregation>());

// hash groupby
EXPECT_NO_THROW(gb_obj.aggregate(requests));

// sort groupby
// WAR to force groupby to use sort implementation
requests[0].aggregations.push_back(make_nth_element_aggregation<groupby_aggregation>(0));
EXPECT_NO_THROW(gb_obj.aggregate(requests));
}

} // namespace test
} // namespace cudf

0 comments on commit d6b624c

Please sign in to comment.