Add explode_outer and explode_outer_position (#7499)
This code adds support for explode_outer and explode_outer_position. These differ from explode and explode_position in how they handle null and empty lists. Explode discards null and empty lists, so it can lift the child column directly out of the list column. Explode_outer must find these null and empty lists and make space for a null entry in the child column for each one. This means we need to gather both the table and the exploded column. Further, we must first make a pass over the exploded column to count these entries, because the required size of the gather maps is not known until then; the null count alone is not enough, since empty lists also need an entry.
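
The counting pass and the two gather maps described above can be modeled on the host in plain C++. This is only an illustrative sketch, not the code in this commit: `ListColumn`, `GatherMaps`, and `build_outer_gather_maps` are hypothetical names, the offsets are assumed to follow an Arrow-style layout (row `i` spans `offsets[i]` to `offsets[i + 1]`), and `-1` is an assumed sentinel meaning "emit a null" in the child gather map.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a list column.
struct ListColumn {
  std::vector<std::int32_t> offsets;  // size is number of rows + 1
  std::vector<bool> valid;            // false marks a null list row
};

// One entry per output row.
struct GatherMaps {
  std::vector<std::int32_t> parent;  // input row to copy the other columns from
  std::vector<std::int32_t> child;   // child element index, or -1 for a null entry
};

GatherMaps build_outer_gather_maps(ListColumn const& col)
{
  auto const num_rows = static_cast<std::int32_t>(col.valid.size());

  // Pass 1: size the output. A null or empty list still occupies one output row.
  std::size_t out_size = 0;
  for (std::int32_t row = 0; row < num_rows; ++row) {
    std::int32_t const len = col.offsets[row + 1] - col.offsets[row];
    out_size += (col.valid[row] && len > 0) ? static_cast<std::size_t>(len) : 1;
  }

  GatherMaps maps;
  maps.parent.reserve(out_size);
  maps.child.reserve(out_size);

  // Pass 2: fill both maps.
  for (std::int32_t row = 0; row < num_rows; ++row) {
    std::int32_t const begin = col.offsets[row];
    std::int32_t const end   = col.offsets[row + 1];
    if (!col.valid[row] || begin == end) {
      // Null or empty list: keep the row and mark the child entry as null.
      maps.parent.push_back(row);
      maps.child.push_back(-1);
      continue;
    }
    for (std::int32_t elem = begin; elem < end; ++elem) {
      maps.parent.push_back(row);
      maps.child.push_back(elem);
    }
  }
  return maps;
}
```

For the list column `[[5,null,15], null, []]` (offsets `{0, 3, 3, 3}`), this sketch yields the parent map `{0, 0, 0, 1, 2}` and the child map `{0, 1, 2, -1, -1}`, which matches the five output rows shown in the explode_outer documentation.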

If there are no null or empty lists in the input, the normal explode function is called because it is simpler. Detecting this does come at the cost of walking the offsets looking for duplicates, which indicate null or empty lists.
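
The duplicate-offset check can be sketched as a scan for adjacent equal values. This is a minimal host-side model under stated assumptions, not the commit's implementation: `has_null_or_empty_lists` is a hypothetical helper name, and null lists are assumed to be stored with zero length so that they show up as repeated offsets.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns true when any two neighboring offsets are equal, meaning some row
// contributes no child elements (an empty list or a zero-length null list).
bool has_null_or_empty_lists(std::vector<std::int32_t> const& offsets)
{
  // std::adjacent_find returns the first position whose value equals its successor.
  return std::adjacent_find(offsets.begin(), offsets.end()) != offsets.end();
}
```

When this returns false, every row has at least one element, so the simpler explode path produces the same result as explode_outer.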

closes #7466

Authors:
  - Mike Wilson (@hyperbolic2346)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - Jake Hemstad (@jrhemstad)
  - Nghia Truong (@ttnghia)

URL: #7499
hyperbolic2346 authored Mar 17, 2021
1 parent 34cccfe commit 0146f74
Showing 13 changed files with 1,381 additions and 810 deletions.
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -134,6 +134,7 @@ test:
- test -f $PREFIX/include/cudf/lists/detail/copying.hpp
- test -f $PREFIX/include/cudf/lists/detail/sorting.hpp
- test -f $PREFIX/include/cudf/lists/count_elements.hpp
- test -f $PREFIX/include/cudf/lists/explode.hpp
- test -f $PREFIX/include/cudf/lists/drop_list_duplicates.hpp
- test -f $PREFIX/include/cudf/lists/extract.hpp
- test -f $PREFIX/include/cudf/lists/contains.hpp
2 changes: 1 addition & 1 deletion cpp/CMakeLists.txt
@@ -260,6 +260,7 @@ add_library(cudf
src/lists/copying/gather.cu
src/lists/copying/segmented_gather.cu
src/lists/count_elements.cu
src/lists/explode.cu
src/lists/extract.cu
src/lists/drop_list_duplicates.cu
src/lists/lists_column_factories.cu
@@ -289,7 +290,6 @@ add_library(cudf
src/replace/nulls.cu
src/replace/replace.cu
src/reshape/byte_cast.cu
src/reshape/explode.cu
src/reshape/interleave_columns.cu
src/reshape/tile.cu
src/rolling/grouped_rolling.cu
200 changes: 200 additions & 0 deletions cpp/include/cudf/lists/explode.hpp
@@ -0,0 +1,200 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cudf/column/column.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>
#include <memory>

namespace cudf {

/**
* @brief Explodes a list column's elements.
*
* Any list is exploded, which means the elements of the list in each row are expanded into new rows
* in the output. The corresponding rows for other columns in the input are duplicated. Example:
* ```
* [[5,10,15], 100],
* [[20,25], 200],
* [[30], 300],
* returns
* [5, 100],
* [10, 100],
* [15, 100],
* [20, 200],
* [25, 200],
* [30, 300],
* ```
*
* Nulls and empty lists propagate in different ways depending on what is null or empty.
*```
* [[5,null,15], 100],
* [null, 200],
* [[], 300],
* returns
* [5, 100],
* [null, 100],
* [15, 100],
* ```
* Note that null lists and empty lists are not included in the resulting table, but nulls inside
* lists will be represented with a null entry for that column in that row.
*
* @param input_table Table to explode.
* @param explode_column_idx Column index to explode inside the table.
* @param mr Device memory resource used to allocate the returned column's device memory.
*
* @return A new table with the column at `explode_column_idx` exploded.
*/
std::unique_ptr<table> explode(
table_view const& input_table,
size_type explode_column_idx,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Explodes a list column's elements and includes a position column.
*
* Any list is exploded, which means the elements of the list in each row are expanded into new rows
* in the output. The corresponding rows for other columns in the input are duplicated. A position
* column is added that has the index inside the original list for each row. Example:
* ```
* [[5,10,15], 100],
* [[20,25], 200],
* [[30], 300],
* returns
* [0, 5, 100],
* [1, 10, 100],
* [2, 15, 100],
* [0, 20, 200],
* [1, 25, 200],
* [0, 30, 300],
* ```
*
* Nulls and empty lists propagate in different ways depending on what is null or empty.
*```
* [[5,null,15], 100],
* [null, 200],
* [[], 300],
* returns
* [0, 5, 100],
* [1, null, 100],
* [2, 15, 100],
* ```
* Note that null lists and empty lists are not included in the resulting table, but nulls inside
* lists will be represented with a null entry for that column in that row.
*
* @param input_table Table to explode.
* @param explode_column_idx Column index to explode inside the table.
* @param mr Device memory resource used to allocate the returned column's device memory.
*
* @return A new table with exploded value and position. The column order of return table is
* [cols before explode_input, explode_position, explode_value, cols after explode_input].
*/
std::unique_ptr<table> explode_position(
table_view const& input_table,
size_type explode_column_idx,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Explodes a list column's elements retaining any null entries or empty lists inside.
*
* Any list is exploded, which means the elements of the list in each row are expanded into new rows
* in the output. The corresponding rows for other columns in the input are duplicated. Example:
* ```
* [[5,10,15], 100],
* [[20,25], 200],
* [[30], 300],
* returns
* [5, 100],
* [10, 100],
* [15, 100],
* [20, 200],
* [25, 200],
* [30, 300],
* ```
*
* Nulls and empty lists propagate as null entries in the result.
*```
* [[5,null,15], 100],
* [null, 200],
* [[], 300],
* returns
* [5, 100],
* [null, 100],
* [15, 100],
* [null, 200],
* [null, 300],
* ```
*
* @param input_table Table to explode.
* @param explode_column_idx Column index to explode inside the table.
* @param mr Device memory resource used to allocate the returned column's device memory.
*
* @return A new table with the column at `explode_column_idx` exploded.
*/
std::unique_ptr<table> explode_outer(
table_view const& input_table,
size_type explode_column_idx,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Explodes a list column's elements retaining any null entries or empty lists and includes a
* position column.
*
* Any list is exploded, which means the elements of the list in each row are expanded into new rows
* in the output. The corresponding rows for other columns in the input are duplicated. A position
* column is added that has the index inside the original list for each row. Example:
* ```
* [[5,10,15], 100],
* [[20,25], 200],
* [[30], 300],
* returns
* [0, 5, 100],
* [1, 10, 100],
* [2, 15, 100],
* [0, 20, 200],
* [1, 25, 200],
* [0, 30, 300],
* ```
*
* Nulls and empty lists propagate as null entries in the result.
*```
* [[5,null,15], 100],
* [null, 200],
* [[], 300],
* returns
* [0, 5, 100],
* [1, null, 100],
* [2, 15, 100],
* [0, null, 200],
* [0, null, 300],
* ```
*
* @param input_table Table to explode.
* @param explode_column_idx Column index to explode inside the table.
* @param mr Device memory resource used to allocate the returned column's device memory.
*
* @return A new table with the column at `explode_column_idx` exploded and a position column added.
*/
std::unique_ptr<table> explode_outer_position(
table_view const& input_table,
size_type explode_column_idx,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of group

} // namespace cudf
86 changes: 0 additions & 86 deletions cpp/include/cudf/reshape.hpp
@@ -97,92 +97,6 @@ std::unique_ptr<column> byte_cast(
flip_endianness endian_configuration,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Explodes a list column's elements.
*
* Any list is exploded, which means the elements of the list in each row are expanded into new rows
* in the output. The corresponding rows for other columns in the input are duplicated. Example:
* ```
* [[5,10,15], 100],
* [[20,25], 200],
* [[30], 300],
* returns
* [5, 100],
* [10, 100],
* [15, 100],
* [20, 200],
* [25, 200],
* [30, 300],
* ```
*
* Nulls and empty lists propagate in different ways depending on what is null or empty.
*```
* [[5,null,15], 100],
* [null, 200],
* [[], 300],
* returns
* [5, 100],
* [null, 100],
* [15, 100],
* ```
* Note that null lists are not included in the resulting table, but nulls inside
* lists and empty lists will be represented with a null entry for that column in that row.
*
* @param input_table Table to explode.
* @param explode_column_idx Column index to explode inside the table.
* @param mr Device memory resource used to allocate the returned column's device memory.
*
* @return A new table with explode_col exploded.
*/
std::unique_ptr<table> explode(
table_view const& input_table,
size_type explode_column_idx,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Explodes a list column's elements and includes a position column.
*
* Any list is exploded, which means the elements of the list in each row are expanded into new rows
* in the output. The corresponding rows for other columns in the input are duplicated. A position
* column is added that has the index inside the original list for each row. Example:
* ```
* [[5,10,15], 100],
* [[20,25], 200],
* [[30], 300],
* returns
* [0, 5, 100],
* [1, 10, 100],
* [2, 15, 100],
* [0, 20, 200],
* [1, 25, 200],
* [0, 30, 300],
* ```
*
* Nulls and empty lists propagate in different ways depending on what is null or empty.
*```
* [[5,null,15], 100],
* [null, 200],
* [[], 300],
* returns
* [0, 5, 100],
* [1, null, 100],
* [2, 15, 100],
* ```
* Note that null lists are not included in the resulting table, but nulls inside
* lists and empty lists will be represented with a null entry for that column in that row.
*
* @param input_table Table to explode.
* @param explode_column_idx Column index to explode inside the table.
* @param mr Device memory resource used to allocate the returned column's device memory.
*
* @return A new table with exploded value and position. The column order of return table is
* [cols before explode_input, explode_position, explode_value, cols after explode_input].
*/
std::unique_ptr<table> explode_position(
table_view const& input_table,
size_type explode_column_idx,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of group

} // namespace cudf
26 changes: 25 additions & 1 deletion cpp/include/cudf/table/table.hpp
@@ -110,6 +110,27 @@ class table {
*/
std::vector<std::unique_ptr<column>> release();

/**
* @brief Returns a table_view built from a range of column indices.
*
* @throws std::out_of_range
* If any index is outside [0, num_columns())
*
* @param begin Beginning of the range
* @param end Ending of the range
* @return A table_view consisting of columns from the original table
* specified by the indices in the range [`begin`, `end`)
*/

template <typename InputIterator>
table_view select(InputIterator begin, InputIterator end) const
{
std::vector<column_view> columns(std::distance(begin, end));
std::transform(
begin, end, columns.begin(), [this](auto index) { return _columns.at(index)->view(); });
return table_view(columns);
}

/**
* @brief Returns a table_view with set of specified columns.
*
@@ -120,7 +141,10 @@ class table {
* @return A table_view consisting of columns from the original table
* specified by the elements of `column_indices`
*/
table_view select(std::vector<cudf::size_type> const& column_indices) const;
table_view select(std::vector<cudf::size_type> const& column_indices) const
{
return select(column_indices.begin(), column_indices.end());
};

/**
* @brief Returns a reference to the specified column
19 changes: 19 additions & 0 deletions cpp/include/cudf/table/table_view.hpp
@@ -174,6 +174,25 @@ class table_view : public detail::table_view_base<column_view> {
*/
table_view(std::vector<table_view> const& views);

/**
* @brief Returns a table_view built from a range of column indices.
*
* @throws std::out_of_range
* If any index is outside [0, num_columns())
*
* @param begin Beginning of the range
* @param end Ending of the range
* @return A table_view consisting of columns from the original table
* specified by the indices in the range [`begin`, `end`)
*/
template <typename InputIterator>
table_view select(InputIterator begin, InputIterator end) const
{
std::vector<column_view> columns(std::distance(begin, end));
std::transform(begin, end, columns.begin(), [this](auto index) { return this->column(index); });
return table_view(columns);
}

/**
* @brief Returns a table_view with set of specified columns.
*