Create an int8 column in read_csv when all elements are missing #12110

Merged
3 changes: 1 addition & 2 deletions cpp/src/io/csv/reader_impl.cu
@@ -538,8 +538,7 @@ void infer_column_types(parse_options const& parse_opts,
     auto const& stats = column_stats[inf_col_idx++];
     unsigned long long int_count_total =
       stats.big_int_count + stats.negative_small_int_count + stats.positive_small_int_count;
-
-    if (stats.null_count == num_records) {
+    if (stats.null_count == num_records or stats.total_count() == 0) {
       // Entire column is NULL; allocate the smallest amount of memory
       column_types[col_idx] = data_type(cudf::type_id::INT8);
     } else if (stats.string_count > 0L) {
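
For context, here is a minimal standalone sketch of why the extra check seems to be needed. It assumes, based on the test added in this PR, that a column listed in `names` but absent from the file ends up with a histogram whose counts are all zero. `histogram_sketch` and `is_all_null` are hypothetical stand-ins (with `std::int32_t` in place of `cudf::size_type`), not the libcudf code.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for column_type_histogram.
// std::int32_t is used here instead of cudf::size_type.
struct histogram_sketch {
  std::int32_t null_count{};
  std::int32_t float_count{};
  std::int32_t datetime_count{};
  std::int32_t string_count{};
  std::int32_t negative_small_int_count{};
  std::int32_t positive_small_int_count{};
  std::int32_t big_int_count{};
  std::int32_t bool_count{};

  // Sum of every counter, mirroring total_count() from the diff.
  std::int32_t total_count() const
  {
    return null_count + float_count + datetime_count + string_count +
           negative_small_int_count + positive_small_int_count + big_int_count + bool_count;
  }
};

// Hypothetical helper mirroring the updated condition:
// a column is treated as all-null if every record was counted as null,
// or if nothing was counted for it at all.
inline bool is_all_null(histogram_sketch const& stats, std::int32_t num_records)
{
  return stats.null_count == num_records || stats.total_count() == 0;
}
```

Under that assumption, an absent column in a two-row file has `null_count == 0`, so the old test `null_count == num_records` is false and the column would fall through to the later branches; the `total_count() == 0` check catches it.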
5 changes: 5 additions & 0 deletions cpp/src/io/utilities/column_type_histogram.hpp
@@ -33,6 +33,11 @@ struct column_type_histogram {
   cudf::size_type positive_small_int_count{};
   cudf::size_type big_int_count{};
   cudf::size_type bool_count{};
+  auto total_count() const

Perhaps for the future: this struct would be more versatile and more easily extensible (and the totals easier to calculate without bugs) if it stored a std::array (or similar) of counts, indexed by named constants. Then this line could just be a std::accumulate (or similar).

+  {
+    return null_count + float_count + datetime_count + string_count + negative_small_int_count +
+           positive_small_int_count + big_int_count + bool_count;
+  }
 };

} // namespace io
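
The review comment on `total_count()` suggests storing the counts in an array indexed by named constants. A rough sketch of that idea follows. The names `count_kind`, `*_idx`, and `histogram_array_sketch` are hypothetical, `std::int32_t` stands in for `cudf::size_type`, and whether `std::array` and `std::accumulate` suit the device code that uses the real struct is not checked here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical index names, one per counter in the original struct.
enum count_kind : std::size_t {
  null_idx,
  float_idx,
  datetime_idx,
  string_idx,
  negative_small_int_idx,
  positive_small_int_idx,
  big_int_idx,
  bool_idx,
  num_count_kinds  // number of counters; keep last
};

// Hypothetical array-backed variant of the histogram.
struct histogram_array_sketch {
  std::array<std::int32_t, num_count_kinds> counts{};  // all zero-initialized

  std::int32_t& operator[](count_kind k) { return counts[k]; }
  std::int32_t operator[](count_kind k) const { return counts[k]; }

  // Adding a new counter only requires a new enumerator; the total stays correct.
  std::int32_t total_count() const
  {
    return std::accumulate(counts.begin(), counts.end(), std::int32_t{0});
  }
};
```

With this layout, a call site would write something like `stats[string_idx] > 0` instead of `stats.string_count > 0`, and the total cannot silently miss a newly added counter.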
21 changes: 21 additions & 0 deletions cpp/tests/io/csv_test.cpp
@@ -2244,6 +2244,27 @@ TEST_F(CsvReaderTest, CsvDefaultOptionsWriteReadMatch)
   EXPECT_EQ(new_table_and_metadata.metadata.column_names[1], "1");
 }

+TEST_F(CsvReaderTest, EmptyColumns)
+{
+  // First column only has empty fields. second column contains only "null" literals
+  std::string csv_in{",null\n,null"};
+
+  cudf::io::csv_reader_options in_opts =
+    cudf::io::csv_reader_options::builder(cudf::io::source_info{csv_in.c_str(), csv_in.size()})
+      .names({"a", "b", "c", "d"})
+      .header(-1);
+  // More elements in `names` than in the file; additional columns are filled with nulls
+  auto result = cudf::io::read_csv(in_opts);
+
+  const auto result_table = result.tbl->view();
+  EXPECT_EQ(result_table.num_columns(), 4);
+  // All columns should contain only nulls; expect INT8 type to use as little memory as possible
+  for (auto& column : result_table) {
+    EXPECT_EQ(column.type(), data_type{type_id::INT8});
+    EXPECT_EQ(column.null_count(), 2);
+  }
+}
+
 TEST_F(CsvReaderTest, BlankLineAfterFirstRow)
 {
   std::string csv_in{"12,9., 10\n\n"};