Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues when both usecols and names options are used in read_csv #12018

Merged
merged 27 commits into from
Nov 17, 2022
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
2a7b02d
fixes
vuule Oct 27, 2022
a079b30
C++ tests fixes
vuule Oct 27, 2022
20dd06a
Python test
vuule Oct 27, 2022
f1f738b
Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …
vuule Oct 27, 2022
053e9e8
docs
vuule Oct 27, 2022
c3b8df8
Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …
vuule Oct 28, 2022
6fd14a5
allow partial column names w/o selection
vuule Oct 31, 2022
2a2c3a3
Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …
vuule Oct 31, 2022
f04e251
Spark corner case
vuule Oct 31, 2022
b1b0239
Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …
vuule Nov 8, 2022
c8cda6f
Apply suggestions from code review
vuule Nov 8, 2022
91fa4fd
Merge branch 'bug-read_csv-usecols-names' of https://github.com/vuule…
vuule Nov 8, 2022
2f4277a
expand docs
vuule Nov 8, 2022
83d9a47
style
vuule Nov 8, 2022
81ac20a
extra columns tests
vuule Nov 8, 2022
5223bbd
remove comment
vuule Nov 8, 2022
fb7a3e4
revert unrelated fix
vuule Nov 9, 2022
40c3a7a
Merge branch 'branch-22.12' into bug-read_csv-usecols-names
vuule Nov 9, 2022
1833bf1
Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …
vuule Nov 16, 2022
905ba70
review suggestions
vuule Nov 16, 2022
ec848a2
update tests to new empty column type
vuule Nov 16, 2022
fef4dd4
Merge branch 'bug-read_csv-usecols-names' of https://github.com/vuule…
vuule Nov 16, 2022
008464d
Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into …
vuule Nov 16, 2022
ec9fe6a
CTAD
vuule Nov 16, 2022
b9d80a3
Merge branch 'bug-read_csv-usecols-names' of https://github.com/vuule…
vuule Nov 16, 2022
7039f22
review suggestions
vuule Nov 16, 2022
b8d16e7
add comments
vuule Nov 16, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 47 additions & 33 deletions cpp/src/io/csv/reader_impl.cu
Original file line number Diff line number Diff line change
Expand Up @@ -164,15 +164,17 @@ std::vector<std::string> get_column_names(std::vector<char> const& header,
const string new_col_name(first_row.data() + prev, col_name_len);
col_names.push_back(removeQuotes(new_col_name, parse_opts.quotechar));

// Stop parsing when we hit the line terminator; relevant when there is
// a blank line following the header. In this case, first_row includes
// multiple line terminators at the end, as the new recStart belongs to
// a line that comes after the blank line(s)
if (!quotation && first_row[pos] == parse_opts.terminator) { break; }
} else {
// This is the first data row, add the automatically generated name
col_names.push_back(prefix + std::to_string(num_cols));
}

// Stop parsing when we hit the line terminator; relevant when there is
// a blank line following the header. In this case, first_row includes
// multiple line terminators at the end, as the new recStart belongs to
// a line that comes after the blank line(s)
if (!quotation && first_row[pos] == parse_opts.terminator) { break; }

num_cols++;

// Skip adjacent delimiters if delim_whitespace is set
Expand Down Expand Up @@ -680,25 +682,20 @@ table_with_metadata read_csv(cudf::io::datasource* source,
auto const& row_offsets = data_row_offsets.second;

// Exclude the end-of-data row from number of rows with actual data
auto num_records = std::max(row_offsets.size(), 1ul) - 1;
auto column_flags = std::vector<column_parse::flags>();
auto column_names = std::vector<std::string>();
auto num_actual_columns = static_cast<int32_t>(reader_opts.get_names().size());
auto num_active_columns = num_actual_columns;
auto num_records = std::max(row_offsets.size(), 1ul) - 1;

// Check if the user gave us a list of column names
if (not reader_opts.get_names().empty()) {
column_flags.resize(reader_opts.get_names().size(),
column_parse::enabled | column_parse::inferred);
column_names = reader_opts.get_names();
} else {
column_names = get_column_names(
header, parse_opts.view(), reader_opts.get_header(), reader_opts.get_prefix());
auto const detected_column_names =
get_column_names(header, parse_opts.view(), reader_opts.get_header(), reader_opts.get_prefix());
auto const opts_have_all_col_names =
detected_column_names.empty() or reader_opts.get_names().size() == detected_column_names.size();
auto column_names = opts_have_all_col_names ? reader_opts.get_names() : detected_column_names;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't, names are potentially modified in a few places later in the code (empty names -> "Unnamed: col_index", mangle duplicates, apply names to selected column when index-based column selection is used).


num_actual_columns = num_active_columns = column_names.size();

column_flags.resize(num_actual_columns, column_parse::enabled | column_parse::inferred);
auto num_actual_columns = static_cast<int32_t>(column_names.size());
vuule marked this conversation as resolved.
Show resolved Hide resolved
auto num_active_columns = num_actual_columns;
auto column_flags = std::vector<column_parse::flags>(
num_actual_columns, column_parse::enabled | column_parse::inferred);

if (not opts_have_all_col_names) {
std::vector<size_t> col_loop_order(column_names.size());
auto unnamed_it = std::copy_if(
thrust::make_counting_iterator<size_t>(0),
Expand Down Expand Up @@ -761,22 +758,39 @@ table_with_metadata read_csv(cudf::io::datasource* source,
// User can specify which columns should be parsed
if (!reader_opts.get_use_cols_indexes().empty() || !reader_opts.get_use_cols_names().empty()) {
std::fill(column_flags.begin(), column_flags.end(), column_parse::disabled);
num_active_columns = 0;
vyasr marked this conversation as resolved.
Show resolved Hide resolved
}

for (const auto index : reader_opts.get_use_cols_indexes()) {
if (not reader_opts.get_use_cols_indexes().empty()) {
auto const unique_use_cols_indexes = std::set<int>(reader_opts.get_use_cols_indexes().cbegin(),
reader_opts.get_use_cols_indexes().cend());

auto const are_opts_col_names_used =
not reader_opts.get_names().empty() and not opts_have_all_col_names;
CUDF_EXPECTS(not are_opts_col_names_used or
reader_opts.get_names().size() == unique_use_cols_indexes.size(),
"Specify names of all columns in the file, or names of all selected columns");

for (auto const index : unique_use_cols_indexes) {
column_flags[index] = column_parse::enabled | column_parse::inferred;
if (are_opts_col_names_used) {
column_names[index] = reader_opts.get_names()[num_active_columns];
}
++num_active_columns;
}
num_active_columns = std::unordered_set<int>(reader_opts.get_use_cols_indexes().begin(),
reader_opts.get_use_cols_indexes().end())
.size();
}

for (const auto& name : reader_opts.get_use_cols_names()) {
const auto it = std::find(column_names.begin(), column_names.end(), name);
if (it != column_names.end()) {
auto curr_it = it - column_names.begin();
if (column_flags[curr_it] == column_parse::disabled) {
column_flags[curr_it] = column_parse::enabled | column_parse::inferred;
num_active_columns++;
}
if (not reader_opts.get_use_cols_names().empty()) {
vuule marked this conversation as resolved.
Show resolved Hide resolved
auto const unique_use_cols_names = std::unordered_set<std::string>(
reader_opts.get_use_cols_names().cbegin(), reader_opts.get_use_cols_names().cend());

for (auto const& name : unique_use_cols_names) {
auto const it = std::find(column_names.cbegin(), column_names.cend(), name);
CUDF_EXPECTS(it != column_names.end(), "Nonexistent column selected");
auto const curr_it = it - column_names.begin();
vyasr marked this conversation as resolved.
Show resolved Hide resolved
if (column_flags[curr_it] == column_parse::disabled) {
column_flags[curr_it] = column_parse::enabled | column_parse::inferred;
++num_active_columns;
}
}
}
Expand Down
25 changes: 22 additions & 3 deletions cpp/tests/io/csv_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -922,7 +922,7 @@ TEST_F(CsvReaderTest, Strings)
auto filepath = temp_env->get_temp_dir() + "Strings.csv";
{
std::ofstream outfile(filepath, std::ofstream::out);
outfile << names[0] << ',' << names[1] << ',' << '\n';
outfile << names[0] << ',' << names[1] << '\n';
outfile << "10,abc def ghi" << '\n';
outfile << "20,\"jkl mno pqr\"" << '\n';
outfile << "30,stu \"\"vwx\"\" yz" << '\n';
Expand Down Expand Up @@ -952,7 +952,7 @@ TEST_F(CsvReaderTest, StringsQuotes)
auto filepath = temp_env->get_temp_dir() + "StringsQuotes.csv";
{
std::ofstream outfile(filepath, std::ofstream::out);
outfile << names[0] << ',' << names[1] << ',' << '\n';
outfile << names[0] << ',' << names[1] << '\n';
outfile << "10,`abc,\ndef, ghi`" << '\n';
outfile << "20,`jkl, ``mno``, pqr`" << '\n';
outfile << "30,stu `vwx` yz" << '\n';
Expand Down Expand Up @@ -981,7 +981,7 @@ TEST_F(CsvReaderTest, StringsQuotesIgnored)
auto filepath = temp_env->get_temp_dir() + "StringsQuotesIgnored.csv";
{
std::ofstream outfile(filepath, std::ofstream::out);
outfile << names[0] << ',' << names[1] << ',' << '\n';
outfile << names[0] << ',' << names[1] << '\n';
outfile << "10,\"abcdef ghi\"" << '\n';
outfile << "20,\"jkl \"\"mno\"\" pqr\"" << '\n';
outfile << "30,stu \"vwx\" yz" << '\n';
Expand Down Expand Up @@ -2260,4 +2260,23 @@ TEST_F(CsvReaderTest, CsvDefaultOptionsWriteReadMatch)
EXPECT_EQ(new_table_and_metadata.metadata.column_names[1], "1");
}

TEST_F(CsvReaderTest, UseColsValidation)
{
std::string buffer = "1,2,3";
vuule marked this conversation as resolved.
Show resolved Hide resolved

cudf::io::csv_reader_options indexes_options =
cudf::io::csv_reader_options::builder(cudf::io::source_info{buffer.c_str(), buffer.size()})
.header(-1)
.names({"a", "b"})
.use_cols_indexes({0});
EXPECT_THROW(cudf::io::read_csv(indexes_options), cudf::logic_error);

cudf::io::csv_reader_options names_options =
cudf::io::csv_reader_options::builder(cudf::io::source_info{buffer.c_str(), buffer.size()})
.header(-1)
.names({"a", "b", "c"})
.use_cols_names({"nonexistent_name"});
EXPECT_THROW(cudf::io::read_csv(names_options), cudf::logic_error);
}

CUDF_TEST_PROGRAM_MAIN()
38 changes: 38 additions & 0 deletions python/cudf/cudf/tests/test_csv.py
Original file line number Diff line number Diff line change
Expand Up @@ -2130,3 +2130,41 @@ def test_default_float_bitwidth_partial(default_float_bitwidth):
)
assert read["float1"].dtype == np.dtype(f"f{default_float_bitwidth//8}")
assert read["float2"].dtype == np.dtype("f8")


@pytest.mark.parametrize(
"usecols,names",
[
# selection using indices; only names of selected columns are specified
([1, 2], ["b", "c"]),
# selection using indices; names of all columns are specified
([1, 2], ["a", "b", "c"]),
# selection using indices; duplicates
([2, 2], ["a", "b", "c"]),
# selection using indices; out of order
([2, 1], ["a", "b", "c"]),
# selection using names
(["b"], ["a", "b", "c"]),
# selection using names; multiple columns
(["b", "c"], ["a", "b", "c"]),
# selection using names; duplicates
(["c", "c"], ["a", "b", "c"]),
# selection using names; out of order
(["c", "b"], ["a", "b", "c"]),
],
)
def test_column_selection_plus_column_names(usecols, names):

lines = [
"num,datetime,text",
"123,2018-11-13T12:00:00,abc",
"456,2018-11-14T12:35:01,def",
"789,2018-11-15T18:02:59,ghi",
]

buffer = "\n".join(lines) + "\n"

assert_eq(
pd.read_csv(StringIO(buffer), usecols=usecols, names=names),
cudf.read_csv(StringIO(buffer), usecols=usecols, names=names),
)
3 changes: 2 additions & 1 deletion python/cudf/cudf/utils/ioutils.py
Original file line number Diff line number Diff line change
Expand Up @@ -948,7 +948,8 @@
the column names: if no names are passed, header=0;
if column names are passed explicitly, header=None.
names : list of str, default None
List of column names to be used.
List of column names to be used. Needs to include names of all column in
the file, or names of all columns selected using `usecols` (indices only).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indices only? pytest example has column names too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the requirement to have the same number of elements as usecols is limited to the case when usecols contains indices.
I included the failing cases in the C++ test UseColsValidation, the options object names indicate the reason why it should fail.
I can add comments there to make it clearer. Also, let me know if this comment should be worded differently to be clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(indices only) is only confusing. does indices only mean usecols uses indices, but not column names?

Rest all looks good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expanded the related comments, I hope they make sense now.

index_col : int, string or False, default None
Column to use as the row labels of the DataFrame. Passing `index_col=False`
explicitly disables index column inference and discards the last column.
Expand Down