Skip to content

Commit

Permalink
Fix to_csv delimiter handling of timestamp format(#7023)
Browse files Browse the repository at this point in the history
Closes #6699 
The timestamp format(s) used by the CSV writer have the form `%Y-%m-%dT%H:%M:%SZ`. This means if the column delimiter `','` or the line delimiter `\n` is either `':'` or `'-'` then the timestamp string output could conflict with these delimiters. The current logic simply removed these delimiters from the format if they detected a conflicting column or line delimiter. For example, specifying a dash `'-'` as column delimiter caused the timestamp format to change to `%Y%m%d...` (the dash is removed). I admit this was kind of hacky and also made the output inconsistent with Pandas `to_csv()`.

It is easy enough to simply add double-quotes around the timestamp format to prevent these conflicts as well as make the output consistent. This PR fixes that logic.

Exception logic to check for a dash as column separator was also found in [csv.py](https://github.com/rapidsai/cudf/blob/8c1f01e1fd713d873cf3d943ab409f3e9efc48f8/python/cudf/cudf/io/csv.py#L139-L149), specifically citing issue 6699 in the exception message. Also, there was a pytest specifically created to check for this exception. The exception is removed and the pytest function updated in this PR as well.

Authors:
  - davidwendt <[email protected]>

Approvers:
  - GALI PREM SAGAR
  - Karthikeyan
  - null

URL: #7023
  • Loading branch information
davidwendt authored Jan 4, 2021
1 parent af41136 commit ca1a4d6
Show file tree
Hide file tree
Showing 3 changed files with 8 additions and 31 deletions.
10 changes: 3 additions & 7 deletions cpp/src/io/csv/writer_impl.cu
Original file line number Diff line number Diff line change
Expand Up @@ -311,19 +311,15 @@ struct column_to_strings_fn {
}();

// handle the cases where delimiter / line-terminator can be
// "-" or ":", in which case they are to be dropped from the format:
// "-" or ":", in which case we need to add quotes to the format
//
std::string delimiter{options_.get_inter_column_delimiter()};
std::string newline{options_.get_line_terminator()};

constexpr char const* dash{"-"};
constexpr char const* colon{":"};
if (delimiter == dash || newline == dash) {
format.erase(std::remove(format.begin(), format.end(), dash[0]), format.end());
}

if (delimiter == colon || newline == colon) {
format.erase(std::remove(format.begin(), format.end(), colon[0]), format.end());
if (delimiter == dash || newline == dash || delimiter == colon || newline == colon) {
format = "\"" + format + "\"";
}

auto conv_col_ptr = cudf::strings::from_timestamps(column, format, mr_);
Expand Down
12 changes: 0 additions & 12 deletions python/cudf/cudf/io/csv.py
Original file line number Diff line number Diff line change
Expand Up @@ -136,18 +136,6 @@ def to_csv(
"Dataframe doesn't have the labels provided in columns"
)

if sep == "-":
# TODO: Remove this error once following issue is fixed:
# https://github.com/rapidsai/cudf/issues/6699
if any(
isinstance(col, cudf.core.column.DatetimeColumn)
for col in df._data.columns
):
raise ValueError(
"sep cannot be '-' when writing a datetime64 dtype to csv, "
"refer to: https://github.com/rapidsai/cudf/issues/6699"
)

# TODO: Need to typecast categorical columns to the underlying
# categories dtype to write the actual data to csv. Remove this
# workaround once following issue is fixed:
Expand Down
17 changes: 5 additions & 12 deletions python/cudf/cudf/tests/test_csv.py
Original file line number Diff line number Diff line change
Expand Up @@ -1905,21 +1905,14 @@ def test_csv_reader_category_error():
cudf.read_csv(StringIO(csv_buf), dtype="category")


def test_csv_writer_datetime_sep_error():
# TODO: Remove this test once following
# issues is fixed: https://github.com/rapidsai/cudf/issues/6699
def test_csv_writer_datetime_sep():
df = cudf.DataFrame(
{"a": cudf.Series([22343, 2323423, 234324234], dtype="datetime64[ns]")}
)

with pytest.raises(
ValueError,
match=re.escape(
"sep cannot be '-' when writing a datetime64 dtype to csv, "
"refer to: https://github.com/rapidsai/cudf/issues/6699"
),
):
df.to_csv(sep="-")
df["a"] = df["a"].astype("datetime64[s]")
expected = df.to_pandas().to_csv(date_format="%Y-%m-%dT%H:%M:%SZ", sep="-")
actual = df.to_csv(sep="-")
assert expected == actual


def test_na_filter_empty_fields():
Expand Down

0 comments on commit ca1a4d6

Please sign in to comment.