Support additional format specifiers in from_timestamps (#9047)
Reference #5991 

This PR adds support for the following format specifiers in `cudf::strings::from_timestamps`:
```
%a and %A -- weekday names (passed into the API)
%b and %B -- month names (passed into the API)
%u - ISO weekday (1-7)
%w - weekday (0-6)
%U - week of the year (Sunday based)
%W - week of the year (Monday based)
%V - ISO week of the year
%G - Year based on ISO weeks
```
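
For reference, the numeric specifiers in this list have the same meaning as in the C library's `strftime`, so their expected output for a given date can be checked on the host. The sketch below is an illustration only: `format_date` is a hypothetical helper built on `std::mktime` and `std::strftime`, and it does not call cudf.

```cpp
#include <cassert>
#include <ctime>
#include <string>

// Hypothetical helper (not part of cudf): formats a calendar date with
// std::strftime so the output of the numeric specifiers can be inspected.
std::string format_date(int year, int month, int day, char const* format)
{
  assert(month >= 1 && month <= 12 && day >= 1 && day <= 31);
  std::tm tm{};
  tm.tm_year  = year - 1900;  // std::tm counts years from 1900
  tm.tm_mon   = month - 1;    // std::tm months are 0-based
  tm.tm_mday  = day;
  tm.tm_hour  = 12;           // midday keeps the date stable across DST changes
  tm.tm_isdst = -1;           // let the library decide whether DST applies
  std::mktime(&tm);           // normalizes and fills in tm_wday and tm_yday
  char buffer[64];
  auto const size = std::strftime(buffer, sizeof(buffer), format, &tm);
  return std::string(buffer, size);
}
```

For example, `format_date(2021, 1, 1, "%u %w %U %W %V %G")` returns `"5 5 00 00 53 2020"`: 1 January 2021 is a Friday that falls before the first Sunday and the first Monday of the year, and it belongs to ISO week 53 of 2020.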

This adds a new parameter to the API that lets the caller pass the string names for the weekdays and months. These are only required if the format string contains the `%a`, `%A`, `%b`, or `%B` specifiers.
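
The header documentation in this commit lays the names out as 40 strings: AM/PM, full weekday names starting with Sunday, abbreviated weekday names, full month names, then abbreviated month names. A minimal host-side sketch of that layout with English names follows. The offset constants and the lookup helpers are hypothetical and only mirror the documented order; they are not cudf identifiers, and copying the strings into a device strings column is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Offsets derived from the documented layout (assumed names, not cudf API).
constexpr std::size_t weekday_offset      = 2;   // "Sunday" .. "Saturday"
constexpr std::size_t weekday_abbr_offset = 9;   // "Sun" .. "Sat"
constexpr std::size_t month_offset        = 16;  // "January" .. "December"
constexpr std::size_t month_abbr_offset   = 28;  // "Jan" .. "Dec"
constexpr std::size_t names_count         = 40;

// Builds the 40 names in the documented order using English strings.
std::vector<std::string> english_names()
{
  std::vector<std::string> names{
    "AM", "PM",
    "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
    "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat",
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"};
  assert(names.size() == names_count);
  return names;
}

// weekday uses the %w numbering: 0 = Sunday .. 6 = Saturday.
std::string const& weekday_name(std::vector<std::string> const& names, int weekday, bool abbreviated)
{
  auto const offset = abbreviated ? weekday_abbr_offset : weekday_offset;
  return names[offset + static_cast<std::size_t>(weekday)];
}

// month is 1-based: 1 = January .. 12 = December.
std::string const& month_name(std::vector<std::string> const& names, int month, bool abbreviated)
{
  auto const offset = abbreviated ? month_abbr_offset : month_offset;
  return names[offset + static_cast<std::size_t>(month - 1)];
}
```

With this layout, `weekday_name(names, 2, true)` yields `"Tue"` and `month_name(names, 8, false)` yields `"August"`.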

The change to `from_timestamps` is mainly a rewrite to include logic for these specifiers. Some common code required corresponding changes to `to_timestamps` and `is_timestamp`, though the behavior of these functions has not changed in this PR.
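
As a rough illustration of the kind of logic the week-based specifiers need, here is a host-side sketch of the standard calendar arithmetic for `%U`, `%W`, `%V` and `%G`. It is not the code from this commit; the function and struct names are assumptions chosen for the example. The ISO rule used is that a week belongs to the year that contains its Thursday.

```cpp
#include <cassert>

// yday is the 0-based day of the year; wday is 0 = Sunday .. 6 = Saturday.

// %U: week of the year, weeks start on Sunday, days before the first Sunday are week 0.
int week_of_year_sunday(int yday, int wday) { return (yday + 7 - wday) / 7; }

// %W: week of the year, weeks start on Monday, days before the first Monday are week 0.
int week_of_year_monday(int yday, int wday) { return (yday + 7 - (wday + 6) % 7) / 7; }

bool is_leap_year(int year) { return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0; }
int days_in_year(int year) { return is_leap_year(year) ? 366 : 365; }

struct iso_week_date {
  int year;  // %G
  int week;  // %V
};

// ISO-8601 week date: locate the Thursday of the current week, then count
// whole weeks from the start of the year that Thursday falls in.
iso_week_date iso_week(int year, int yday, int wday)
{
  assert(yday >= 0 && yday < days_in_year(year) && wday >= 0 && wday <= 6);
  int const iso_wday = (wday == 0) ? 7 : wday;  // %u: Monday = 1 .. Sunday = 7
  int thursday       = yday - (iso_wday - 1) + 3;
  if (thursday < 0) {
    // The week's Thursday is in the previous year.
    --year;
    thursday += days_in_year(year);
  } else if (thursday >= days_in_year(year)) {
    // The week's Thursday is in the next year.
    thursday -= days_in_year(year);
    ++year;
  }
  return {year, thursday / 7 + 1};
}
```

For 1 January 2021 (`yday = 0`, `wday = 5`) this gives week 0 for both `%U` and `%W`, and ISO week 53 of 2020 for `%V` and `%G`.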

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Conor Hoekstra (https://github.com/codereport)

URL: #9047
davidwendt authored Aug 31, 2021
1 parent 8a3efd0 commit fdcb90a
Showing 5 changed files with 752 additions and 447 deletions.
84 changes: 75 additions & 9 deletions cpp/include/cudf/strings/convert/convert_datetime.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -18,6 +18,9 @@
#include <cudf/column/column.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <string>
#include <vector>

namespace cudf {
namespace strings {
/**
@@ -135,33 +138,96 @@ std::unique_ptr<column> is_timestamp(
* | \%z | Always outputs "+0000" |
* | \%Z | Always outputs "UTC" |
* | \%j | Day of the year: 001-366 |
* | \%p | Only 'AM' or 'PM' |
* | \%u | ISO weekday where Monday is 1 and Sunday is 7 |
* | \%w | Weekday where Sunday is 0 and Saturday is 6 |
* | \%U | Week of the year with Sunday as the first day: 00-53 |
* | \%W | Week of the year with Monday as the first day: 00-53 |
* | \%V | Week of the year per ISO-8601 format: 01-53 |
* | \%G | Year based on the ISO-8601 weeks: 0000-9999 |
* | \%p | AM/PM from `timestamp_names::am_str/pm_str` |
* | \%a | Weekday abbreviation from the `names` parameter |
* | \%A | Weekday from the `names` parameter |
* | \%b | Month name abbreviation from the `names` parameter |
* | \%B | Month name from the `names` parameter |
*
* Additional descriptions can be found here:
* https://en.cppreference.com/w/cpp/chrono/system_clock/formatter
*
* No checking is done for invalid formats or invalid timestamp values.
* All timestamps values are formatted to UTC.
*
* Any null input entry will result in a corresponding null entry in the output column.
*
* The time units of the input column do not influence the number of digits written by
* the "%f" specifier.
* The "%f" supports a precision value to write out numeric digits for the subsecond value.
* Specify the precision with a single integer value (1-9) between the "%" and the "f" as follows:
* use "%3f" for milliseconds, "%6f" for microseconds and "%9f" for nanoseconds.
* If the precision is higher than the units, then zeroes are padded to the right of
* the subsecond value.
* If the precision is lower than the units, the subsecond value may be truncated.
* the "%f" specifier. The "%f" supports a precision value to write out numeric digits
* for the subsecond value. Specify the precision with a single integer value (1-9)
* between the "%" and the "f" as follows: use "%3f" for milliseconds, use "%6f" for
* microseconds and use "%9f" for nanoseconds. If the precision is higher than the
* units, then zeroes are padded to the right of the subsecond value. If the precision
* is lower than the units, the subsecond value may be truncated.
*
* If the "%a", "%A", "%b", "%B" specifiers are included in the format, the caller
* should provide the format names in the `names` strings column using the following
* as a guide:
*
* @code{.pseudo}
* ["AM", "PM", // specify the AM/PM strings
* "Sunday", "Monday", ..., "Saturday", // Weekday full names
* "Sun", "Mon", ..., "Sat", // Weekday abbreviated names
* "January", "February", ..., "December", // Month full names
* "Jan", "Feb", ..., "Dec"] // Month abbreviated names
* @endcode
*
* The result is undefined if the format names are not provided for these specifiers.
*
* These format names can be retrieved for specific locales using the `nl_langinfo`
* functions from C++ `clocale` (std) library or the Python `locale` library.
*
* The following code is an example of retrieving these strings from the locale
* using c++ std functions:
*
* @code{.cpp}
* #include <clocale>
* #include <langinfo.h>
*
* // note: install language pack on Ubuntu using 'apt-get install language-pack-de'
* {
* // set to a German language locale for date settings
* std::setlocale(LC_TIME, "de_DE.UTF-8");
*
* std::vector<std::string> names({nl_langinfo(AM_STR), nl_langinfo(PM_STR),
* nl_langinfo(DAY_1), nl_langinfo(DAY_2), nl_langinfo(DAY_3), nl_langinfo(DAY_4),
* nl_langinfo(DAY_5), nl_langinfo(DAY_6), nl_langinfo(DAY_7),
* nl_langinfo(ABDAY_1), nl_langinfo(ABDAY_2), nl_langinfo(ABDAY_3), nl_langinfo(ABDAY_4),
* nl_langinfo(ABDAY_5), nl_langinfo(ABDAY_6), nl_langinfo(ABDAY_7),
* nl_langinfo(MON_1), nl_langinfo(MON_2), nl_langinfo(MON_3), nl_langinfo(MON_4),
* nl_langinfo(MON_5), nl_langinfo(MON_6), nl_langinfo(MON_7), nl_langinfo(MON_8),
* nl_langinfo(MON_9), nl_langinfo(MON_10), nl_langinfo(MON_11), nl_langinfo(MON_12),
* nl_langinfo(ABMON_1), nl_langinfo(ABMON_2), nl_langinfo(ABMON_3), nl_langinfo(ABMON_4),
* nl_langinfo(ABMON_5), nl_langinfo(ABMON_6), nl_langinfo(ABMON_7), nl_langinfo(ABMON_8),
* nl_langinfo(ABMON_9), nl_langinfo(ABMON_10), nl_langinfo(ABMON_11), nl_langinfo(ABMON_12)});
*
* std::setlocale(LC_TIME,""); // reset to default locale
* }
* @endcode
*
* @throw cudf::logic_error if `timestamps` column parameter is not a timestamp type.
* @throw cudf::logic_error if the `format` string is empty
* @throw cudf::logic_error if `names.size()` is an invalid size. Must be 0 or 40 strings.
*
* @param timestamps Timestamp values to convert.
* @param format The string specifying output format.
* Default format is "%Y-%m-%dT%H:%M:%SZ".
* @param names The string names to use for weekdays ("%a", "%A") and months ("%b", "%B")
* Default is an empty `strings_column_view`.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New strings column with formatted timestamps.
*/
std::unique_ptr<column> from_timestamps(
column_view const& timestamps,
std::string const& format = "%Y-%m-%dT%H:%M:%SZ",
strings_column_view const& names = strings_column_view(column_view{
data_type{type_id::STRING}, 0, nullptr}),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of doxygen group
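
The `%f` precision rule described in the comment above (digits are written regardless of the column's time units, padded with zeroes on the right or truncated) can be sketched on the host as follows. `format_subsecond` is a hypothetical helper, not a cudf function, and it assumes the subsecond part has already been converted to nanoseconds.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of cudf): writes the first `precision` digits
// of a subsecond value given in nanoseconds (0 .. 999999999).
std::string format_subsecond(int nanoseconds, int precision)
{
  assert(nanoseconds >= 0 && nanoseconds <= 999999999);
  assert(precision >= 1 && precision <= 9);
  char digits[10];  // 9 digits plus the terminating null
  std::snprintf(digits, sizeof(digits), "%09d", nanoseconds);
  // Keeping only the leading digits truncates; the trailing zeroes of a
  // coarser unit act as the right-hand padding.
  return std::string(digits, static_cast<std::size_t>(precision));
}
```

A millisecond value of 123 (123000000 ns) with `"%6f"` becomes `"123000"`, while a microsecond value of 123456 (123456000 ns) with `"%3f"` is truncated to `"123"`.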
3 changes: 2 additions & 1 deletion cpp/include/cudf/strings/detail/converters.hpp
@@ -100,12 +100,13 @@ std::unique_ptr<cudf::column> to_timestamps(strings_column_view const& strings,

/**
* @copydoc from_timestamps(strings_column_view const&,std::string
* const&,rmm::mr::device_memory_resource*)
* const&,strings_column_view const&,rmm::mr::device_memory_resource*)
*
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<column> from_timestamps(column_view const& timestamps,
std::string const& format,
strings_column_view const& names,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

7 changes: 6 additions & 1 deletion cpp/src/io/csv/writer_impl.cu
@@ -231,7 +231,12 @@ struct column_to_strings_fn {
format = "\"" + format + "\"";
}

return cudf::strings::detail::from_timestamps(column, format, stream_, mr_);
return cudf::strings::detail::from_timestamps(
column,
format,
strings_column_view(column_view{data_type{type_id::STRING}, 0, nullptr}),
stream_,
mr_);
}

template <typename column_type>