Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support additional format specifiers in from_timestamps #9047

Merged
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
84 changes: 75 additions & 9 deletions cpp/include/cudf/strings/convert/convert_datetime.hpp
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -18,6 +18,9 @@
#include <cudf/column/column.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <string>
#include <vector>

namespace cudf {
namespace strings {
/**
Expand Down Expand Up @@ -135,33 +138,96 @@ std::unique_ptr<column> is_timestamp(
* | \%z | Always outputs "+0000" |
* | \%Z | Always outputs "UTC" |
* | \%j | Day of the year: 001-366 |
* | \%p | Only 'AM' or 'PM' |
* | \%u | ISO weekday where Monday is 1 and Sunday is 7 |
* | \%w | Weekday where Sunday is 0 and Saturday is 6 |
* | \%U | Week of the year with Sunday as the first day: 00-53 |
* | \%W | Week of the year with Monday as the first day: 00-53 |
* | \%V | Week of the year per ISO-8601 format: 01-53 |
* | \%G | Year based on the ISO-8601 weeks: 0000-9999 |
* | \%p | AM/PM from `timestamp_names::am_str/pm_str` |
* | \%a | Weekday abbreviation from the `names` parameter |
* | \%A | Weekday from the `names` parameter |
* | \%b | Month name abbreviation from the `names` parameter |
* | \%B | Month name from the `names` parameter |
*
* Additional descriptions can be found here:
* https://en.cppreference.com/w/cpp/chrono/system_clock/formatter
*
* No checking is done for invalid formats or invalid timestamp values.
* All timestamps values are formatted to UTC.
*
* Any null input entry will result in a corresponding null entry in the output column.
*
* The time units of the input column do not influence the number of digits written by
* the "%f" specifier.
* The "%f" supports a precision value to write out numeric digits for the subsecond value.
* Specify the precision with a single integer value (1-9) between the "%" and the "f" as follows:
* use "%3f" for milliseconds, "%6f" for microseconds and "%9f" for nanoseconds.
* If the precision is higher than the units, then zeroes are padded to the right of
* the subsecond value.
* If the precision is lower than the units, the subsecond value may be truncated.
* the "%f" specifier. The "%f" supports a precision value to write out numeric digits
* for the subsecond value. Specify the precision with a single integer value (1-9)
* between the "%" and the "f" as follows: use "%3f" for milliseconds, use "%6f" for
* microseconds and use "%9f" for nanoseconds. If the precision is higher than the
* units, then zeroes are padded to the right of the subsecond value. If the precision
* is lower than the units, the subsecond value may be truncated.
*
* If the "%a", "%A", "%b", "%B" specifiers are included in the format, the caller
* should provide the format names in the `names` strings column using the following
* as a guide:
*
* @code{.pseudo}
* ["AM", "PM", // specify the AM/PM strings
* "Sunday", "Monday", ..., "Saturday", // Weekday full names
* "Sun", "Mon", ..., "Sat", // Weekday abbreviated names
* "January", "February", ..., "December", // Month full names
* "Jan", "Feb", ..., "Dec"] // Month abbreviated names
* @endcode
*
* The result is undefined if the format names are not provided for these specifiers.
*
* These format names can be retrieved for specific locales using the `nl_langinfo`
* functions from C++ `clocale` (std) library or the Python `locale` library.
*
* The following code is an example of retrieving these strings from the locale
* using c++ std functions:
*
* @code{.cpp}
* #include <clocale>
* #include <langinfo.h>
*
* // note: install language pack on Ubuntu using 'apt-get install language-pack-de'
* {
* // set to a German language locale for date settings
* std::setlocale(LC_TIME, "de_DE.UTF-8");
*
* std::vector<std::string> names({nl_langinfo(AM_STR), nl_langinfo(PM_STR),
* nl_langinfo(DAY_1), nl_langinfo(DAY_2), nl_langinfo(DAY_3), nl_langinfo(DAY_4),
* nl_langinfo(DAY_5), nl_langinfo(DAY_6), nl_langinfo(DAY_7),
* nl_langinfo(ABDAY_1), nl_langinfo(ABDAY_2), nl_langinfo(ABDAY_3), nl_langinfo(ABDAY_4),
* nl_langinfo(ABDAY_5), nl_langinfo(ABDAY_6), nl_langinfo(ABDAY_7),
* nl_langinfo(MON_1), nl_langinfo(MON_2), nl_langinfo(MON_3), nl_langinfo(MON_4),
* nl_langinfo(MON_5), nl_langinfo(MON_6), nl_langinfo(MON_7), nl_langinfo(MON_8),
* nl_langinfo(MON_9), nl_langinfo(MON_10), nl_langinfo(MON_11), nl_langinfo(MON_12),
* nl_langinfo(ABMON_1), nl_langinfo(ABMON_2), nl_langinfo(ABMON_3), nl_langinfo(ABMON_4),
* nl_langinfo(ABMON_5), nl_langinfo(ABMON_6), nl_langinfo(ABMON_7), nl_langinfo(ABMON_8),
* nl_langinfo(ABMON_9), nl_langinfo(ABMON_10), nl_langinfo(ABMON_11), nl_langinfo(ABMON_12)});
*
* std::setlocale(LC_TIME,""); // reset to default locale
* }
* @endcode
*
* @throw cudf::logic_error if `timestamps` column parameter is not a timestamp type.
* @throw cudf::logic_error if the `format` string is empty
* @throw cudf::logic_error if `names.size()` is an invalid size. Must be 0 or 40 strings.
*
* @param timestamps Timestamp values to convert.
* @param format The string specifying output format.
* Default format is "%Y-%m-%dT%H:%M:%SZ".
* @param names The string names to use for weekdays ("%a", "%A") and months ("%b", "%B")
* Default is an empty `strings_column_view`
davidwendt marked this conversation as resolved.
Show resolved Hide resolved
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New strings column with formatted timestamps.
*/
std::unique_ptr<column> from_timestamps(
column_view const& timestamps,
std::string const& format = "%Y-%m-%dT%H:%M:%SZ",
strings_column_view const& names = strings_column_view(column_view{
data_type{type_id::STRING}, 0, nullptr}),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of doxygen group
Expand Down
3 changes: 2 additions & 1 deletion cpp/include/cudf/strings/detail/converters.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -100,12 +100,13 @@ std::unique_ptr<cudf::column> to_timestamps(strings_column_view const& strings,

/**
* @copydoc from_timestamps(strings_column_view const&,std::string
* const&,rmm::mr::device_memory_resource*)
* const&,strings_column_view const&,rmm::mr::device_memory_resource*)
*
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<column> from_timestamps(column_view const& timestamps,
std::string const& format,
strings_column_view const& names,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

Expand Down
7 changes: 6 additions & 1 deletion cpp/src/io/csv/writer_impl.cu
Original file line number Diff line number Diff line change
Expand Up @@ -231,7 +231,12 @@ struct column_to_strings_fn {
format = "\"" + format + "\"";
}

return cudf::strings::detail::from_timestamps(column, format, stream_, mr_);
return cudf::strings::detail::from_timestamps(
column,
format,
strings_column_view(column_view{data_type{type_id::STRING}, 0, nullptr}),
stream_,
mr_);
}

template <typename column_type>
Expand Down
Loading