Make timezone table independent from ORC #12805

Merged
merged 48 commits into from
Mar 22, 2023
Changes from 46 commits
Commits
c998647
remove gmt_offset
vuule Feb 16, 2023
623d25c
minor clean up
vuule Feb 16, 2023
c5518ef
another one
vuule Feb 16, 2023
979ea3f
return table!
vuule Feb 17, 2023
328db73
create tz table content properly
vuule Feb 17, 2023
640abd6
take timestamp_s in get_gmt_offset
vuule Feb 17, 2023
6e30578
return duration
vuule Feb 17, 2023
9ae2a26
use chrono types in ORC
vuule Feb 18, 2023
76d0ae1
add path param
vuule Feb 18, 2023
2f09f59
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Feb 18, 2023
449648d
style
vuule Feb 18, 2023
f92f1e1
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Feb 28, 2023
f05cc01
move declarations to /include
vuule Feb 28, 2023
7c5e32c
drop cont&
vuule Feb 28, 2023
ebca7ab
more param clean up
vuule Feb 28, 2023
dcb7179
lil fix
vuule Feb 28, 2023
e3a656a
style
vuule Feb 28, 2023
df181fa
Merge branch 'branch-23.04' into timezone-refactor
vuule Feb 28, 2023
b6ec20c
lambda
vuule Feb 28, 2023
df54ee2
Merge branch 'timezone-refactor' of https://github.com/vuule/cudf int…
vuule Feb 28, 2023
2895a29
Merge branch 'branch-23.04' into timezone-refactor
vuule Mar 4, 2023
3fcde67
Merge branch 'branch-23.04' into timezone-refactor
vuule Mar 6, 2023
9611bda
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 6, 2023
49ffb2c
move out of detail
vuule Mar 6, 2023
104a32c
Merge branch 'timezone-refactor' of https://github.com/vuule/cudf int…
vuule Mar 6, 2023
dbfc69d
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 6, 2023
3d86058
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 6, 2023
b90dd7f
fix includes
vuule Mar 6, 2023
e55b950
style
vuule Mar 6, 2023
b13457a
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 9, 2023
38efd05
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 13, 2023
bb172e4
improve naming
vuule Mar 13, 2023
1285488
typo
vuule Mar 13, 2023
2216ccd
moar clarifications!
vuule Mar 13, 2023
9b517b6
move constant to common header
vuule Mar 13, 2023
675b9d7
yaml sort
vuule Mar 13, 2023
3cf32ee
use UT instead of GMT
vuule Mar 13, 2023
3b52745
fix
vuule Mar 13, 2023
d326b69
naming
vuule Mar 14, 2023
3e28bdf
change example
vuule Mar 14, 2023
9050605
clarify obscure calc
vuule Mar 14, 2023
af02c84
separate detail API
vuule Mar 14, 2023
b0a7b10
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 20, 2023
4407b4f
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 21, 2023
bd9943f
move timezone.cpp out of ORC
vuule Mar 21, 2023
69dd660
make_timezone_transition_table comment
vuule Mar 21, 2023
eb05227
Merge branch 'branch-23.04' into timezone-refactor
vuule Mar 21, 2023
a5709e1
type fix
vuule Mar 21, 2023
3 changes: 3 additions & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -92,6 +92,7 @@ outputs:
- test -f $PREFIX/include/cudf/concatenate.hpp
- test -f $PREFIX/include/cudf/copying.hpp
- test -f $PREFIX/include/cudf/datetime.hpp
- test -f $PREFIX/include/cudf/timezone.hpp
- test -f $PREFIX/include/cudf/detail/aggregation/aggregation.hpp
- test -f $PREFIX/include/cudf/detail/aggregation/result_cache.hpp
- test -f $PREFIX/include/cudf/detail/binaryop.hpp
@@ -128,6 +129,8 @@ outputs:
- test -f $PREFIX/include/cudf/detail/stream_compaction.hpp
- test -f $PREFIX/include/cudf/detail/structs/utilities.hpp
- test -f $PREFIX/include/cudf/detail/tdigest/tdigest.hpp
- test -f $PREFIX/include/cudf/detail/timezone.cuh
- test -f $PREFIX/include/cudf/detail/timezone.hpp
- test -f $PREFIX/include/cudf/detail/transform.hpp
- test -f $PREFIX/include/cudf/detail/transpose.hpp
- test -f $PREFIX/include/cudf/detail/unary.hpp
2 changes: 1 addition & 1 deletion cpp/CMakeLists.txt
@@ -369,7 +369,7 @@ add_library(
src/io/orc/stripe_data.cu
src/io/orc/stripe_enc.cu
src/io/orc/stripe_init.cu
src/io/orc/timezone.cpp
src/datetime/timezone.cpp
src/io/orc/writer_impl.cu
src/io/parquet/compact_protocol_reader.cpp
src/io/parquet/compact_protocol_writer.cpp
79 changes: 79 additions & 0 deletions cpp/include/cudf/detail/timezone.cuh
@@ -0,0 +1,79 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/table/table_device_view.cuh>
#include <cudf/timezone.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>

#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>

namespace cudf::detail {

/**
* @brief Returns the UT offset for a given date and given timezone table.
*
* @param tz_table Timezone transition table: column 0 holds the transition times and column 1 the
* time offsets; the trailing `solar_cycle_entry_count` rows are used for all times beyond the
* range covered by the TZif file
* @param ts ORC timestamp
*
* @return offset from UT, in seconds
*/
inline __device__ duration_s get_ut_offset(table_device_view tz_table, timestamp_s ts)
{
if (tz_table.num_rows() == 0) { return duration_s{0}; }

cudf::device_span<timestamp_s const> transition_times(tz_table.column(0).head<timestamp_s>(),
static_cast<size_t>(tz_table.num_rows()));

auto const ts_ttime_it = [&]() {
auto last_less_equal = [](auto begin, auto end, auto value) {
auto const first_larger = thrust::upper_bound(thrust::seq, begin, end, value);
// Return start of the range if all elements are larger than the value
if (first_larger == begin) return begin;
// Element before the first larger element is the last one less or equal
return first_larger - 1;
};

auto const file_entry_end =
transition_times.begin() + (transition_times.size() - solar_cycle_entry_count);

if (ts <= *(file_entry_end - 1)) {
// Search the file entries if the timestamp is in range
return last_less_equal(transition_times.begin(), file_entry_end, ts);
} else {
auto project_to_cycle = [](timestamp_s ts) {
// Years divisible by four are leap years
// Exceptions are years divisible by 100, but not divisible by 400
static constexpr int32_t num_leap_years_in_cycle =
solar_cycle_years / 4 - (solar_cycle_years / 100 - solar_cycle_years / 400);
static constexpr duration_s cycle_s = cuda::std::chrono::duration_cast<duration_s>(
duration_D{365 * solar_cycle_years + num_leap_years_in_cycle});
return timestamp_s{(ts.time_since_epoch() + cycle_s) % cycle_s};
};
// Search the 400-year cycle if outside of the file entries range
return last_less_equal(file_entry_end, transition_times.end(), project_to_cycle(ts));
}
}();

return tz_table.column(1).element<duration_s>(ts_ttime_it - transition_times.begin());
}

} // namespace cudf::detail
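
The core of `get_ut_offset` is the "last element less than or equal to the timestamp" search. A minimal host-side sketch of that search follows, using `std::upper_bound` in place of `thrust::upper_bound`. The `toy_tz_table` struct and the `toy_ut_offset` name are hypothetical stand-ins for the two-column device table, and the sketch leaves out the projection into the 400-year cycle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for the two-column transition table:
// times[i] is a transition time in seconds since the epoch (sorted ascending),
// offsets[i] is the UT offset in effect from times[i] onwards.
struct toy_tz_table {
  std::vector<int64_t> times;
  std::vector<int32_t> offsets;
};

// Index of the last element that is <= value; 0 if every element is larger.
inline std::size_t last_less_equal(std::vector<int64_t> const& sorted, int64_t value)
{
  auto const first_larger = std::upper_bound(sorted.begin(), sorted.end(), value);
  // Fall back to the start of the range if all elements are larger than the value
  if (first_larger == sorted.begin()) { return 0; }
  // The element before the first larger one is the last one less or equal
  return static_cast<std::size_t>(first_larger - sorted.begin()) - 1;
}

// Offset in effect at `ts`; an empty table is treated as UTC (offset zero),
// mirroring the early return in the device function.
inline int32_t toy_ut_offset(toy_tz_table const& tz, int64_t ts)
{
  assert(tz.times.size() == tz.offsets.size());
  if (tz.times.empty()) { return 0; }
  return tz.offsets[last_less_equal(tz.times, ts)];
}
```

A timestamp that precedes every transition maps to the first row, which is why the table builder stores an "ancient rule" entry at index 0.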
36 changes: 36 additions & 0 deletions cpp/include/cudf/detail/timezone.hpp
@@ -0,0 +1,36 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/timezone.hpp>

#include <rmm/cuda_stream_view.hpp>

namespace cudf::detail {

/**
* @copydoc cudf::make_timezone_transition_table(std::optional<std::string_view>, std::string_view,
* rmm::mr::device_memory_resource*)
*
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<table> make_timezone_transition_table(
std::optional<std::string_view> tzif_dir,
std::string_view timezone_name,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace cudf::detail
54 changes: 54 additions & 0 deletions cpp/include/cudf/timezone.hpp
@@ -0,0 +1,54 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <rmm/mr/device/per_device_resource.hpp>

#include <memory>
#include <optional>
#include <string>

namespace cudf {
class table;

// Cycle in which the time offsets repeat in Gregorian calendar
static constexpr int32_t solar_cycle_years = 400;
// Number of future entries in the timezone transition table:
// Two entries per year, over the length of the Gregorian calendar's solar cycle
static constexpr uint32_t solar_cycle_entry_count = 2 * solar_cycle_years;

/**
* @brief Creates a transition table to convert ORC timestamps to UTC.
*
* Uses system's TZif files. Assumes little-endian platform when parsing these files.
* The transition table starts with the entries from the TZif file. For timestamps after the file's
* last transition, the table includes entries that form a `solar_cycle_years`-year cycle (future
* entries). This portion of the table has `solar_cycle_entry_count` elements, as it assumes two
* transitions per year due to Daylight Saving Time. If the timezone does not have DST, the table will
* still include the future entries, which will all have the same offset.
*
* @param tzif_dir The directory where the TZif files are located; defaults to the system directory
* (`/usr/share/zoneinfo/`) when not provided
* @param timezone_name standard timezone name (for example, "America/Los_Angeles")
* @param mr Device memory resource used to allocate the returned table's device memory.
*
* @return The transition table for the given timezone
*/
std::unique_ptr<table> make_timezone_transition_table(
std::optional<std::string_view> tzif_dir,
std::string_view timezone_name,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace cudf
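
The arithmetic behind `solar_cycle_years` can be checked on the host. The sketch below assumes the same leap-year rule that `project_to_cycle` uses in `timezone.cuh`; the names `cycle_years`, `days_in_cycle`, and `seconds_in_cycle` are illustrative and not part of the PR. A 400-year Gregorian cycle contains 97 leap years and 146097 days, which is a whole number of weeks, presumably the reason weekday-based DST rules repeat with this period.

```cpp
#include <cassert>
#include <cstdint>

constexpr int32_t cycle_years = 400;  // mirrors solar_cycle_years

// Years divisible by four are leap years, except centuries not divisible by 400
constexpr int32_t leap_years_in_cycle =
  cycle_years / 4 - (cycle_years / 100 - cycle_years / 400);
constexpr int64_t days_in_cycle    = 365LL * cycle_years + leap_years_in_cycle;
constexpr int64_t seconds_in_cycle = days_in_cycle * 24 * 60 * 60;

static_assert(leap_years_in_cycle == 97, "97 leap years per 400-year cycle");
static_assert(days_in_cycle == 146097, "146097 days per 400-year cycle");
static_assert(days_in_cycle % 7 == 0, "the cycle is a whole number of weeks");

// Map a timestamp (in seconds) into [0, seconds_in_cycle).
// Valid for inputs no smaller than -seconds_in_cycle.
inline int64_t project_to_cycle(int64_t ts_seconds)
{
  assert(ts_seconds >= -seconds_in_cycle);
  return (ts_seconds + seconds_in_cycle) % seconds_in_cycle;
}
```

With two transitions per year, the cycle portion of the table holds `2 * 400 = 800` rows, matching `solar_cycle_entry_count`.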
108 changes: 72 additions & 36 deletions cpp/src/io/orc/timezone.cpp → cpp/src/datetime/timezone.cpp
@@ -13,24 +13,24 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "timezone.cuh"
#include <cudf/detail/timezone.hpp>

#include <cudf/column/column_factories.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/detail/utilities/vector_factories.hpp>

#include <rmm/mr/device/per_device_resource.hpp>
#include <cudf/table/table.hpp>

#include <algorithm>
#include <filesystem>
#include <fstream>

namespace cudf {
namespace io {

namespace {

constexpr uint32_t tzif_magic = ('T' << 0) | ('Z' << 8) | ('i' << 16) | ('f' << 24);
std::string const tzif_system_directory = "/usr/share/zoneinfo/";

// Seconds from Jan 1st, 1970 to Jan 1st, 2015
constexpr int64_t orc_utc_offset = 1420070400;

#pragma pack(push, 1)
/**
* @brief 32-bit TZif header
@@ -129,12 +129,13 @@ struct timezone_file {
"Number of transition times is larger than the file size.");
}

timezone_file(std::string const& timezone_name)
timezone_file(std::optional<std::string_view> tzif_dir, std::string_view timezone_name)
{
using std::ios_base;

// Open the input file
std::string const tz_filename = tzif_system_directory + timezone_name;
auto const tz_filename =
std::filesystem::path{tzif_dir.value_or(tzif_system_directory)} / timezone_name;
std::ifstream fin;
fin.open(tz_filename, ios_base::in | ios_base::binary | ios_base::ate);
CUDF_EXPECTS(fin, "Failed to open the timezone file.");
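
The new constructor resolves the TZif file path from an optional directory plus the timezone name. A standalone sketch of that resolution, assuming C++17 and a POSIX-style path layout, could look like this; `resolve_tzif_path` and `default_tzif_dir` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <string>
#include <string_view>

// Same default location as tzif_system_directory in the diff
std::string const default_tzif_dir = "/usr/share/zoneinfo/";

// Hypothetical helper: join the (optional) TZif directory with the timezone name.
// std::filesystem::path::operator/ inserts a separator only when one is needed,
// so the directory may be given with or without a trailing slash.
inline std::filesystem::path resolve_tzif_path(std::optional<std::string_view> tzif_dir,
                                               std::string_view timezone_name)
{
  assert(!timezone_name.empty());
  return std::filesystem::path{tzif_dir.value_or(default_tzif_dir)} / timezone_name;
}
```

One caveat of `operator/`: if the right-hand side is an absolute path, it replaces the left-hand side, so an absolute `timezone_name` would bypass the directory.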
@@ -375,45 +376,62 @@ static int64_t get_transition_time(dst_transition_s const& trans, int year)
return trans.time + cuda::std::chrono::duration_cast<duration_s>(duration_D{day}).count();
}

timezone_table build_timezone_transition_table(std::string const& timezone_name,
rmm::cuda_stream_view stream)
} // namespace

std::unique_ptr<table> make_timezone_transition_table(std::optional<std::string_view> tzif_dir,
std::string_view timezone_name,
rmm::mr::device_memory_resource* mr)
{
CUDF_FUNC_RANGE();
return detail::make_timezone_transition_table(
tzif_dir, timezone_name, cudf::get_default_stream(), mr);
}

namespace detail {

std::unique_ptr<table> make_timezone_transition_table(std::optional<std::string_view> tzif_dir,
std::string_view timezone_name,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
if (timezone_name == "UTC" || timezone_name.empty()) {
// Return an empty table for UTC
return {};
return std::make_unique<cudf::table>();
}

timezone_file const tzf(timezone_name);
timezone_file const tzf(tzif_dir, timezone_name);

std::vector<int64_t> ttimes(1);
std::vector<int32_t> offsets(1);
std::vector<timestamp_s::rep> transition_times(1);
std::vector<duration_s::rep> offsets(1);
// One ancient rule entry, one per TZ file entry, 2 entries per year in the future cycle
ttimes.reserve(1 + tzf.timecnt() + cycle_entry_cnt);
offsets.reserve(1 + tzf.timecnt() + cycle_entry_cnt);
transition_times.reserve(1 + tzf.timecnt() + solar_cycle_entry_count);
offsets.reserve(1 + tzf.timecnt() + solar_cycle_entry_count);
size_t earliest_std_idx = 0;
for (size_t t = 0; t < tzf.timecnt(); t++) {
auto const ttime = tzf.transition_times[t];
auto const idx = tzf.ttime_idx[t];
CUDF_EXPECTS(idx < tzf.typecnt(), "Out-of-range type index");
auto const utcoff = tzf.ttype[idx].utcoff;
ttimes.push_back(ttime);
transition_times.push_back(ttime);
offsets.push_back(utcoff);
if (!earliest_std_idx && !tzf.ttype[idx].isdst) { earliest_std_idx = ttimes.size() - 1; }
if (!earliest_std_idx && !tzf.ttype[idx].isdst) {
earliest_std_idx = transition_times.size() - 1;
}
}

if (tzf.timecnt() != 0) {
if (!earliest_std_idx) { earliest_std_idx = 1; }
ttimes[0] = ttimes[earliest_std_idx];
offsets[0] = offsets[earliest_std_idx];
transition_times[0] = transition_times[earliest_std_idx];
offsets[0] = offsets[earliest_std_idx];
} else {
if (tzf.typecnt() == 0 || tzf.ttype[0].utcoff == 0) {
// No transitions, offset is zero; Table would be a no-op.
// Return an empty table to speed up parsing.
return {};
return std::make_unique<cudf::table>();
}
// No transitions to use for the time/offset - use the first offset and apply to all timestamps
ttimes[0] = std::numeric_limits<int64_t>::max();
offsets[0] = tzf.ttype[0].utcoff;
transition_times[0] = std::numeric_limits<int64_t>::max();
offsets[0] = tzf.ttype[0].utcoff;
}

// Generate entries for times after the last transition
@@ -442,19 +460,19 @@ timezone_table build_timezone_transition_table(std::string const& timezone_name,

// Add entries to fill the transition cycle
int64_t year_timestamp = 0;
for (int32_t year = 1970; year < 1970 + cycle_years; ++year) {
for (int32_t year = 1970; year < 1970 + solar_cycle_years; ++year) {
auto const dst_start_time = get_transition_time(dst_start, year);
auto const dst_end_time = get_transition_time(dst_end, year);

// Two entries per year, since there are two transitions
ttimes.push_back(year_timestamp + dst_start_time - future_std_offset);
transition_times.push_back(year_timestamp + dst_start_time - future_std_offset);
offsets.push_back(future_dst_offset);
ttimes.push_back(year_timestamp + dst_end_time - future_dst_offset);
transition_times.push_back(year_timestamp + dst_end_time - future_dst_offset);
offsets.push_back(future_std_offset);

// Swap the newly added transitions if in descending order
if (ttimes.rbegin()[1] > ttimes.rbegin()[0]) {
std::swap(ttimes.rbegin()[0], ttimes.rbegin()[1]);
if (transition_times.rbegin()[1] > transition_times.rbegin()[0]) {
std::swap(transition_times.rbegin()[0], transition_times.rbegin()[1]);
std::swap(offsets.rbegin()[0], offsets.rbegin()[1]);
}

@@ -463,15 +481,33 @@ timezone_table build_timezone_transition_table(std::string const& timezone_name,
.count();
}

rmm::device_uvector<int64_t> d_ttimes =
cudf::detail::make_device_uvector_async(ttimes, stream, rmm::mr::get_current_device_resource());
rmm::device_uvector<int32_t> d_offsets = cudf::detail::make_device_uvector_async(
offsets, stream, rmm::mr::get_current_device_resource());
auto const gmt_offset = get_gmt_offset(ttimes, offsets, orc_utc_offset);
CUDF_EXPECTS(transition_times.size() == offsets.size(),
"Error reading TZif file for timezone " + std::string{timezone_name});

std::vector<timestamp_s> ttimes_typed;
ttimes_typed.reserve(transition_times.size());
std::transform(transition_times.cbegin(),
transition_times.cend(),
std::back_inserter(ttimes_typed),
[](auto ts) { return timestamp_s{duration_s{ts}}; });
std::vector<duration_s> offsets_typed;
offsets_typed.reserve(offsets.size());
std::transform(offsets.cbegin(), offsets.cend(), std::back_inserter(offsets_typed), [](auto ts) {
return duration_s{ts};
});

auto d_ttimes = cudf::detail::make_device_uvector_async(ttimes_typed, stream, mr);
auto d_offsets = cudf::detail::make_device_uvector_async(offsets_typed, stream, mr);

std::vector<std::unique_ptr<column>> tz_table_columns;
tz_table_columns.emplace_back(std::make_unique<cudf::column>(std::move(d_ttimes)));
tz_table_columns.emplace_back(std::make_unique<cudf::column>(std::move(d_offsets)));

// Need to finish copies before transition_times and offsets go out of scope
stream.synchronize();

return {gmt_offset, std::move(d_ttimes), std::move(d_offsets)};
return std::make_unique<cudf::table>(std::move(tz_table_columns));
}

} // namespace io
} // namespace detail
} // namespace cudf
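
Before the upload, the raw integer vectors are converted into typed chrono values. The host-only sketch below shows the same `std::transform` pattern with `std::chrono` types standing in for `timestamp_s` and `duration_s`; the aliases and function names are assumptions made for illustration, not cudf API.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <iterator>
#include <vector>

// Stand-ins for cudf's duration_s / timestamp_s (assumed to be second-resolution chrono types)
using seconds_d  = std::chrono::seconds;
using seconds_tp = std::chrono::time_point<std::chrono::system_clock, seconds_d>;

// Wrap raw second counts into typed time points
inline std::vector<seconds_tp> to_timestamps(std::vector<int64_t> const& raw)
{
  std::vector<seconds_tp> typed;
  typed.reserve(raw.size());
  std::transform(raw.cbegin(), raw.cend(), std::back_inserter(typed), [](int64_t s) {
    return seconds_tp{seconds_d{s}};
  });
  return typed;
}

// Wrap raw offsets (in seconds) into typed durations
inline std::vector<seconds_d> to_durations(std::vector<int64_t> const& raw)
{
  std::vector<seconds_d> typed;
  typed.reserve(raw.size());
  std::transform(raw.cbegin(), raw.cend(), std::back_inserter(typed), [](int64_t s) {
    return seconds_d{s};
  });
  return typed;
}

// Both columns must have the same number of rows, as the CUDF_EXPECTS check in the diff requires
inline bool same_row_count(std::vector<seconds_tp> const& t, std::vector<seconds_d> const& o)
{
  return t.size() == o.size();
}
```

Typing the values on the host means the device columns carry timestamp and duration element types directly, so the lookup code can read them without casts.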
2 changes: 2 additions & 0 deletions cpp/src/io/orc/orc.hpp
@@ -38,6 +38,8 @@ namespace io {
namespace orc {

static constexpr uint32_t block_header_size = 3;
// Seconds from January 1st, 1970 to January 1st, 2015
static constexpr int64_t orc_utc_epoch = 1420070400;

struct PostScript {
uint64_t footerLength = 0; // the length of the footer section in bytes
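
The value of `orc_utc_epoch` can be cross-checked with plain calendar arithmetic. The sketch below counts the days between January 1st, 1970 and January 1st, 2015 (ignoring leap seconds, as Unix time does); `is_leap_year` and `seconds_between_years` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Gregorian leap-year rule
constexpr bool is_leap_year(int32_t year)
{
  return (year % 4 == 0) && (year % 100 != 0 || year % 400 == 0);
}

// Seconds between January 1st of `from_year` and January 1st of `to_year`
inline int64_t seconds_between_years(int32_t from_year, int32_t to_year)
{
  assert(from_year <= to_year);
  int64_t days = 0;
  for (int32_t y = from_year; y < to_year; ++y) {
    days += is_leap_year(y) ? 366 : 365;
  }
  return days * 24 * 60 * 60;
}
```

The 45 years from 1970 through 2014 include 11 leap years, giving 16436 days, or 1420070400 seconds.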