Make timezone table independent from ORC #12805

Merged
merged 48 commits into from
Mar 22, 2023
Changes from 42 commits
Commits
c998647
remove gmt_offset
vuule Feb 16, 2023
623d25c
minor clean up
vuule Feb 16, 2023
c5518ef
another one
vuule Feb 16, 2023
979ea3f
return table!
vuule Feb 17, 2023
328db73
create tz table content properly
vuule Feb 17, 2023
640abd6
take timestamp_s in get_gmt_offset
vuule Feb 17, 2023
6e30578
return duration
vuule Feb 17, 2023
9ae2a26
use chrono types in ORC
vuule Feb 18, 2023
76d0ae1
add path param
vuule Feb 18, 2023
2f09f59
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Feb 18, 2023
449648d
style
vuule Feb 18, 2023
f92f1e1
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Feb 28, 2023
f05cc01
move declarations to /include
vuule Feb 28, 2023
7c5e32c
drop cont&
vuule Feb 28, 2023
ebca7ab
more param clean up
vuule Feb 28, 2023
dcb7179
lil fix
vuule Feb 28, 2023
e3a656a
style
vuule Feb 28, 2023
df181fa
Merge branch 'branch-23.04' into timezone-refactor
vuule Feb 28, 2023
b6ec20c
lambda
vuule Feb 28, 2023
df54ee2
Merge branch 'timezone-refactor' of https://github.com/vuule/cudf int…
vuule Feb 28, 2023
2895a29
Merge branch 'branch-23.04' into timezone-refactor
vuule Mar 4, 2023
3fcde67
Merge branch 'branch-23.04' into timezone-refactor
vuule Mar 6, 2023
9611bda
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 6, 2023
49ffb2c
move out of detail
vuule Mar 6, 2023
104a32c
Merge branch 'timezone-refactor' of https://github.com/vuule/cudf int…
vuule Mar 6, 2023
dbfc69d
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 6, 2023
3d86058
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 6, 2023
b90dd7f
fix includes
vuule Mar 6, 2023
e55b950
style
vuule Mar 6, 2023
b13457a
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 9, 2023
38efd05
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 13, 2023
bb172e4
improve naming
vuule Mar 13, 2023
1285488
typo
vuule Mar 13, 2023
2216ccd
moar clarifications!
vuule Mar 13, 2023
9b517b6
move constant to common header
vuule Mar 13, 2023
675b9d7
yaml sort
vuule Mar 13, 2023
3cf32ee
use UT instead of GMT
vuule Mar 13, 2023
3b52745
fix
vuule Mar 13, 2023
d326b69
naming
vuule Mar 14, 2023
3e28bdf
change example
vuule Mar 14, 2023
9050605
clarify obscure calc
vuule Mar 14, 2023
af02c84
separate detail API
vuule Mar 14, 2023
b0a7b10
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 20, 2023
4407b4f
Merge branch 'branch-23.04' of https://github.com/rapidsai/cudf into …
vuule Mar 21, 2023
bd9943f
move timezone.cpp out of ORC
vuule Mar 21, 2023
69dd660
make_timezone_transition_table comment
vuule Mar 21, 2023
eb05227
Merge branch 'branch-23.04' into timezone-refactor
vuule Mar 21, 2023
a5709e1
type fix
vuule Mar 21, 2023
3 changes: 3 additions & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -86,6 +86,7 @@ outputs:
     - test -f $PREFIX/include/cudf/concatenate.hpp
     - test -f $PREFIX/include/cudf/copying.hpp
     - test -f $PREFIX/include/cudf/datetime.hpp
+    - test -f $PREFIX/include/cudf/timezone.hpp
     - test -f $PREFIX/include/cudf/detail/aggregation/aggregation.hpp
     - test -f $PREFIX/include/cudf/detail/aggregation/result_cache.hpp
     - test -f $PREFIX/include/cudf/detail/binaryop.hpp
@@ -122,6 +123,8 @@ outputs:
     - test -f $PREFIX/include/cudf/detail/stream_compaction.hpp
     - test -f $PREFIX/include/cudf/detail/structs/utilities.hpp
     - test -f $PREFIX/include/cudf/detail/tdigest/tdigest.hpp
+    - test -f $PREFIX/include/cudf/detail/timezone.cuh
+    - test -f $PREFIX/include/cudf/detail/timezone.hpp
     - test -f $PREFIX/include/cudf/detail/transform.hpp
     - test -f $PREFIX/include/cudf/detail/transpose.hpp
     - test -f $PREFIX/include/cudf/detail/unary.hpp
79 changes: 79 additions & 0 deletions cpp/include/cudf/detail/timezone.cuh
@@ -0,0 +1,79 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/table/table_device_view.cuh>
#include <cudf/timezone.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>

#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>

namespace cudf::detail {

/**
 * @brief Returns the UT offset for a given date and given timezone table.
 *
 * @param tz_table Timezone transition table. Column 0 holds the transition times and column 1
 * holds the time offsets in the corresponding intervals; in both columns, the trailing
 * `solar_cycle_entry_count` entries are used for all times beyond the ones covered by the TZif
 * file
 * @param ts ORC timestamp
 *
 * @return offset from UT, in seconds
 */
inline __device__ duration_s get_ut_offset(table_device_view tz_table, timestamp_s ts)
{
  if (tz_table.num_rows() == 0) { return duration_s{0}; }

  cudf::device_span<timestamp_s const> transition_times(tz_table.column(0).head<timestamp_s>(),
                                                        static_cast<size_t>(tz_table.num_rows()));

  auto const ts_ttime_it = [&]() {
    auto last_less_equal = [](auto begin, auto end, auto value) {
      auto const first_larger = thrust::upper_bound(thrust::seq, begin, end, value);
      // Return start of the range if all elements are larger than the value
      if (first_larger == begin) return begin;
      // Element before the first larger element is the last one less or equal
      return first_larger - 1;
    };

    auto const file_entry_end =
      transition_times.begin() + (transition_times.size() - solar_cycle_entry_count);

    if (ts <= *(file_entry_end - 1)) {
      // Search the file entries if the timestamp is in range
      return last_less_equal(transition_times.begin(), file_entry_end, ts);
    } else {
      auto project_to_cycle = [](timestamp_s ts) {
        // Years divisible by four are leap years
        // Exceptions are years divisible by 100, but not divisible by 400
        static constexpr int32_t num_leap_years_in_cycle =
          solar_cycle_years / 4 - (solar_cycle_years / 100 - solar_cycle_years / 400);
        static constexpr duration_s cycle_s = cuda::std::chrono::duration_cast<duration_s>(
          duration_D{365 * solar_cycle_years + num_leap_years_in_cycle});
        return timestamp_s{(ts.time_since_epoch() + cycle_s) % cycle_s};
      };
      // Search the 400-year cycle if outside of the file entries range
      return last_less_equal(file_entry_end, transition_times.end(), project_to_cycle(ts));
    }
  }();

  return tz_table.column(1).element<duration_s>(ts_ttime_it - transition_times.begin());
}

} // namespace cudf::detail
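
For readers following the lookup in `get_ut_offset`, here is a minimal host-side sketch of the same idea in standard C++. It is not the cuDF implementation: `tz_table_host`, `ut_offset_host`, and the `cycle_entries` parameter are hypothetical stand-ins for the two table columns and for `solar_cycle_entry_count`, and plain `std::int64_t` seconds replace the chrono types. Like the device code, it assumes the table holds at least one entry read from the TZif file ahead of the cycle entries.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for the two-column transition table.
// All values are seconds; times are relative to the Unix epoch.
struct tz_table_host {
  std::vector<std::int64_t> transition_times;  // plays the role of column 0
  std::vector<std::int64_t> offsets;           // plays the role of column 1
};

// 400 Gregorian years = 365 * 400 + 97 leap days = 146097 days
constexpr std::int64_t cycle_seconds = 146097LL * 86400LL;

// `cycle_entries` plays the role of solar_cycle_entry_count.
// Precondition: a non-empty table has at least one file entry before the cycle entries.
inline std::int64_t ut_offset_host(tz_table_host const& tz,
                                   std::int64_t ts,
                                   std::size_t cycle_entries)
{
  // An empty table means "no timezone adjustment"
  if (tz.transition_times.empty()) { return 0; }

  // Last element <= value, or the start of the range if every element is larger
  auto last_less_equal = [](auto begin, auto end, std::int64_t value) {
    auto const first_larger = std::upper_bound(begin, end, value);
    return first_larger == begin ? begin : first_larger - 1;
  };

  auto const begin = tz.transition_times.cbegin();
  auto const end   = tz.transition_times.cend();
  // Entries before this iterator come from the file; the rest describe the repeating cycle
  auto const file_entry_end = end - static_cast<std::ptrdiff_t>(cycle_entries);

  auto const it = (ts <= *(file_entry_end - 1))
                    ? last_less_equal(begin, file_entry_end, ts)
                    : last_less_equal(file_entry_end, end, (ts + cycle_seconds) % cycle_seconds);

  return tz.offsets[static_cast<std::size_t>(it - begin)];
}
```

Keeping the file entries and the cycle entries in one sorted-by-segment column means a single pair of columns can describe both the recorded history and the repeating future, at the cost of one branch before the binary search.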
36 changes: 36 additions & 0 deletions cpp/include/cudf/detail/timezone.hpp
@@ -0,0 +1,36 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/timezone.hpp>

#include <rmm/cuda_stream_view.hpp>

namespace cudf::detail {

/**
* @copydoc cudf::make_timezone_transition_table(std::optional<std::string_view>, std::string_view,
* rmm::mr::device_memory_resource*)
*
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<table> make_timezone_transition_table(
  std::optional<std::string_view> tzif_dir,
  std::string_view timezone_name,
  rmm::cuda_stream_view stream,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace cudf::detail
49 changes: 49 additions & 0 deletions cpp/include/cudf/timezone.hpp
@@ -0,0 +1,49 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <rmm/mr/device/per_device_resource.hpp>

#include <memory>
#include <optional>
#include <string>

namespace cudf {
class table;

// Cycle in which the time offsets repeat in Gregorian calendar
static constexpr int32_t solar_cycle_years = 400;
// Number of future entries in the timezone transition table:
// Two entries per year, over the length of the Gregorian calendar's solar cycle
static constexpr uint32_t solar_cycle_entry_count = 2 * solar_cycle_years;

/**
* @brief Creates a transition table to convert ORC timestamps to UTC.
*
* Uses system's TZif files. Assumes little-endian platform when parsing these files.
*
* @param tzif_dir The directory where the TZif files are located
* @param timezone_name standard timezone name (for example, "America/Los_Angeles")
* @param mr Device memory resource used to allocate the returned table's device memory.
*
* @return The transition table for the given timezone
*/
std::unique_ptr<table> make_timezone_transition_table(
  std::optional<std::string_view> tzif_dir,
  std::string_view timezone_name,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace cudf
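
The two constants in this header can be sanity-checked with a few lines of standard C++. This is only a sketch of the arithmetic: the constants are re-declared locally, `leap_years_in_cycle`, `days`, `cycle_days`, and `cycle_seconds` are names introduced here for illustration, and `std::chrono` stands in for the `cuda::std::chrono` types used in the PR.

```cpp
#include <chrono>
#include <cstdint>
#include <ratio>

// Local copies of the constants declared in the header
constexpr std::int32_t solar_cycle_years        = 400;
constexpr std::uint32_t solar_cycle_entry_count = 2 * solar_cycle_years;

// Leap years in one cycle: every 4th year, except centuries that are not multiples of 400
constexpr std::int32_t leap_years_in_cycle =
  solar_cycle_years / 4 - (solar_cycle_years / 100 - solar_cycle_years / 400);

// Helper alias (an assumption of this sketch): a day-granularity duration
using days = std::chrono::duration<std::int64_t, std::ratio<86400>>;

constexpr days cycle_days{365 * solar_cycle_years + leap_years_in_cycle};
constexpr std::chrono::seconds cycle_seconds =
  std::chrono::duration_cast<std::chrono::seconds>(cycle_days);

static_assert(leap_years_in_cycle == 97, "100 - (4 - 1) leap years per 400 years");
static_assert(cycle_days.count() == 146097, "days in one 400-year cycle");
static_assert(cycle_seconds.count() == 12'622'780'800LL, "seconds in one 400-year cycle");
static_assert(solar_cycle_entry_count == 800, "two transitions per year");
```

Because the cycle is a whole number of days (and of weeks, since 146097 is divisible by 7), a rule-based transition schedule repeats exactly after 400 years, which is why a fixed block of trailing entries is enough to cover all later timestamps.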
4 changes: 3 additions & 1 deletion cpp/src/io/orc/orc.hpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2022, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2023, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -38,6 +38,8 @@ namespace io {
 namespace orc {

 static constexpr uint32_t block_header_size = 3;
+// Seconds from January 1st, 1970 to January 1st, 2015
+static constexpr int64_t orc_utc_epoch = 1420070400;

 struct PostScript {
   uint64_t footerLength = 0;  // the length of the footer section in bytes
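
The value of `orc_utc_epoch` can be checked independently of the PR. The sketch below counts the days between the two dates with the well-known `days_from_civil` algorithm (due to Howard Hinnant) and multiplies by 86400; the helper and `seconds_per_day` are not part of cuDF and are included only for this check.

```cpp
#include <cstdint>

// Days from 1970-01-01 to the given proleptic Gregorian date (negative for earlier dates).
// Months are shifted so that the year starts in March, which puts the leap day last.
constexpr std::int64_t days_from_civil(std::int64_t y, unsigned m, unsigned d)
{
  y -= m <= 2;
  std::int64_t const era = (y >= 0 ? y : y - 399) / 400;
  auto const yoe         = static_cast<unsigned>(y - era * 400);             // year of era
  unsigned const doy     = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;  // day of year
  unsigned const doe     = yoe * 365 + yoe / 4 - yoe / 100 + doy;            // day of era
  return era * 146097 + static_cast<std::int64_t>(doe) - 719468;
}

constexpr std::int64_t seconds_per_day = 86400;
constexpr std::int64_t orc_utc_epoch   = 1420070400;  // value taken from the diff

static_assert(days_from_civil(1970, 1, 1) == 0, "Unix epoch is day zero");
static_assert(days_from_civil(2015, 1, 1) == 16436, "45 years with 11 leap days");
static_assert(days_from_civil(2015, 1, 1) * seconds_per_day == orc_utc_epoch,
              "constant matches 2015-01-01T00:00:00 UTC");
```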
4 changes: 2 additions & 2 deletions cpp/src/io/orc/orc_gpu.hpp
@@ -16,7 +16,7 @@

 #pragma once

-#include "timezone.cuh"
+#include <cudf/detail/timezone.cuh>

 #include "orc.hpp"

@@ -294,7 +294,7 @@ void DecodeOrcColumnData(ColumnDesc* chunks,
                          uint32_t num_columns,
                          uint32_t num_stripes,
                          size_t first_row,
-                         timezone_table_view tz_table,
+                         table_device_view tz_table,
                          uint32_t num_rowgroups,
                          uint32_t rowidx_stride,
                          size_t level,
17 changes: 9 additions & 8 deletions cpp/src/io/orc/reader_impl.cu
@@ -23,13 +23,13 @@
 #include "orc_gpu.hpp"

 #include "reader_impl.hpp"
-#include "timezone.cuh"

 #include <io/comp/gpuinflate.hpp>
 #include <io/comp/nvcomp_adapter.hpp>
 #include <io/utilities/config_utils.hpp>
 #include <io/utilities/time_utils.cuh>

+#include <cudf/detail/timezone.hpp>
 #include <cudf/detail/utilities/integer_utils.hpp>
 #include <cudf/detail/utilities/vector_factories.hpp>
 #include <cudf/table/table.hpp>
@@ -603,7 +603,7 @@ void scan_null_counts(cudf::detail::hostdevice_2dvector<gpu::ColumnDesc> const&
 void reader::impl::decode_stream_data(cudf::detail::hostdevice_2dvector<gpu::ColumnDesc>& chunks,
                                       size_t num_dicts,
                                       size_t skip_rows,
-                                      timezone_table_view tz_table,
+                                      table_device_view tz_table,
                                       cudf::detail::hostdevice_2dvector<gpu::RowGroup>& row_groups,
                                       size_t row_index_stride,
                                       std::vector<column_buffer>& out_buffers,
@@ -915,22 +915,22 @@ reader::impl::impl(std::vector<std::unique_ptr<datasource>>&& sources,
   decimal128_columns = options.get_decimal128_columns();
 }

-timezone_table reader::impl::compute_timezone_table(
+std::unique_ptr<table> reader::impl::compute_timezone_table(
   const std::vector<cudf::io::orc::metadata::stripe_source_mapping>& selected_stripes,
   rmm::cuda_stream_view stream)
 {
-  if (selected_stripes.empty()) return {};
+  if (selected_stripes.empty()) return std::make_unique<cudf::table>();

   auto const has_timestamp_column = std::any_of(
     selected_columns.levels.cbegin(), selected_columns.levels.cend(), [&](auto& col_lvl) {
       return std::any_of(col_lvl.cbegin(), col_lvl.cend(), [&](auto& col_meta) {
         return _metadata.get_col_type(col_meta.id).kind == TypeKind::TIMESTAMP;
       });
     });
-  if (not has_timestamp_column) return {};
+  if (not has_timestamp_column) return std::make_unique<cudf::table>();

-  return build_timezone_transition_table(selected_stripes[0].stripe_info[0].second->writerTimezone,
-                                         stream);
+  return cudf::detail::make_timezone_transition_table(
+    {}, selected_stripes[0].stripe_info[0].second->writerTimezone, stream);
 }

 table_with_metadata reader::impl::read(size_type skip_rows,
@@ -1238,10 +1238,11 @@ table_with_metadata reader::impl::read(size_type skip_rows,
       }

       if (not is_level_data_empty) {
+        auto const tz_table_dview = table_device_view::create(tz_table->view(), stream);
         decode_stream_data(chunks,
                            num_dict_entries,
                            skip_rows,
-                           tz_table.view(),
+                           *tz_table_dview,
                            row_groups,
                            _metadata.get_row_index_stride(),
                            out_buffers[level],
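
The early-exit check in `compute_timezone_table` is a nested `std::any_of` over the selected columns at every nesting level. A self-contained sketch of that pattern follows; `type_kind`, `column_meta`, `column_levels`, and `has_timestamp_column` are simplified, hypothetical stand-ins for the reader's metadata types, not the ORC reader's real ones.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified stand-ins for the reader's column metadata
enum class type_kind { INT, STRING, TIMESTAMP };
struct column_meta {
  type_kind kind;
};
// One inner vector per nesting level of the selected columns
using column_levels = std::vector<std::vector<column_meta>>;

// True if any selected column, at any nesting level, is a timestamp column.
// When this is false, the caller can skip building the transition table entirely.
inline bool has_timestamp_column(column_levels const& levels)
{
  return std::any_of(levels.cbegin(), levels.cend(), [](auto const& level) {
    return std::any_of(level.cbegin(), level.cend(), [](auto const& col) {
      return col.kind == type_kind::TIMESTAMP;
    });
  });
}
```

Returning an empty table in the "no timestamps" case pairs with the `num_rows() == 0` guard in `get_ut_offset`, so the decode path needs no separate null check.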
6 changes: 3 additions & 3 deletions cpp/src/io/orc/reader_impl.hpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2022, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2023, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -145,7 +145,7 @@ class reader::impl {
   void decode_stream_data(cudf::detail::hostdevice_2dvector<gpu::ColumnDesc>& chunks,
                           size_t num_dicts,
                           size_t skip_rows,
-                          timezone_table_view tz_table,
+                          table_device_view tz_table,
                           cudf::detail::hostdevice_2dvector<gpu::RowGroup>& row_groups,
                           size_t row_index_stride,
                           std::vector<column_buffer>& out_buffers,
@@ -210,7 +210,7 @@ class reader::impl {
    *
    * @return Timezone table with timestamp offsets
    */
-  timezone_table compute_timezone_table(
+  std::unique_ptr<table> compute_timezone_table(
     const std::vector<cudf::io::orc::metadata::stripe_source_mapping>& selected_stripes,
     rmm::cuda_stream_view stream);
