[FEA] cuio: consolidate host decompression #6188

Closed
cwharris opened this issue Sep 9, 2020 · 2 comments
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@cwharris
Contributor

cwharris commented Sep 9, 2020

Host decompression is used by the JSON and CSV readers immediately after the datasource read. Rather than being part of each reader/writer, it could be consolidated into the datasource, or into a decorator of some kind. This would simplify the readers and also decouple decompression from range_offset and range_size.

This might become more straightforward after #6185

Aside from keeping things DRY, abstracting host decompression away from the readers may allow it to be used in more readers.

Examples:

ingest_raw_input(range_offset, range_size);
CUDF_EXPECTS(buffer_ != nullptr, "Ingest failed: input data is null.\n");
decompress_input(stream);

void reader::impl::ingest_raw_input(size_t range_offset, size_t range_size)
{
  size_t map_range_size = 0;
  if (range_size != 0) { map_range_size = range_size + calculate_max_row_size(args_.dtype.size()); }

  // Support delayed opening of the file if using memory mapping datasource
  // This allows only mapping of a subset of the file if using byte range
  if (source_ == nullptr) {
    assert(!filepath_.empty());
    source_ = datasource::create(filepath_, range_offset, map_range_size);
  }

  if (!source_->is_empty()) {
    auto data_size = (map_range_size != 0) ? map_range_size : source_->size();
    buffer_        = source_->host_read(range_offset, data_size);
  }

  byte_range_offset_ = range_offset;
  byte_range_size_   = range_size;
  load_whole_file_   = byte_range_offset_ == 0 && byte_range_size_ == 0;
}

void reader::impl::decompress_input(cudaStream_t stream)
{
  const auto compression_type = infer_compression_type(
    args_.compression, filepath_, {{"gz", "gzip"}, {"zip", "zip"}, {"bz2", "bz2"}, {"xz", "xz"}});
  if (compression_type == "none") {
    // Do not use the owner vector here to avoid extra copy
    uncomp_data_ = reinterpret_cast<const char *>(buffer_->data());
    uncomp_size_ = buffer_->size();
  } else {
    uncomp_data_owner_ = getUncompressedHostData(
      reinterpret_cast<const char *>(buffer_->data()), buffer_->size(), compression_type);
    uncomp_data_ = uncomp_data_owner_.data();
    uncomp_size_ = uncomp_data_owner_.size();
  }
  if (load_whole_file_) data_ = rmm::device_buffer(uncomp_data_, uncomp_size_, stream);
}
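
For illustration only, the decorator idea proposed above might look roughly like the following. This is a minimal sketch and not the libcudf API: host_source, memory_source, decompressing_source, make_source and rle_decode are all hypothetical names, and a toy run-length codec stands in for the real gzip/zip/bz2/xz path (which the readers currently reach through getUncompressedHostData). The point is only that the wrapper decompresses once and then serves byte ranges of the uncompressed data, so a reader would no longer need its own decompress_input step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal source interface; this is NOT cudf::io::datasource.
class host_source {
 public:
  virtual ~host_source() = default;
  virtual std::size_t size() const = 0;
  // Returns up to `size` bytes starting at `offset`, clamped to the end of the data.
  virtual std::vector<char> host_read(std::size_t offset, std::size_t size) const = 0;
};

// Plain in-memory source that serves the bytes it was given.
class memory_source : public host_source {
 public:
  explicit memory_source(std::vector<char> bytes) : bytes_(std::move(bytes)) {}
  std::size_t size() const override { return bytes_.size(); }
  std::vector<char> host_read(std::size_t offset, std::size_t size) const override
  {
    auto const begin = std::min(offset, bytes_.size());
    auto const count = std::min(size, bytes_.size() - begin);
    return std::vector<char>(bytes_.data() + begin, bytes_.data() + begin + count);
  }

 private:
  std::vector<char> bytes_;
};

// Any callable that turns compressed bytes into uncompressed bytes.
using decompressor = std::function<std::vector<char>(std::vector<char> const&)>;

// Decorator: decompresses the wrapped source once, then serves byte ranges
// of the *uncompressed* data, independent of the reader.
class decompressing_source : public host_source {
 public:
  decompressing_source(std::unique_ptr<host_source> inner, decompressor const& decompress)
    : uncompressed_(decompress_all(inner.get(), decompress))
  {
  }
  std::size_t size() const override { return uncompressed_.size(); }
  std::vector<char> host_read(std::size_t offset, std::size_t size) const override
  {
    return uncompressed_.host_read(offset, size);
  }

 private:
  static std::vector<char> decompress_all(host_source const* inner, decompressor const& decompress)
  {
    assert(inner != nullptr);
    return decompress(inner->host_read(0, inner->size()));
  }
  memory_source uncompressed_;
};

// Toy stand-in codec (assumption): input is a sequence of (count, byte) pairs.
std::vector<char> rle_decode(std::vector<char> const& in)
{
  std::vector<char> out;
  for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
    auto const count = static_cast<std::size_t>(static_cast<unsigned char>(in[i]));
    out.insert(out.end(), count, in[i + 1]);
  }
  return out;
}

// Hypothetical factory: "none" returns the raw source, "rle" wraps it in the decorator.
std::unique_ptr<host_source> make_source(std::vector<char> bytes, std::string const& compression)
{
  auto raw = std::make_unique<memory_source>(std::move(bytes));
  if (compression == "none") { return raw; }
  if (compression == "rle") {
    return std::make_unique<decompressing_source>(std::move(raw), rle_decode);
  }
  throw std::invalid_argument("unsupported compression type: " + compression);
}
```

With something like this, a reader would call make_source(...) once and then host_read(range_offset, range_size) on the result, with the byte range applied to the uncompressed data. Whether eager decompression in the constructor is acceptable for large inputs is a design question this sketch does not settle.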

@cwharris cwharris added feature request New feature or request Needs Triage Need team to review and classify cuIO cuIO issue tech debt labels Sep 9, 2020
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 15, 2020
@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@vuule
Contributor

vuule commented May 19, 2021

@cwharris should we close this one?
