Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Arrow dependency from the datasource.hpp public header #13698

Merged
merged 18 commits into from
Aug 10, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 6 additions & 14 deletions cpp/include/cudf/io/arrow_io_source.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -19,20 +19,14 @@
#include "datasource.hpp"

#include <arrow/buffer.h>

#include <arrow/filesystem/filesystem.h>
#include <arrow/filesystem/s3fs.h>
#include <arrow/io/file.h>
#include <arrow/io/interfaces.h>
#include <arrow/io/memory.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <memory>
#include <string>

namespace cudf::io {

/**
* @addtogroup io_datasources
* @{
Expand Down Expand Up @@ -65,13 +59,12 @@ class arrow_io_source : public datasource {
*
* @param arrow_uri Apache Arrow Filesystem URI
*/
explicit arrow_io_source(std::string_view arrow_uri)
explicit arrow_io_source(std::string const& arrow_uri)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same to all other member functions. Maybe considering separating class declaration from definition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to keep this class separate from libcudf, but we don't have to take that route.
@davidwendt thoughts on keeping the implementation in libcudf?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what you mean by keeping this separate from libcudf.
I think Yunsong means moves the implementation to cpp/src/io/arrow_io_source.cpp and keep only the declarations in this header file. I agree with that approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that I will need to compile cpp/src/io/arrow_io_source.cpp as part of libcudf. Not an issue, just different from what I imagined in the beginning.

{
std::string const uri_start_delimiter = "//";
std::string const uri_end_delimiter = "?";

arrow::Result<std::shared_ptr<arrow::fs::FileSystem>> result =
arrow::fs::FileSystemFromUri(static_cast<std::string>(arrow_uri));
auto const result = arrow::fs::FileSystemFromUri(arrow_uri);
CUDF_EXPECTS(result.ok(), "Failed to generate Arrow Filesystem instance from URI.");
filesystem = result.ValueOrDie();

Expand All @@ -82,8 +75,7 @@ class arrow_io_source : public datasource {
size_t end = arrow_uri.find(uri_end_delimiter) - start;
std::string_view path = arrow_uri.substr(start, end);

arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> in_stream =
filesystem->OpenInputFile(static_cast<std::string>(path).c_str());
auto const in_stream = filesystem->OpenInputFile(static_cast<std::string>(path).c_str());
CUDF_EXPECTS(in_stream.ok(), "Failed to open Arrow RandomAccessFile");
arrow_file = in_stream.ValueOrDie();
}
Expand All @@ -104,7 +96,7 @@ class arrow_io_source : public datasource {
*/
std::unique_ptr<buffer> host_read(size_t offset, size_t size) override
{
auto result = arrow_file->ReadAt(offset, size);
auto const result = arrow_file->ReadAt(offset, size);
CUDF_EXPECTS(result.ok(), "Cannot read file data");
return std::make_unique<arrow_io_buffer>(result.ValueOrDie());
}
Expand All @@ -119,7 +111,7 @@ class arrow_io_source : public datasource {
*/
size_t host_read(size_t offset, size_t size, uint8_t* dst) override
{
auto result = arrow_file->ReadAt(offset, size, dst);
auto const result = arrow_file->ReadAt(offset, size, dst);
CUDF_EXPECTS(result.ok(), "Cannot read file data");
return result.ValueOrDie();
}
Expand All @@ -131,7 +123,7 @@ class arrow_io_source : public datasource {
*/
[[nodiscard]] size_t size() const override
{
auto result = arrow_file->GetSize();
auto const result = arrow_file->GetSize();
CUDF_EXPECTS(result.ok(), "Cannot get file size");
return result.ValueOrDie();
}
Expand Down
2 changes: 1 addition & 1 deletion python/cudf/cudf/_lib/cpp/io/arrow_io_source.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -11,5 +11,5 @@ cdef extern from "cudf/io/arrow_io_source.hpp" \
namespace "cudf::io" nogil:

cdef cppclass arrow_io_source(cudf_io_datasource.datasource):
arrow_io_source(string arrow_uri) except +
arrow_io_source(const string& arrow_uri) except +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

east const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason I only see "const string" in pxd files so I kept this consistent with other instances. I'm not sure if our east const guideline reached Cython.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cython doesn't support east const.

arrow_io_source(shared_ptr[CRandomAccessFile]) except +