[REVIEW] Expose the HOST_BUFFER parquet sink to python #5094

Closed
wants to merge 3 commits into from


benfred
Member

@benfred benfred commented May 5, 2020

Expose the HOST_BUFFER sink type to python for parquet files.
(As discussed here: #5061 (comment) )

@benfred benfred requested a review from a team as a code owner May 5, 2020 04:12
@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

@benfred benfred changed the title [WIP] Expose the HOST_BUFFER parquet sink to python [REVIEW] Expose the HOST_BUFFER parquet sink to python May 5, 2020
@codecov

codecov bot commented May 5, 2020

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.15@08bccff).
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.15    #5094   +/-   ##
==============================================
  Coverage               ?   88.44%           
==============================================
  Files                  ?       54           
  Lines                  ?    10267           
  Branches               ?        0           
==============================================
  Hits                   ?     9081           
  Misses                 ?     1186           
  Partials               ?        0           

@@ -13,3 +14,5 @@
    read_parquet_metadata,
    write_to_dataset,
)

HostBuffer = cudf._lib.io.utils.HostBuffer
Collaborator

Could we just do from cudf._lib.io.utils import HostBuffer?

Member Author

I could write from cudf._lib.io.utils import HostBuffer  # noqa, but I think we'd need the noqa tag; otherwise flake8 will complain about an unused import. (I was trying to be consistent here with #4870 (comment) =)

Comment on lines +612 to +621
def test_parquet_writer_host_buffer(tmpdir, simple_gdf):
    buffer = cudf.io.HostBuffer()
    simple_gdf.to_parquet(buffer)

    assert_eq(cudf.read_parquet(buffer), simple_gdf)

    gdf_fname = tmpdir.join("gdf.parquet")
    with open(gdf_fname, "wb") as o:
        o.write(buffer.read())
    assert_eq(cudf.read_parquet(gdf_fname), simple_gdf)
Collaborator

Could we also add a test that reserves the size of the HostBuffer() up front, just to make sure that doesn't break things? :)

        # Write HostBuffer to disk
        shutil.copyfileobj(buffer, open("output.parquet", "wb"))
    """
    cdef vector[char] buf
Collaborator

Just a note: RMM will have the ability to allocate pinned host memory in the future which will be faster for moving device <--> host as well as allow things to remain asynchronous.

Comment on lines +105 to +106
        if initial_capacity:
            self.buf.reserve(initial_capacity)
Collaborator

There's no need to check it here; passing 0 should have no effect:

Suggested change
-        if initial_capacity:
-            self.buf.reserve(initial_capacity)
+        self.buf.reserve(initial_capacity)


    def read(self, int n=-1):
        if self.pos >= self.buf.size():
            return b""
Collaborator

It's a little bit awkward that this returns an empty bytes object while the code below returns a memoryview. Can we return an empty memoryview?
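
To illustrate the request, here is a minimal pure-Python sketch of a read() that returns a memoryview in every case, including at end of buffer. CursorBuffer is a hypothetical stand-in built on a bytearray; it is not the Cython class from this PR, and the real class may need a different approach.

```python
import io
import shutil


class CursorBuffer:
    """Hypothetical pure-Python stand-in for the Cython class under review."""

    def __init__(self, data=b""):
        self._buf = bytearray(data)
        self._pos = 0

    def read(self, n=-1):
        # Always slice a read-only view, so end of buffer yields an empty
        # memoryview rather than a different type such as b"".
        remaining = len(self._buf) - self._pos
        count = remaining if n < 0 else min(n, remaining)
        start = self._pos
        self._pos += count
        return memoryview(self._buf).toreadonly()[start:start + count]


src = CursorBuffer(b"abcdef")
dst = io.BytesIO()
# copyfileobj stops when read() returns a falsy value; an empty
# memoryview has length 0 and is therefore falsy.
shutil.copyfileobj(src, dst)
```

Because an empty memoryview is falsy, callers that loop until read() returns something falsy should behave the same as they would with b"".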

@@ -67,3 +72,61 @@ cdef cppclass iobase_data_sink(data_sink):

    size_t bytes_written() with gil:
        return buf.tell()


cdef class HostBuffer:
Collaborator

Do we want to consider exposing buffer protocol for this class?

Member

Should we use this name here? Thought this was going to be reserved for the pinned memory allocation made by RMM (once that arrives). ( rapidsai/rmm#260 )

Collaborator

Ah good call, yes we shouldn't use HostBuffer here.

        self.pos += count

        return PyMemoryView_FromMemory(self.buf.data() + start, count,
                                       PyBUF_READ)
Member

Seems better to support the buffer protocol on this class instead. Otherwise this could refer to an invalid memory segment that Python has already garbage collected.
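
As a rough analogy for why the buffer protocol is safer, the sketch below uses the standard library's bytearray, which implements that protocol. A memoryview obtained this way holds a reference to its exporter, and the exporter refuses to resize while a view is outstanding. This is stdlib behaviour only; it does not show how the class in this PR would behave.

```python
data = bytearray(b"parquet bytes")
view = memoryview(data)  # registers an active export on `data`

try:
    # Growing the bytearray could move its storage out from under the view,
    # so the exporter rejects the resize while the export is alive.
    data.extend(b"!")
    resized = True
except BufferError:
    resized = False

view.release()       # drop the export explicitly
data.extend(b"!")    # resizing is allowed again
```

A view created from a raw pointer, as with PyMemoryView_FromMemory, has no owning object to keep alive or to consult before the underlying memory changes, which is the risk raised above.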

Member Author

Sounds good.

We already have code that adapts a C++ byte vector to the buffer protocol here:

cdef class BufferArrayFromVector:
    cdef Py_ssize_t length
    cdef unique_ptr[vector[uint8_t]] in_vec
    # these two things declare part of the buffer interface
    cdef Py_ssize_t shape[1]
    cdef Py_ssize_t strides[1]
    @staticmethod
    cdef BufferArrayFromVector from_unique_ptr(
        unique_ptr[vector[uint8_t]] in_vec
    ):
        cdef BufferArrayFromVector buf = BufferArrayFromVector()
        buf.in_vec = move(in_vec)
        buf.length = dereference(buf.in_vec).size()
        return buf
    def __getbuffer__(self, Py_buffer *buffer, int flags):
        cdef Py_ssize_t itemsize = sizeof(uint8_t)
        self.shape[0] = self.length
        self.strides[0] = 1
        buffer.buf = dereference(self.in_vec).data()
        buffer.format = NULL  # byte
        buffer.internal = NULL
        buffer.itemsize = itemsize
        buffer.len = self.length * itemsize  # product(shape) * itemsize
        buffer.ndim = 1
        buffer.obj = self
        buffer.readonly = 0
        buffer.shape = self.shape
        buffer.strides = self.strides
        buffer.suboffsets = NULL
    def __releasebuffer__(self, Py_buffer *buffer):
        pass

Should we use that class instead of creating a new one? The only real difference is between the vector<char> expected by the HOST_BUFFER sink and the vector<uint8_t> in this class (and I think we could either cast the C++ vector or use Cython fused types to get around that).

Collaborator

I think we actually have 2 instances of classes like this one where ideally we could standardize them.

@harrism harrism added the 3 - Ready for Review Ready for review by team label May 12, 2020
@harrism harrism added Cython Python Affects Python cuDF API. cuIO cuIO issue labels May 12, 2020
@harrism harrism changed the base branch from branch-0.14 to branch-0.15 May 28, 2020 01:02
@harrism
Member

harrism commented May 28, 2020

No updates in a while, retargeting to 0.15.

@kkraus14
Collaborator

@benfred I'm closing this for now since it's gone stale, feel free to reopen if / when you're looking to work on it further.

@kkraus14 kkraus14 closed this Aug 17, 2020
Labels
3 - Ready for Review Ready for review by team cuIO cuIO issue Python Affects Python cuDF API.
5 participants