[REVIEW] Expose the HOST_BUFFER parquet sink to python #5094

Closed

benfred wants to merge 3 commits into rapidsai:branch-0.15 from benfred:parquet_hostbuffer

Member

benfred commented May 5, 2020

Expose the HOST_BUFFER sink type to python for parquet files.
(As discussed here: #5061 (comment) )


          Expose the HOST_BUFFER parquet sink to python

68d1841

Expose the HOST_BUFFER sink type to python for parquet files.
(rapidsai#5061 (comment))

benfred requested a review from a team as a code owner

May 5, 2020 04:12

Collaborator

GPUtester commented May 5, 2020

Please update the changelog in order to start CI tests.

View the gpuCI docs here.


          Update changelog

46c10a8

benfred changed the title ~~[WIP] Expose the HOST_BUFFER parquet sink to python~~ [REVIEW] Expose the HOST_BUFFER parquet sink to python


          isort

1f17eb1

codecov bot commented May 5, 2020 (edited)

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.15@08bccff). Click here to learn what that means.
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.15    #5094   +/-   ##
==============================================
  Coverage               ?   88.44%           
==============================================
  Files                  ?       54           
  Lines                  ?    10267           
  Branches               ?        0           
==============================================
  Hits                   ?     9081           
  Misses                 ?     1186           
  Partials               ?        0

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08bccff...1f17eb1. Read the comment docs.

kkraus14 requested changes

python/cudf/cudf/io/__init__.py

@@ @@ -13,3 +14,5 @@ @@
                   read_parquet_metadata,
                   write_to_dataset,
               )
+              HostBuffer = cudf._lib.io.utils.HostBuffer

Collaborator

kkraus14 May 5, 2020

Could we just do from cudf._lib.io.utils import HostBuffer?

Member Author

benfred May 7, 2020

I could use from cudf._lib.io.utils import HostBuffer # noqa, but I think we'd need the noqa tag, otherwise flake8 will complain about unused imports. (I was trying to be consistent here with #4870 (comment) =)

python/cudf/cudf/tests/test_parquet.py

Comment on lines +612 to +621

+              def test_parquet_writer_host_buffer(tmpdir, simple_gdf):
+                  buffer = cudf.io.HostBuffer()
+                  simple_gdf.to_parquet(buffer)
+                  assert_eq(cudf.read_parquet(buffer), simple_gdf)
+                  gdf_fname = tmpdir.join("gdf.parquet")
+                  with open(gdf_fname, "wb") as o:
+                      o.write(buffer.read())
+                  assert_eq(cudf.read_parquet(gdf_fname), simple_gdf)

Collaborator

kkraus14 May 5, 2020

Could we also add a test that reserves the size of the HostBuffer() up front, just to make sure that doesn't break things? :)
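As an illustration of what such a test would guard against, here is a minimal pure-Python sketch. `FakeHostBuffer` and `roundtrip` are hypothetical stand-ins, not part of cudf; the real test would presumably build `cudf.io.HostBuffer` with a reserved size and write `simple_gdf` to it. The point is that reserving capacity must not change the logical size, so the bytes read back are identical whatever the reservation.

```python
class FakeHostBuffer:
    """Hypothetical pure-Python stand-in for the Cython HostBuffer in this PR."""

    def __init__(self, initial_capacity=0):
        # Pre-allocate storage but keep the logical size at zero, mirroring
        # std::vector::reserve (capacity changes, size does not).
        self._storage = bytearray(initial_capacity)
        self._size = 0
        self._pos = 0

    def write(self, data):
        end = self._size + len(data)
        if end > len(self._storage):
            # Grow the backing storage when the reservation is too small.
            self._storage.extend(bytes(end - len(self._storage)))
        self._storage[self._size:end] = data
        self._size = end
        return len(data)

    def read(self, n=-1):
        remaining = self._size - self._pos
        count = remaining if n < 0 else min(n, remaining)
        start = self._pos
        self._pos += count
        # Only the written bytes are returned, never the unused reservation.
        return bytes(self._storage[start:start + count])


def roundtrip(initial_capacity, payload=b"PAR1 example payload PAR1"):
    """Write the payload in two chunks and read everything back."""
    buf = FakeHostBuffer(initial_capacity)
    buf.write(payload[:10])
    buf.write(payload[10:])
    return buf.read()


# The result must not depend on how much space was reserved up front.
for capacity in (0, 1, 16, 4096):
    assert roundtrip(capacity) == b"PAR1 example payload PAR1"
```

In a pytest suite the loop would more naturally be a `pytest.mark.parametrize` over the reserved sizes.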

python/cudf/cudf/_lib/io/utils.pyx

+                    # Write HostBuffer to disk
+                    shutil.copyfileobj(buffer, open("output.parquet", "wb"))
+                  """
+                  cdef vector[char] buf

Collaborator

kkraus14 May 5, 2020

Just a note: RMM will have the ability to allocate pinned host memory in the future which will be faster for moving device <--> host as well as allow things to remain asynchronous.

python/cudf/cudf/_lib/io/utils.pyx

Comment on lines +105 to +106

                    if initial_capacity:
                        self.buf.reserve(initial_capacity)

Collaborator

kkraus14 May 5, 2020

Don't need to check it here, passing 0 should have no effect:

Suggested change

-                    if initial_capacity:
-                        self.buf.reserve(initial_capacity)
+                    self.buf.reserve(initial_capacity)

python/cudf/cudf/_lib/io/utils.pyx

+                  def read(self, int n=-1):
+                      if self.pos >= self.buf.size():
+                          return b""

Collaborator

kkraus14 May 5, 2020

It's a little bit awkward that this returns an empty bytes object while below returns a MemoryView. Can we return an empty memoryview?
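A minimal pure-Python sketch of the consistent return type being asked for. `ByteCursor` is a hypothetical stand-in for the reader side of the class in this PR, not the Cython implementation: both the exhausted branch and the normal branch return a `memoryview`.

```python
class ByteCursor:
    """Hypothetical stand-in for the reader side of the buffer class in this PR."""

    def __init__(self, data):
        self._view = memoryview(data)
        self._pos = 0

    def read(self, n=-1):
        size = len(self._view)
        if self._pos >= size:
            # Exhausted: return an empty memoryview rather than b"",
            # so callers always get the same type back.
            return memoryview(b"")
        count = size - self._pos if n < 0 else min(n, size - self._pos)
        start = self._pos
        self._pos += count
        return self._view[start:start + count]


cursor = ByteCursor(b"abcdef")
first = cursor.read(4)  # view over the first four bytes
rest = cursor.read()    # view over the remaining bytes
done = cursor.read()    # empty view once the data is exhausted
```

Callers that need `bytes` can still call `bytes(...)` on any of these results, including the empty one.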

python/cudf/cudf/_lib/io/utils.pyx

@@ @@ -67,3 +72,61 @@ cdef cppclass iobase_data_sink(data_sink): @@
                   size_t bytes_written() with gil:
                       return buf.tell()
+              cdef class HostBuffer:

Collaborator

kkraus14 May 5, 2020

Do we want to consider exposing buffer protocol for this class?

Member

jakirkham May 5, 2020

Should we use this name here? Thought this was going to be reserved for the pinned memory allocation made by RMM (once that arrives). ( rapidsai/rmm#260 )

Collaborator

kkraus14 May 5, 2020

Ah good call, yes we shouldn't use HostBuffer here.

jakirkham reviewed

python/cudf/cudf/_lib/io/utils.pyx

+                      self.pos += count
+                      return PyMemoryView_FromMemory(self.buf.data() + start, count,
+                                                     PyBUF_READ)

Member

jakirkham May 5, 2020

Seems better to support the buffer protocol on this class instead. Otherwise this could refer to an invalid memory segment that Python has already garbage collected.
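The ownership difference can be seen from plain Python, using `bytearray` as a stand-in for a class that implements the buffer protocol (this is an illustration of CPython behaviour, not code from this PR). A `memoryview` obtained through the buffer protocol holds a reference to its exporter, and the exporter refuses to resize while the view is alive; a view built from a raw pointer has no such owner.

```python
data = bytearray(b"parquet bytes")
view = memoryview(data)

# The view keeps a reference to the exporting object, so the
# underlying memory cannot be collected while the view exists.
assert view.obj is data

# While a view is exported, the exporter refuses to resize, because a
# resize could move the memory and leave the view dangling.
try:
    data.extend(b"!")
except BufferError:
    resize_blocked = True
else:
    resize_blocked = False

# Once the view is released, resizing works again.
view.release()
data.extend(b"!")
```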

Member Author

benfred May 7, 2020

Sounds good.

We have code already that adapts a c++ byte vector to the buffer protocol here:

cudf/python/cudf/cudf/_lib/parquet.pyx

Lines 49 to 86 in 85ce695

    
    cdef class BufferArrayFromVector:
        cdef Py_ssize_t length
        cdef unique_ptr[vector[uint8_t]] in_vec

        # these two things declare part of the buffer interface
        cdef Py_ssize_t shape[1]
        cdef Py_ssize_t strides[1]

        @staticmethod
        cdef BufferArrayFromVector from_unique_ptr(
            unique_ptr[vector[uint8_t]] in_vec
        ):
            cdef BufferArrayFromVector buf = BufferArrayFromVector()
            buf.in_vec = move(in_vec)
            buf.length = dereference(buf.in_vec).size()
            return buf

        def __getbuffer__(self, Py_buffer *buffer, int flags):
            cdef Py_ssize_t itemsize = sizeof(uint8_t)

            self.shape[0] = self.length
            self.strides[0] = 1

            buffer.buf = dereference(self.in_vec).data()
            buffer.format = NULL  # byte
            buffer.internal = NULL
            buffer.itemsize = itemsize
            buffer.len = self.length * itemsize   # product(shape) * itemsize
            buffer.ndim = 1
            buffer.obj = self
            buffer.readonly = 0
            buffer.shape = self.shape
            buffer.strides = self.strides
            buffer.suboffsets = NULL

        def __releasebuffer__(self, Py_buffer *buffer):
            pass

Should we use that class instead of creating a new one? The only real difference is that the HOST_BUFFER sink expects a vector<char>, while this class holds a vector<uint8_t> (and I think we could either cast the C++ vector or use Cython fused types to get around that).

Collaborator

kkraus14 May 7, 2020

I think we actually have 2 instances of classes like this one where ideally we could standardize them.
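On the `vector<char>` versus `vector<uint8_t>` difference mentioned above: both element types are one byte wide, so the same memory can be viewed either way without copying. A rough Python analogy (not the Cython cast itself, which is not shown here) is `memoryview.cast`, which reinterprets the same bytes as signed or unsigned:

```python
raw = bytes([0, 127, 128, 255])

# Default view of bytes: unsigned 8-bit items (format "B").
unsigned_view = memoryview(raw)

# Same memory reinterpreted as signed 8-bit items (format "b"); no copy is made.
signed_view = unsigned_view.cast("b")

unsigned_values = unsigned_view.tolist()
signed_values = signed_view.tolist()
```

Here `unsigned_values` is `[0, 127, 128, 255]` and `signed_values` is `[0, 127, -128, -1]`; the bytes are identical and only their interpretation changes.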

harrism added the 3 - Ready for Review label

harrism added Cython Python cuIO labels

harrism changed the base branch from branch-0.14 to branch-0.15

May 28, 2020 01:02

Member

harrism commented May 28, 2020

No updates in a while, retargeting to 0.15.

Collaborator

kkraus14 commented Aug 17, 2020

@benfred I'm closing this for now since it's gone stale, feel free to reopen if / when you're looking to work on it further.

kkraus14 closed this

Labels

3 - Ready for Review cuIO Python