ARROW-14495: [Python] Fix DictionaryArray.from_buffers, should not crash #13989

Merged 4 commits on Sep 1, 2022
47 changes: 47 additions & 0 deletions python/pyarrow/array.pxi
@@ -2478,6 +2478,53 @@ cdef class DictionaryArray(Array):

        return self._indices

    @staticmethod
    def from_buffers(DataType type, int64_t length, buffers, Array dictionary,
                     int64_t null_count=-1, int64_t offset=0):
        """
        Construct a DictionaryArray from buffers.

        Parameters
        ----------
        type : pyarrow.DataType
        length : int
            The number of values in the array.
        buffers : List[Buffer]
            The buffers backing the indices array.
        dictionary : pyarrow.Array, ndarray or pandas.Series
            The array of values referenced by the indices.
        null_count : int, default -1
            The number of null entries in the indices array. Negative value means that
            the null count is not known.
        offset : int, default 0
            The array's logical offset (in values, not in bytes) from the
            start of each buffer.

        Returns
        -------
        dict_array : DictionaryArray
        """
        cdef:
            vector[shared_ptr[CBuffer]] c_buffers
            shared_ptr[CDataType] c_type
            shared_ptr[CArrayData] c_data
            shared_ptr[CArray] c_result

        for buf in buffers:
            c_buffers.push_back(pyarrow_unwrap_buffer(buf))

        c_type = pyarrow_unwrap_data_type(type)

        with nogil:
            c_data = CArrayData.Make(
                c_type, length, c_buffers, null_count, offset)
            c_data.get().dictionary = dictionary.sp_array.get().data()
Review comment (Member):

There is an ArrayData::Make variant that directly accepts a dictionary, but it might not be worth exposing for just this one line.

Reply (Contributor Author):

I saw that, and it appears to do essentially the same thing: it creates the ArrayData and then sets the dictionary member. Doing it this way, we don't need to pass (null?) child data. That said, if that variant is the preferred method, I have no problem switching. 👍

            c_result.reset(new CDictionaryArray(c_data))

        cdef Array result = pyarrow_wrap_array(c_result)
        result.validate()
        return result

    @staticmethod
    def from_arrays(indices, dictionary, mask=None, bint ordered=False,
                    bint from_pandas=False, bint safe=True,
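
To make the `length` and `offset` parameters of `from_buffers` concrete, here is a pure-Python model of how a dictionary-encoded array resolves its logical values. This is only a sketch of the semantics described in the docstring, not how pyarrow implements it: `decode_dictionary` is a hypothetical helper, and the indices buffer is modeled as a plain list in which `None` stands for a null entry.

```python
from typing import List, Optional, Sequence


def decode_dictionary(
    indices: Sequence[Optional[int]],
    dictionary: Sequence[str],
    length: int,
    offset: int = 0,
) -> List[Optional[str]]:
    """Hypothetical helper: resolve the values of a dictionary-encoded array.

    `offset` and `length` count values (not bytes) into the indices storage.
    """
    if offset < 0 or length < 0 or offset + length > len(indices):
        raise ValueError("offset + length exceeds the indices storage")
    # Only the window [offset, offset + length) is logically part of the array.
    window = indices[offset:offset + length]
    # A null index stays null; every other index looks up the dictionary.
    return [None if i is None else dictionary[i] for i in window]


dictionary = ["one", "two", "three"]
indices = [0, 1, 2, 1, 0]  # encodes: one, two, three, two, one

full = decode_dictionary(indices, dictionary, length=5)
tail = decode_dictionary(indices, dictionary, length=4, offset=1)
```

With `offset=1` and `length=4`, the same storage describes the array without its first value, which is the case the parametrized test in this PR exercises.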
1 change: 1 addition & 0 deletions python/pyarrow/includes/libarrow.pxd
@@ -250,6 +250,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil:
        CDictionaryArray(const shared_ptr[CDataType]& type,
                         const shared_ptr[CArray]& indices,
                         const shared_ptr[CArray]& dictionary)
        CDictionaryArray(const shared_ptr[CArrayData]& data)

        @staticmethod
        CResult[shared_ptr[CArray]] FromArrays(
9 changes: 9 additions & 0 deletions python/pyarrow/tests/test_array.py
@@ -725,6 +725,15 @@ def test_struct_array_from_chunked():
        pa.StructArray.from_arrays([chunked_arr], ["foo"])


@pytest.mark.parametrize("offset", (0, 1))
def test_dictionary_from_buffers(offset):
    a = pa.array(["one", "two", "three", "two", "one"]).dictionary_encode()
    b = pa.DictionaryArray.from_buffers(a.type, len(a)-offset,
                                        a.indices.buffers(), a.dictionary,
                                        offset=offset)
    assert a[offset:] == b


def test_dictionary_from_numpy():
    indices = np.repeat([0, 1, 2], 2)
    dictionary = np.array(['foo', 'bar', 'baz'], dtype=object)