Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Can't provide Arrow array when filling nulls of a fixed size list array #35624

Open
spenczar opened this issue May 16, 2023 · 3 comments

Comments

@spenczar
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

Suppose you have a fixed-sized list array with null values:

array = pa.nulls(3, pa.list_(pa.float64(), 2))
print(array)
# [
#  null,
#  null,
#  null
#]

How do you fill the nulls? Like, maybe I want it to look like [[null, null], [null, null], [null, null]]. Or even [[0, 0], [0, 0], [0, 0]].

You can't call fill_null with a pyarrow array value, because the arrays don't have an 'as_py' method:

array.fill_null(pa.array([0.0, 0.0]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 1296, in pyarrow.lib.Array.fill_null
  File "/Users/swnelson/scratch/.arrow-venv/lib/python3.11/site-packages/pyarrow/compute.py", line 523, in fill_null
    fill_value = pa.scalar(fill_value.as_py(), type=values.type)
                           ^^^^^^^^^^^^^^^^
AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute 'as_py'

But you can do it with a plain old Python list!

array.fill_null([0.0, 0.0])
<pyarrow.lib.FixedSizeListArray object at 0x135647460>
[
  [
    0,
    0
  ],
  [
    0,
    0
  ],
  [
    0,
    0
  ]
]

This seems like an oversight. If I give an array of the right shape, surely it should be permitted.

Component(s)

Python

@jorisvandenbossche
Copy link
Member

Thanks for the report!

The code in question is:

if not isinstance(fill_value, (pa.Array, pa.ChunkedArray, pa.Scalar)):
fill_value = pa.scalar(fill_value, type=values.type)
elif values.type != fill_value.type:
fill_value = pa.scalar(fill_value.as_py(), type=values.type)
return call_function("coalesce", [values, fill_value])

So if the types don't match, it tries to convert the fill_value to a scalar of the correct type. There seems to be multiple things fishy about this: 1) this should probably cast to the correct type instead of going through as_py, and 2) the current version also only works for scalars, not arrays ..
We can see this with a simple (non-nested array) example as well:

>>> arr = pa.array([1, 2, None, 4, None])
>>> arr.fill_null(pa.array([10, 20, 30, 40, 50]))
<pyarrow.lib.Int64Array object at 0x7f41c56983a0>
[
  1,
  2,
  30,
  4,
  50
]
>>> arr.fill_null(pa.array([10, 20, 30, 40, 50], type="int32"))
...
AttributeError: 'pyarrow.lib.Int32Array' object has no attribute 'as_py'

Now, for your original example using a list array, the situation is a bit more complicated. Because for the case where you pass a pyarrow.Array, the types also don't match (list of float vs float), and automatically casting in that case also wouldn't work.
The reason that the python list works is because we convert that to a scalar of the correct type under the hood (the a.scalar(fill_value, type=values.type) in the code snippet above).

But so if you pass a pyarrow object, I think we will need to require that you pass a "correct" scalar instead (because when passing an array, we assume that this is meant for filling with an array element-wise, and not for filling with a scalar). If you start from a pyarrow array, you can convert this to a scalar of the correct type before passing it to fill_null, but the problem is that this also fails at the moment:

>>> pa.scalar(pa.array([0.0, 0.0]),  pa.list_(pa.float64(), 2))
...
ArrowInvalid: Could not convert <pyarrow.DoubleScalar: 0.0> with type pyarrow.lib.DoubleScalar: tried to convert to double

So that's also something we should fix.

@spenczar
Copy link
Contributor Author

Oh wow, fill_null works quite differently than I expected. Your example of passing it an array and it working elementwise is very surprising to me! fill_value's documentation is a bit misleading, then; it says

Replace each null element in values with fill_value.

But it should say something like "Replace each null element in values with a corresponding element from fill_value."

That's not a perfect phrasing, since it's a bit complicated and hard to fit into a short sentence. As I understand it, the behavior is:

  • If fill_value is scalar-like, replace each null element in values with fill_value.
  • If fill_value is array-like, replace the ith null element in values with the ith element in fill_value.

This issue seems to fracture into three issues:

Does that seem right?

@jorisvandenbossche
Copy link
Member

Oh wow, fill_null works quite differently than I expected. Your example of passing it an array and it working elementwise is very surprising to me! fill_value's documentation is a bit misleading, then; it says

Yeah, that's indeed certainly not properly documented. I am not sure this was fully intentional (maybe when it was added, mostly the scalar fill_value was considered), but this is essentially an alias for coalesce, and it is this function that supports this behaviour.
And also for example fillna in pandas has the same ability of either filling with a scalar or filling with the corresponding values of a same-length array.

Your suggested rewording sounds good!

This issue seems to fracture into three issues:

Yes, that's a good summary.

can't make an explicit scalar element of a fixed-size list (which sounds kind of like #18987 perhaps)

That's indeed a related issue. It's also related to #21761, for accepting pyarrow values in pa.array(..) constructor (and not pa.scalar(..)). For this one, there is an open PR, but it will probably depend on how that PR implements the fix whether that would also fix the scalar() version (we don't call pa.array(..) inside pa.scalar(..), but directly the lower level ConvertPySequence).

pitrou added a commit that referenced this issue May 30, 2023
…35813)

Discussed in #35624, particularly at #35624 (comment). 

Doesn't close that issue, but it came up in discussion.

Lead-authored-by: Spencer Nelson <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants