ENH: investigate using a bitarray as the mask in the nullable/masked ExtensionArrays #31293

jorisvandenbossche · 2020-01-24T19:35:20Z

Currently, our nullable / masked extension arrays (boolean and integer, for now) use a numpy boolean array as their _mask to keep track of missing values. A potential route for improving memory use and performance would be to use a bitarray instead of a boolean numpy array (which takes a byte per value).
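To put a rough number on the memory side, here is a minimal sketch that packs a boolean mask into bits with `np.packbits`. The array size and the 10% missing rate are arbitrary choices for illustration, and this is not how pandas stores its mask today.

```python
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)

# Boolean mask as used today: one byte per element (True marks a missing value).
mask = rng.random(n) < 0.1

# Same information packed at one bit per element.
packed = np.packbits(mask, bitorder="little")

bool_nbytes = mask.nbytes      # n bytes
packed_nbytes = packed.nbytes  # ceil(n / 8) bytes

# Unpacking restores the original boolean mask.
roundtrip = np.unpackbits(packed, count=n, bitorder="little").astype(bool)
```

For a million values the mask goes from about 1 MB to about 125 KB, at the cost of an unpack step whenever a plain boolean array is needed.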

This requires some exploration: what are the options for implementing this (existing libraries, custom implementation)? What is the performance impact? (Some things like masking will be slower, since we still rely on numpy for that, which needs boolean arrays.) Is it worth doing a custom implementation rather than using pyarrow for this? Etc.
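The performance trade-off can be sketched as follows: some reductions can run directly on the packed bytes, while numpy boolean indexing needs the mask unpacked first. The helper names `has_na`, `count_na`, and `valid_values` are hypothetical and exist only for this illustration.

```python
import numpy as np

values = np.arange(10, dtype="int64")
mask = np.zeros(10, dtype=bool)
mask[[2, 7]] = True  # positions 2 and 7 are missing

packed = np.packbits(mask, bitorder="little")


def has_na(packed):
    # Any non-zero byte means at least one bit is set; no unpacking needed.
    return bool(packed.any())


def count_na(packed):
    # Counts set bits; padding bits written by packbits are always zero.
    return int(np.unpackbits(packed).sum())


def valid_values(values, packed):
    # numpy boolean indexing needs a real boolean array, so unpack first.
    na = np.unpackbits(packed, count=len(values), bitorder="little").astype(bool)
    return values[~na]


any_missing = has_na(packed)
n_missing = count_na(packed)
kept = valid_values(values, packed)
```

The extra unpack in `valid_values` is the kind of overhead that would need benchmarking before deciding whether the memory saving is worth it.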

jbrockmendel · 2020-01-25T02:13:13Z

xref #21839.

I asked @seberg if numpy had thought about implementing this directly and got the impression it had been discussed but never as a priority. I'm sure we could implement this here, but it definitely seems like a better fit for numpy.

Does pyarrow have an implementation? (or will it in the not-too-distant future?)

seberg · 2020-01-25T02:31:23Z

The issue with doing this within NumPy is simple. NumPy has a strided memory layout, which implies byte-sized strides to reach elements. To support bit-sized elements we would have to ensure that everything inside NumPy can somehow deal with sub-byte strides (when the dtype or a flag requests it).
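A short illustration of the stride point: `ndarray.strides` is expressed in whole bytes, so the smallest step between elements is one byte. A one-bit element would need a step of 1/8 byte, which this representation cannot express.

```python
import numpy as np

a = np.zeros(16, dtype=bool)

item_size = a.itemsize          # bytes per element for the bool dtype
contiguous_strides = a.strides  # step between neighbours, in bytes
sliced_strides = a[::2].strides # every second element doubles the step

# In 2D, each axis has its own byte stride.
grid_strides = np.zeros((4, 4), dtype=bool).strides
```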

It could be interesting to explore, but it is likely a can of worms, especially since strides are somewhat public API. Making a bitarray class that behaves mostly like an ndarray may be better/easier, although I am not sure how complete it has to be to be good enough for this use case.

jorisvandenbossche · 2020-01-25T07:35:40Z

Also, for pandas we "just" need a 1D bitarray, which I assume will make it a lot simpler than the full multidimensional requirements for numpy.
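As a rough idea of how small a 1D-only container could be, here is a minimal sketch built on a packed `uint8` buffer. The class name `BitMask` and its methods are hypothetical and not part of pandas, numpy, or the bitarray library; a real implementation would also need slicing, setting, and logical operations.

```python
import numpy as np


class BitMask:
    """Hypothetical 1D packed boolean mask (illustration only)."""

    def __init__(self, packed, length):
        self._packed = packed  # uint8 buffer, 8 mask values per byte
        self._length = length  # number of valid bits

    @classmethod
    def from_bool(cls, mask):
        mask = np.asarray(mask, dtype=bool)
        return cls(np.packbits(mask, bitorder="little"), len(mask))

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        if not -self._length <= i < self._length:
            raise IndexError("index out of range")
        i %= self._length
        # Byte i // 8 holds the bit; with little bit order it sits at position i % 8.
        return bool((self._packed[i >> 3] >> (i & 7)) & 1)

    def to_bool(self):
        # Expand back to a numpy boolean array for use with numpy indexing.
        return np.unpackbits(
            self._packed, count=self._length, bitorder="little"
        ).astype(bool)


source = [False, True, False, False, False, False, False, False, True, True]
m = BitMask.from_bool(source)
```

Here `len(m)` is 10 while the buffer is only two bytes, and `m.to_bool()` gives back the original boolean array.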

In #22238, @WillAyd used https://github.com/ilanschnell/bitarray

jreback · 2020-01-25T12:50:05Z

And I made it work here: #25415

jorisvandenbossche added the Missing-data, Needs Discussion, and ExtensionArray labels Jan 24, 2020

jorisvandenbossche mentioned this issue Jan 24, 2020

ENH: Allow opting in to new dtypes on I/O routines via keyword to I/O routines #29752

Closed

jorisvandenbossche added the NA - MaskedArrays label Jan 30, 2020

mroeschke added the Enhancement label Apr 28, 2020

jorisvandenbossche mentioned this issue Jan 8, 2021

ENH: 2D support for MaskedArray #38992

Merged

jbrockmendel mentioned this issue Dec 21, 2021

PERF: better memory footprint for intna #21839

Closed
