ENH: investigate using a bitarray as the mask in the nullable/masked ExtensionArrays #31293

Open
jorisvandenbossche opened this issue Jan 24, 2020 · 4 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Member

Currently, our nullable / masked extension arrays (boolean and integer, for now) use a numpy boolean array as their _mask to keep track of missing values. A potential route for improving memory use and performance would be to use a bitarray instead of a boolean numpy array (which takes one byte per value).
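
To make the memory argument concrete, here is a minimal sketch comparing a boolean mask with a bit-packed copy of it. It uses `np.packbits` only as a stand-in for a real bitarray; the array size and the 10% missing rate are arbitrary assumptions for illustration.

```python
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)

# Boolean mask in the current style: True marks a missing value,
# and every element occupies one byte.
mask = rng.random(n) < 0.1

# Bit-packed equivalent: eight mask entries share one byte.
packed = np.packbits(mask)

print(mask.nbytes)    # 1000000
print(packed.nbytes)  # 125000
```

The packed form is eight times smaller, but it is a plain `uint8` array, so it cannot be used directly for boolean indexing.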

This would require some exploration: what are the options for implementing this (existing libraries, custom implementation)? What is the performance impact? (Some operations, such as masking, will be slower, since we still rely on numpy for them and numpy needs boolean arrays.) Is it worth doing a custom implementation rather than using pyarrow for this? Etc.
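
A rough sketch of the trade-off mentioned above, again assuming `np.packbits` / `np.unpackbits` as a stand-in for a bitarray: boolean indexing needs the bits expanded back to a boolean array first, while some operations (such as combining two masks) can stay on the packed bytes. The sample values are made up.

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
mask = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0, 1], dtype=bool)  # True = missing

packed = np.packbits(mask)  # 2 bytes instead of 10

# numpy boolean indexing needs a bool array, so the bits are expanded first.
# This extra step is the cost that would have to be measured.
unpacked = np.unpackbits(packed, count=len(values)).astype(bool)
valid = values[~unpacked]

# Combining two masks can be done bytewise on the packed form,
# because the padding bits of both operands are zero.
other = np.packbits(np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool))
combined = np.unpackbits(packed | other, count=len(values)).astype(bool)
```

Whether the saved memory outweighs the unpacking step depends on which operations dominate, which is the part that needs benchmarking.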

@jorisvandenbossche jorisvandenbossche added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action ExtensionArray Extending pandas with custom dtypes or arrays. labels Jan 24, 2020
@jbrockmendel
Member

xref #21839.

I asked @seberg if numpy had thought about implementing this directly and got the impression it had been discussed but never as a priority. I'm sure we could implement this here, but it definitely seems like a better fit for numpy.

Does pyarrow have an implementation? (or will it in the not-too-distant future?)

@seberg
Contributor

seberg commented Jan 25, 2020

The issue with doing this within NumPy is simple: NumPy has a strided memory layout, which implies byte-sized strides to reach elements. To support a bit-packed array we would have to ensure that everything inside NumPy can somehow deal with sub-byte strides (when the dtype or a flag requests it).
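
As a small illustration of the point about strides, the snippet below prints the strides of a boolean array and of two views of it. Strides are whole numbers of bytes, so a one-bit element (an eighth of a byte) has no way to be expressed in this scheme.

```python
import numpy as np

mask = np.zeros(16, dtype=bool)

print(mask.itemsize)       # 1 (byte per element)
print(mask.strides)        # (1,)
print(mask[::2].strides)   # (2,)
print(mask[::-1].strides)  # (-1,)
```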

It could be interesting to explore, but it seems likely to be a can of worms, especially since strides are somewhat public API. Making a bitarray class that behaves mostly like an ndarray may be better/easier, although I am not sure how complete it has to be to be good enough for this use case.

@jorisvandenbossche
Member Author

Also, for pandas we "just" need a 1D bitarray, which I assume will make it a lot simpler than the full multidimensional requirements for numpy.


In #22238, @WillAyd used https://github.com/ilanschnell/bitarray
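
For a sense of how small a 1D-only container could be, here is a minimal sketch built on `np.packbits`. The class name `BitMask1D` and its methods are hypothetical, invented for illustration; this is not the code from the linked PRs, the bitarray library's API, or anything in pandas.

```python
import numpy as np


class BitMask1D:
    """Hypothetical 1D bit-packed mask; not a pandas or numpy API."""

    def __init__(self, mask):
        mask = np.asarray(mask, dtype=bool)
        if mask.ndim != 1:
            raise ValueError("only 1D masks are supported")
        self._length = mask.shape[0]
        # packbits stores the first element in the most significant bit.
        self._bits = np.packbits(mask)

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        # Scalar integer access only; slicing is left out of this sketch.
        if i < 0:
            i += self._length
        if not 0 <= i < self._length:
            raise IndexError("index out of range")
        byte, offset = divmod(i, 8)
        return bool((self._bits[byte] >> (7 - offset)) & 1)

    def to_numpy(self):
        # Expand back to a boolean ndarray for numpy indexing.
        return np.unpackbits(self._bits, count=self._length).astype(bool)

    def any(self):
        # Padding bits are zero, so the packed bytes answer this directly.
        return bool(self._bits.any())

    @property
    def nbytes(self):
        return self._bits.nbytes


m = BitMask1D([False, True, False, False, False, False, False, False, True, False])
print(len(m), m[1], m[-2], m.nbytes)  # 10 True True 2
```

Slicing, setting values, and the bitwise operators are all omitted here, and those are where most of the real design work would be.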

@jreback
Contributor

jreback commented Jan 25, 2020

And I made it work here: #25415
