lazy array attributes #27

jreback · 2016-09-21T10:22:05Z

IIRC this is from the design docs, but I wanted to open an issue so we remember it. We want to have a set of lazily computed array attributes. Sometimes these can be set at creation time based on the creation method / dtype. If the array is immutable then these are not affected by indexing checks.

immutability / read-only, xref block mutation of read-only array in series pandas-dev/pandas#14359
unique
is_monotonic*
has_nulls
is_hashable - only relevant for non-homogeneous dtypes, usually true on object dtypes (but NOT if the elements are mutable). The issue is that this can currently be expensive to figure out (as you need to iterate over the array and call hash on each element). xref

e.g. imagine a pd.date_range(....., ...): unique, monotonic, and has_nulls are then trivial to compute at creation time. Since this is currently an Index in pandas, it is immutable by definition.
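A minimal sketch of the idea, in plain Python rather than pandas internals. The names `LazyArray` and `int_range` are hypothetical and exist only for illustration: each attribute is computed on first access and then cached, while a constructor that already knows the answer (here a range builder standing in for something like a date range) seeds the cache up front.

```python
from functools import cached_property


class LazyArray:
    """Hypothetical read-only array wrapper; not a pandas API."""

    def __init__(self, values, **known):
        self._values = tuple(values)  # a tuple keeps the storage immutable
        # cached_property is a non-data descriptor, so a value stored in the
        # instance __dict__ under the same name is returned without computing.
        for name, value in known.items():
            self.__dict__[name] = value

    @cached_property
    def is_unique(self):
        # Assumes hashable elements.
        return len(set(self._values)) == len(self._values)

    @cached_property
    def is_monotonic_increasing(self):
        v = self._values
        return all(a <= b for a, b in zip(v, v[1:]))

    @cached_property
    def has_nulls(self):
        # None stands in for a missing value in this sketch.
        return any(x is None for x in self._values)


def int_range(start, stop, step=1):
    """Hypothetical constructor whose attributes are known at creation time."""
    if step <= 0:
        raise ValueError("this sketch only handles positive steps")
    return LazyArray(
        range(start, stop, step),
        is_unique=True,
        is_monotonic_increasing=True,
        has_nulls=False,
    )
```

Because the storage is immutable here, the cached values never need to be invalidated.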

xref pandas-dev/pandas#12272, pandas-dev/pandas#14266


chris-b1 · 2016-09-21T17:32:55Z

API question - what does it look like to opt-in to one of these checks? As a specific example, I've used this "optimization" a few times to speed up merges on a monotonic column.

a.merge(b, on='sorted_col')

# takes advantage of monotonicity
a.set_index('sorted_col').join(b.set_index('sorted_col'))

What should that look like? It could be something like this, although maybe it should be hidden even further as an "advanced API" to avoid too many parameters on basic functions?

a.merge(b, on='sorted_col', check_monotonicity=True)

check_monotonicity= {'infer' | True | False}
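To make the question concrete, here is a sketch in plain Python (lists of `(key, value)` rows, not pandas) of how such a flag could dispatch between a hash join and a sort-merge join. The meaning given to each option is an assumption, since the thread does not define them: `False` skips the check, `'infer'` checks and picks the faster path when possible, and `True` requires sorted input and raises otherwise. The function names are hypothetical.

```python
from collections import defaultdict


def _is_monotonic(rows):
    keys = [k for k, _ in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def _hash_join(left, right):
    # General path: build a lookup table on the right side.
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def _merge_join(left, right):
    # Fast path: both inputs are sorted by key, so one forward pass suffices.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row carrying the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[jj][1]) for jj in range(j, j_end))
                i += 1
            j = j_end
    return out


def merge(left, right, check_monotonicity="infer"):
    """Inner join on the first tuple element (hypothetical API)."""
    if check_monotonicity is False:
        return _hash_join(left, right)
    both_sorted = _is_monotonic(left) and _is_monotonic(right)
    if both_sorted:
        return _merge_join(left, right)
    if check_monotonicity is True:
        raise ValueError("check_monotonicity=True requires sorted keys")
    return _hash_join(left, right)  # 'infer' falls back silently
```

For sorted inputs both paths return the same rows in the same order, so the flag only affects performance, not results.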

wesm · 2016-09-21T18:13:19Z

Things like monotonicity are so cheap to check, and provide such significant performance benefits when they are known, that I would support always checking when it may be advantageous.

These attributes can be cached and invalidated whenever the array is mutated (we'd need a "dirty" flag to indicate that any cached array statistics have to be recomputed).
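A rough sketch of the caching scheme in plain Python. The class `CachedStatsArray` and its layout are assumptions for illustration only: writes set a dirty flag, and the next attribute read clears the cache before recomputing.

```python
class CachedStatsArray:
    """Hypothetical mutable array with cached statistics; not a pandas API."""

    def __init__(self, values):
        self._values = list(values)
        self._stats = {}      # cached attribute values, keyed by name
        self._dirty = False   # set on every mutation

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        self._values[i] = value
        self._dirty = True    # cheap: invalidation is deferred to the next read

    def _stat(self, name, compute):
        if self._dirty:
            self._stats.clear()
            self._dirty = False
        if name not in self._stats:
            self._stats[name] = compute(self._values)
        return self._stats[name]

    @property
    def is_monotonic_increasing(self):
        return self._stat(
            "is_monotonic_increasing",
            lambda v: all(a <= b for a, b in zip(v, v[1:])),
        )

    @property
    def is_unique(self):
        # Assumes hashable elements.
        return self._stat("is_unique", lambda v: len(set(v)) == len(v))
```

Setting a flag on write keeps mutation cheap; the cost of recomputing is paid only if someone asks for a statistic again.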

llllllllll · 2016-10-11T23:36:34Z

Regarding immutability: what should happen if a user creates a series from an immutable array, then later sets the array to mutable and mutates it? I think a valid answer is "don't do that", but it should be explicitly defined. If that should be supported behavior, you could forward immutability checks down to the underlying storage's own check each time. The small indirection shouldn't be too expensive, but I don't know if you can cache that.

wesm · 2016-10-12T02:30:45Z

@llllllllll when you create a pandas.Series from a pandas.Array you are actually obtaining a view on that array, so if the source array mutates itself, it triggers copy-on-write (because it observes that its use count is > 1). So this will be a non-issue.
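The mechanism can be sketched in plain Python as follows. The `Array` and `_Buffer` classes are hypothetical stand-ins, and the use count is tracked by hand (and never decremented when a view is discarded), which a real implementation would handle through proper reference counting.

```python
class _Buffer:
    """Shared storage with a manual use count (assumption for this sketch)."""

    def __init__(self, data):
        self.data = list(data)
        self.use_count = 0


class Array:
    """Hypothetical array with copy-on-write views; not a pandas API."""

    def __init__(self, data=(), _buffer=None):
        self._buf = _buffer if _buffer is not None else _Buffer(data)
        self._buf.use_count += 1

    def view(self):
        # A view shares the buffer instead of copying it.
        return Array(_buffer=self._buf)

    def __getitem__(self, i):
        return self._buf.data[i]

    def __setitem__(self, i, value):
        if self._buf.use_count > 1:
            # Someone else is looking at this buffer: detach and copy first.
            self._buf.use_count -= 1
            self._buf = _Buffer(self._buf.data)
            self._buf.use_count = 1
        self._buf.data[i] = value
```

With this scheme a view never observes a later write to its source, so attributes cached on the view stay valid.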

jreback added the performance label Sep 30, 2016

shoyer mentioned this issue Oct 18, 2016

DESIGN: NA values in floating point arrays #46

Open

jbrockmendel mentioned this issue Feb 15, 2018

Implement maybe_cache for compat between immutable/mutable classes pandas-dev/pandas#19709

Closed

4 tasks
