This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

lazy array attributes #27

Open
jreback opened this issue Sep 21, 2016 · 4 comments


@jreback

jreback commented Sep 21, 2016

IIRC this is from the design docs, but I wanted to make an issue so we remember it. We want to have a set of lazily computed array attributes. Sometimes these can be set at creation time based on the creation method / dtype. If the array is immutable then these are not affected by indexing checks.

  • immutability / read-only, xref block mutation of read-only array in series pandas-dev/pandas#14359
  • unique
  • is_monotonic*
  • has_nulls
  • is_hashable - only relevant for non-homogeneous dtypes; usually true for object dtypes (but NOT if the elements are mutable). The issue is that this can currently be expensive to figure out (as you need to iterate over the array and call hash on each element). xref

e.g. imagine a pd.date_range(....., ...): unique, monotonic, and has_nulls are then trivial to compute at creation time. Since this is currently an Index in pandas, it is immutable by definition.
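A minimal sketch of the idea, assuming a hypothetical `LazyArray` wrapper and a hypothetical `int_range` constructor standing in for something like `date_range` (neither is an existing pandas class or function). Attributes the constructor already knows are seeded at creation time; everything else is computed on first access and then cached.

```python
from functools import cached_property


class LazyArray:
    """Hypothetical immutable array wrapper; not a pandas class."""

    def __init__(self, values, **known):
        self._values = tuple(values)  # a tuple keeps the storage immutable
        # Attributes known at creation time go straight into the instance
        # __dict__, which is where cached_property looks before computing.
        self.__dict__.update(known)

    @cached_property
    def is_unique(self):
        return len(set(self._values)) == len(self._values)

    @cached_property
    def is_monotonic_increasing(self):
        v = self._values
        return all(a <= b for a, b in zip(v, v[1:]))

    @cached_property
    def has_nulls(self):
        return any(x is None for x in self._values)


def int_range(start, stop):
    """Hypothetical constructor: a range is unique, sorted, and null-free."""
    return LazyArray(
        range(start, stop),
        is_unique=True,
        is_monotonic_increasing=True,
        has_nulls=False,
    )
```

Because the storage is immutable, the cached values never need to be invalidated.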

xref pandas-dev/pandas#12272, pandas-dev/pandas#14266

@chris-b1

API question - what does it look like to opt-in to one of these checks? As a specific example, I've used this "optimization" a few times to speed up merges on a monotonic column.

a.merge(b, on='sorted_col')

# takes advantage of monotonicity
a.set_index('sorted_col').join(b.set_index('sorted_col'))

What should that look like? Could be something like this, although maybe should be even more hidden as "advanced api" to avoid too many parameters on basic functions?

a.merge(b, on='sorted_col', check_monotonicity=True)

check_monotonicity= {'infer' | True | False}
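To make the question concrete, here is a pure-Python sketch of one possible reading of the flag, not a pandas API. The function names and the exact semantics are assumptions: `'infer'` checks the keys and picks the sorted path when it can, `True` requires sorted keys and raises otherwise, and `False` skips the check entirely.

```python
def is_sorted(keys):
    return all(a <= b for a, b in zip(keys, keys[1:]))


def hash_join(left, right):
    """Inner join of (key, value) pairs via a lookup table."""
    table = {}
    for k, v in right:
        table.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, lv in left for rv in table.get(k, [])]


def merge_join(left, right):
    """Inner join of two key-sorted lists of (key, value) pairs."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], rv) for _, rv in right[j:j_end])
                i += 1
            j = j_end
    return out


def merge(left, right, check_monotonicity="infer"):
    """Hypothetical merge with an opt-in monotonicity check."""
    if check_monotonicity is False:
        return hash_join(left, right)
    both_sorted = is_sorted([k for k, _ in left]) and is_sorted([k for k, _ in right])
    if both_sorted:
        return merge_join(left, right)
    if check_monotonicity is True:
        raise ValueError("keys are not monotonic")
    return hash_join(left, right)  # 'infer' falls back silently
```

With sorted inputs both paths return the same rows in the same order, so the flag only changes how the result is computed.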

@wesm
Owner

wesm commented Sep 21, 2016

Things like monotonicity are so cheap to check and provide such significant performance benefits when they are known, that I would support always checking when it may be advantageous.

These attributes can be cached and invalidated whenever the array is mutated (we'd have to have a "dirty" flag to indicate that any cached array statistics need to be recomputed).
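A rough sketch of the dirty-flag approach, using a hypothetical `CachedStats` class (the class and its method names are illustrative assumptions, not part of any existing design). Mutation only sets the flag; the cache is cleared lazily the next time a statistic is requested.

```python
class CachedStats:
    """Hypothetical mutable array with cached, invalidatable statistics."""

    def __init__(self, values):
        self._values = list(values)
        self._cache = {}
        self._dirty = False

    def __setitem__(self, index, value):
        self._values[index] = value
        self._dirty = True  # cheap: no recomputation happens here

    def _stat(self, name, compute):
        if self._dirty:
            self._cache.clear()
            self._dirty = False
        if name not in self._cache:
            self._cache[name] = compute(self._values)
        return self._cache[name]

    @property
    def is_monotonic_increasing(self):
        return self._stat(
            "is_monotonic_increasing",
            lambda v: all(a <= b for a, b in zip(v, v[1:])),
        )

    @property
    def has_nulls(self):
        return self._stat("has_nulls", lambda v: any(x is None for x in v))
```

A finer-grained design could invalidate only the statistics a given write can affect; this sketch drops the whole cache for simplicity.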

@llllllllll

Regarding immutability: what should happen if a user creates a series from an immutable array, then later sets the array to mutable and mutates it? I think a valid answer is "don't do that", but it should be explicitly defined. If that should be supported behavior, you could forward checks for immutability down to the underlying storage's check each time. The small indirection shouldn't be too expensive, but idk if you can cache that.
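The forwarding option could look roughly like this. `Storage` and `SeriesLike` are hypothetical stand-ins, and the `writeable` attribute is an assumed name for the storage's mutability flag.

```python
class Storage:
    """Hypothetical underlying buffer with a toggleable write flag."""

    def __init__(self, data, writeable=False):
        self.data = list(data)
        self.writeable = writeable


class SeriesLike:
    """Hypothetical container that never caches the immutability answer."""

    def __init__(self, storage):
        self._storage = storage

    @property
    def is_immutable(self):
        # Forward to the storage on every access, so a later change to
        # the flag is always observed.
        return not self._storage.writeable
```

The cost is one extra attribute lookup per check, in exchange for never reporting a stale answer.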

@wesm
Owner

wesm commented Oct 12, 2016

@llllllllll when you create a pandas.Series from a pandas.Array you are actually obtaining a view on that array, so if the source array mutates itself, it triggers copy-on-write (because it observes that its use count is > 1). So this will be a non-issue.
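A toy illustration of that copy-on-write behavior, with hypothetical `Buffer` and `Array` classes and an explicit `use_count` field (assumptions for the sketch, not the actual design). A writer that shares its buffer with another holder copies it first, so the view keeps seeing the original data.

```python
class Buffer:
    """Hypothetical shared storage with an explicit use count."""

    def __init__(self, data):
        self.data = list(data)
        self.use_count = 0


class Array:
    """Hypothetical array whose writes copy the buffer when it is shared."""

    def __init__(self, data):
        self._buf = Buffer(data)
        self._buf.use_count = 1

    def view(self):
        other = Array.__new__(Array)  # skip __init__: share the buffer
        other._buf = self._buf
        self._buf.use_count += 1
        return other

    def __getitem__(self, index):
        return self._buf.data[index]

    def __setitem__(self, index, value):
        if self._buf.use_count > 1:
            # Someone else holds this buffer: detach and write to a copy.
            self._buf.use_count -= 1
            self._buf = Buffer(self._buf.data)
            self._buf.use_count = 1
        self._buf.data[index] = value
```

Once the writer has detached, each holder owns its buffer exclusively and further writes happen in place.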


4 participants