lazy array attributes #27
Comments
API question - what does it look like to opt-in to one of these checks? As a specific example, I've used this "optimization" a few times to speed up merges on a monotonic column.
What should that look like? Could be something like this, although maybe should be even more hidden as "advanced api" to avoid too many parameters on basic functions?
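A minimal sketch of one possible opt-in shape, written in plain Python rather than against any real pandas API. The function name `lookup_positions` and the keyword `assume_monotonic` are hypothetical and are not taken from the snippet originally attached to this comment. The idea is that the caller vouches for monotonicity, so the function can take a binary-search path without first scanning the column.

```python
from bisect import bisect_left


def lookup_positions(haystack, needles, assume_monotonic=False):
    """Return the first position of each needle in haystack, or -1 if absent.

    `assume_monotonic` is a hypothetical opt-in flag: when True, the caller
    promises that `haystack` is sorted in non-decreasing order, so no check
    is performed and a binary search is used.
    """
    if assume_monotonic:
        positions = []
        for needle in needles:
            i = bisect_left(haystack, needle)
            found = i < len(haystack) and haystack[i] == needle
            positions.append(i if found else -1)
        return positions

    # General path: build a hash table of first occurrences.
    first_seen = {}
    for i, value in enumerate(haystack):
        first_seen.setdefault(value, i)
    return [first_seen.get(needle, -1) for needle in needles]
```

Keeping the flag keyword-only or moving it to a separate "advanced" entry point would be one way to avoid cluttering the basic signature, which is the concern raised above.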
Things like monotonicity are so cheap to check and provide such significant performance benefits when they are known that I would support always checking when it may be advantageous. These attributes can be cached and invalidated whenever the array is mutated (we'd have to have a "dirty" flag to indicate that any cached array statistics need to be recomputed).
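The caching-plus-invalidation idea can be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: the class name `CachedStatsArray` is hypothetical, the storage is a plain Python list, and clearing the cache dictionary plays the role of the "dirty" flag.

```python
class CachedStatsArray:
    """Toy array that caches statistics and drops them on mutation."""

    def __init__(self, values):
        self._values = list(values)
        self._cache = {}  # an empty cache means "dirty": recompute on demand

    def __getitem__(self, index):
        return self._values[index]

    def __setitem__(self, index, value):
        self._values[index] = value
        # Any write may change the statistics, so invalidate all of them.
        self._cache.clear()

    @property
    def is_monotonic_increasing(self):
        key = "is_monotonic_increasing"
        if key not in self._cache:
            v = self._values
            self._cache[key] = all(a <= b for a, b in zip(v, v[1:]))
        return self._cache[key]
```

For example, `CachedStatsArray([1, 2, 3]).is_monotonic_increasing` is computed once and then served from the cache until an element is assigned.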
Regarding immutability: What should happen if a user creates a series from an immutable array, and then later sets the array to mutable and mutates it? I think a valid answer is "don't do that", but it should be explicitly defined. If that should be supported behavior you could forward checks to
@llllllllll when you create a pandas.Series from a pandas.Array you are actually obtaining a view on that array, so if the source array mutates itself, it triggers copy-on-write (because it observes that its use count is > 1). So this will be a non-issue.
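A rough sketch of the copy-on-write behavior described above, assuming a shared buffer that tracks how many arrays or views refer to it. The names `Buffer`, `CowArray`, and `use_count` are hypothetical and only illustrate the mechanism; they do not describe the actual pandas design.

```python
class Buffer:
    """Shared storage plus a count of the arrays/views that reference it."""

    def __init__(self, data):
        self.data = list(data)
        self.use_count = 0


class CowArray:
    def __init__(self, data=None, _buffer=None):
        self._buf = _buffer if _buffer is not None else Buffer(data)
        self._buf.use_count += 1

    def view(self):
        # A view shares the same buffer and bumps its use count.
        return CowArray(_buffer=self._buf)

    def __getitem__(self, index):
        return self._buf.data[index]

    def __setitem__(self, index, value):
        if self._buf.use_count > 1:
            # Someone else still sees this buffer: detach onto a private copy.
            self._buf.use_count -= 1
            self._buf = Buffer(self._buf.data)
            self._buf.use_count = 1
        self._buf.data[index] = value
```

With this scheme, a view taken before a mutation keeps seeing the old values, so any statistics cached for that view remain valid.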
IIRC this is from the design docs, but wanted to make an issue to remember. We want to have a set of lazily computed array attributes. Sometimes these can be set at creation time based on the creation method / dtype. If the array is immutable then these are not affected by indexing checks.
- `immutability` / read-only, xref "block mutation of read-only array in series" (pandas-dev/pandas#14359)
- `unique`
- `is_monotonic*`
- `has_nulls`
- `is_hashable` - only on non homogeneous dtype, usually true on `object` dtypes (but NOT if they are mutable). The issue is that this can currently be expensive to figure out (as you need to iterate over and call `hash` on each element). xref

e.g. imagine a `pd.date_range(....., ...)`, then `unique`, `monotonic`, `has_nulls` are trivial to compute at creation time. Since this is currently an `Index` in pandas it is `immutable` by-definition.

xref pandas-dev/pandas#12272, pandas-dev/pandas#14266
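The attributes listed above could be modeled along these lines. This is only a sketch under assumptions: `LazyAttrArray`, `from_range`, and the `known` argument are hypothetical names, the storage is an immutable tuple, and a plain integer range stands in for something like a date range whose properties are known from the way it was constructed.

```python
class LazyAttrArray:
    """Immutable toy array whose attributes are computed lazily, or preset."""

    def __init__(self, values, known=None):
        self._values = tuple(values)     # immutable, so cached answers stay valid
        self._known = dict(known or {})  # attributes already known at creation

    @classmethod
    def from_range(cls, start, stop, step=1):
        if step <= 0:
            raise ValueError("this sketch only handles a positive step")
        # A strictly increasing range is unique, monotonic and null-free
        # by construction, so nothing ever needs to be scanned.
        known = {"is_unique": True, "is_monotonic_increasing": True,
                 "has_nulls": False}
        return cls(range(start, stop, step), known=known)

    def _lazy(self, name, compute):
        if name not in self._known:
            self._known[name] = compute()
        return self._known[name]

    @property
    def has_nulls(self):
        return self._lazy("has_nulls",
                          lambda: any(v is None for v in self._values))

    @property
    def is_unique(self):
        return self._lazy("is_unique",
                          lambda: len(set(self._values)) == len(self._values))

    @property
    def is_monotonic_increasing(self):
        def compute():
            v = self._values
            return not self.has_nulls and all(a <= b for a, b in zip(v, v[1:]))
        return self._lazy("is_monotonic_increasing", compute)

    @property
    def is_hashable(self):
        # The expensive check mentioned above: call hash() on every element.
        def compute():
            try:
                for v in self._values:
                    hash(v)
            except TypeError:
                return False
            return True
        return self._lazy("is_hashable", compute)
```

An array built with `LazyAttrArray.from_range(0, 5)` answers `is_unique`, `is_monotonic_increasing`, and `has_nulls` without touching its values, while one built from arbitrary objects pays for each check once, on first access. Note that `is_unique` in this sketch assumes the elements are hashable.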