Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation on custom indexes #6975

Merged
merged 18 commits into from
Jul 17, 2023
Merged

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Sep 1, 2022

This PR documents the API of the Index base class and adds a guide for creating custom indexes (reworked from https://hackmd.io/Zxw_zCa7Rbynx_iJu6Y3LA). Hopefully it will help anyone experimenting with this feature.

@pydata/xarray your feedback would be very much appreciated! I've been into this for quite some time, so there may be things that seem obvious to me but that you can still find very confusing or non-intuitive. It would then deserve some extra or better explanation.

More specifically, I'm open to any suggestion on how to better illustrate this with clear and succinct examples.

There are other parts of the documentation that still need to be updated regarding the indexes refactor (e.g., "dimension" coordinates, xindexes property, set/drop indexes, etc.). But I suggest to do that in separate PRs and focus here on creating custom indexes.

@benbovy benbovy marked this pull request as ready for review September 2, 2022 13:24
xarray/core/indexes.py Outdated Show resolved Hide resolved
@benbovy benbovy mentioned this pull request Sep 5, 2022
49 tasks
@benbovy
Copy link
Member Author

benbovy commented Sep 7, 2022

I'm open to any suggestion on how to better illustrate this with clear and succinct examples.

Maybe an inefficient "numpy" index with basic lookup (like in #3925 (comment)) would be a good example?

@dcherian
Copy link
Contributor

dcherian commented Sep 9, 2022

an inefficient "numpy" index with basic lookup

yes! I used this recently to describe what an index does. I think most people are familiar with the argmin way

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

I have just minor suggestions

@@ -461,6 +461,21 @@
CFTimeIndex.values
CFTimeIndex.year

Index.from_variables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's just make it public?

doc/internals/how-to-create-custom-index.rst Outdated Show resolved Hide resolved
doc/internals/how-to-create-custom-index.rst Outdated Show resolved Hide resolved
doc/internals/how-to-create-custom-index.rst Outdated Show resolved Hide resolved
doc/internals/how-to-create-custom-index.rst Outdated Show resolved Hide resolved
xarray/core/indexes.py Outdated Show resolved Hide resolved
raise NotImplementedError(
f"{self!r} doesn't support alignment with inner/outer join method"
)

def reindex_like(self: T_Index, other: T_Index) -> dict[Hashable, Any]:
"""Query the index with another index of the same type.

Implementation is optional but required in order to support alignment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to be specific about types of alignment here. Do we also need to mention reindexing?

xarray/core/indexes.py Outdated Show resolved Hide resolved
xarray/core/indexes.py Outdated Show resolved Hide resolved
xarray/core/indexes.py Outdated Show resolved Hide resolved
@TomNicholas
Copy link
Member

TomNicholas commented Sep 14, 2022

pydata/xarray your feedback would be very much appreciated! I've been into this for quite some time, so there may be things that seem obvious to me but that you can still find very confusing or non-intuitive. It would then deserve some extra or better explanation.

I find the way in which the Index objects are involved in a method call (e.g. sel) pretty opaque still. (Bear in mind I don't think I've ever contributed a PR to xarray that touched indexes.py or indexing.py 😅)

  • When is my CustomIndex object consulted? Is it potentially for all basic operations (concat, join, align, indexing, etc?) If I passed some kind of NotImplementedIndex which only defined .from_variables, what functionality would be left in xarray?
  • Why does there not need to be an index (in .indexes) if I do indexing with e.g. .isel but have no coordinates?
  • If I index along two dimensions simultaneously (.isel(x=1, y=2)) does that correspond to two separate Index consultations?
  • How is an xarray.PandasIndex different from a pd.Index?
  • When I load the "air" tutorial data and it shows a Float64Index and DateTime64Index, where did they come from?
  • Finally it's not great that to explain Index objects we have to assume the user knows what xarray.Variable is, but Variable is still not really public API, and certainly isn't documented as comprehensively as DataArray and Dataset are.

Perhaps these questions are too specific to xarray's internals but I do think there should be some kind of mental model given as to the role the Index objects play. (This could be a white lie, similar to how our page on data structures says that Dataset objects contain DataArray objects when actually they technically don't.)

I also think we need multiple simple examples. How about

  • HelloWorldIndex, that just prints "I'm calling HelloWorldIndex.isel !" etc.
  • PeriodicBoundaryIndex (a partial/simpler implementation of Periodic Boundary Index #7031),
  • A simple functionally-derived index, that consults a dynamically-called exponential function or something.
  • Some example of a custom multi-index, perhaps some kind of 2D lat-lon thing? Or how about something that represents 2D image distortion, like using it creates a fisheye effect?

I find it helps to think of documentation using this 4-part system. This PR should cover "Explanation" pretty well, but we should still aim for other content to better cover "Tutorial", "How-to Guides", and "Reference". "Tutorial" could be like a notebook walking through creating a simple index (e.g. PeriodicIndex) from scratch, explaining and fixing errors as they arise (like @dcherian did for apply_ufunc). "How-to Guides" might be specific to other more advanced custom index examples. I guess "Reference" here is just having really clear docstrings on the possible methods of Index somewhere (we can't really do that with an ABC though can we?).

That all said, this is already a great start!

@benbovy
Copy link
Member Author

benbovy commented Sep 14, 2022

Thanks @dcherian and @TomNicholas for your feeback!

@dcherian I will reply to your inline comments when I'll integrate your suggestions in this PR.

@TomNicholas I answer to your comments below.

Bear in mind I don't think I've ever contributed a PR to xarray that touched indexes.py or indexing.py

That's exactly why your feedback is valuable!

When is my CustomIndex object consulted? Is it potentially for all basic operations (concat, join, align, indexing, etc?)

I agree this could be detailed more in the Index API docstrings in a consistent way. For some methods like equals, join and reindex_like it could be called in a lot of places, basically everything that relies on object alignment.

Why does there not need to be an index (in .indexes) if I do indexing with e.g. .isel but have no coordinates?

We should clarify that the aim of Index objects is to make more efficient all the operations made in the (discrete or continuous) space defined by the coordinate labels. That space is distinct from the discrete space defined by array element locations. All operations made in the latter space don't require any index.

Some Index API like Index.isel suggest otherwise, but those methods are rather for convenience, i.e., avoid users having to rebuild an index from scratch when it could be easily built from the existing one.

How is an xarray.PandasIndex different from a pd.Index?

I've tried to explain it in the "Index base class" section and the sections below, but maybe it should be emphasized more?

When I load the "air" tutorial data and it shows a Float64Index and DateTime64Index, where did they come from?

I guess you mean it is shown through ds.indexes?

ds.xindexes (vs. ds.indexes) still needs to be added in the docs (in a later PR?), which hopefully will address your concern here.

Finally it's not great that to explain Index objects we have to assume the user knows what xarray.Variable is, but Variable is still not really public API, and certainly isn't documented as comprehensively as DataArray and Dataset are.

I agree, although Variable is already documented in the "Xarray internals" section. Maybe Index also deserves its own entry there, where we could explain what indexes are, how they are different from variables (coordinates), how they are used or accessed in Xarray, etc.

Overall, I think that the whole "Xarray Internals" section could be streamlined beyond a bunch of loosely-coupled document pages.

I also think we need multiple simple examples.

I agree that we need more examples, but I also think that too much examples may tend to make things more confused.

One thing that I like very much in https://fastapi.tiangolo.com/ is how a small example is picked for each tutorial and then is shown by highlighting the relevant code for every subsection. Is it possible to do that with Sphinx / RST?

It's hard to show all features through one succinct example, though. Like @dcherian says in #6975 (comment), we could invite people to look into the PandasIndex and PandasMultiIndex code for more details. My hope is that there will be more real examples (multi-coordinate, multi-dimensions) available in the future.

@TomNicholas
Copy link
Member

TomNicholas commented Sep 14, 2022

We should clarify that the aim of Index objects is to make more efficient all the operations made in the (discrete or continuous) space defined by the coordinate labels. That space is distinct from the discrete space defined by array element locations.

I think this should be one of the first things said. It defines what all the following discussion of Indexes does and does not affect.

I've tried to explain it in the "Index base class" section and the sections below, but maybe it should be emphasized more?

Yeah I think you do actually have that one covered, I just included it as another example of a naive question that everyone will have that is worth heading off very explicitly.

When I load the "air" tutorial data and it shows a Float64Index and DateTime64Index, where did they come from?

I guess you mean it is shown through ds.indexes?

ds.xindexes (vs. ds.indexes) still needs to be added in the docs (in a later PR?), which hopefully will address your concern here.

I meant like when did these indexes get automatically built? (Presumably on coordinate assignment)

Maybe Index also deserves its own entry there, where we could explain what indexes are, how they are different from variables (coordinates), how they are used or accessed in Xarray, etc.

1000% yes we need a page that explains what Index objects are, what they do, and how they work, and how they are handled automatically by default. This is pre-requisite knowledge (which apparently I don't have 😅 ) before trying to build your own custom index.

Overall, I think that the whole "Xarray Internals" section could be streamlined beyond a bunch of loosely-coupled document pages.

Probably, but having a loosely coupled page for each aspect of the internals would be a good initial aim.

I agree that we need more examples, but I also think that too much examples may tend to make things more confused.

That's why I like the "Explanation" vs "How-to" vs "Tutorials" distinction: use minimal code in the "Explanation" section (this PR) but put multiple more complex examples under "How to create a functionally-derived index", "how-to create a lazy index" etc.

Is it possible to do that with Sphinx / RST?

No idea, but that does look cool!


@classmethod
def from_variables(cls, variables: Mapping[Any, Variable]) -> Index:
def from_variables(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tangential but couldn't this be decorated to make it an abstract (class) method? Then an error would be raised if subclasses don't implement it. i.e. this:

from abc import ABC, abstractmethod

class Index(ABC):
     @classmethod
     @abstractmethod
     def from_variables(cls, vars):
         ...

     def sel(self, indexers):
         raise NotImplementedError()
 

@jsignell
Copy link
Contributor

Just a note that there are still some unmerged suggestions that could easily be incorporated before merge.

@TomNicholas
Copy link
Member

@benbovy do you mind if we merge this and leave the comments to address in a follow-up PR? At SciPy we met some advanced developers from other fields (core devs of napari and astropy) who are very interested in creating functional indexes for their use cases, so it would be nice to merge this.

@TomNicholas TomNicholas mentioned this pull request Jul 17, 2023
4 tasks
@benbovy
Copy link
Member Author

benbovy commented Jul 17, 2023

@TomNicholas I don't mind at all! Let's merge this even though it is not perfect. I'll improve it and address the comments made here in a follow-up PR. I'll also add Index and Coordinates (once #7368 is merged) sections in the internals page added in #7991.

@benbovy
Copy link
Member Author

benbovy commented Jul 17, 2023

advanced developers from other fields (core devs of napari and astropy) who are very interested in creating functional indexes for their use cases

There is an example of a functional index in this discussion (last item): #7041 (comment)

@TomNicholas TomNicholas enabled auto-merge (squash) July 17, 2023 21:02
@TomNicholas TomNicholas disabled auto-merge July 17, 2023 21:04
@dcherian dcherian merged commit 7234603 into pydata:main Jul 17, 2023
dcherian added a commit to dcherian/xarray that referenced this pull request Jul 24, 2023
…lazy-array

* upstream/main: (153 commits)
  Add HDF5 Section to read/write docs page (pydata#8012)
  [pre-commit.ci] pre-commit autoupdate (pydata#8014)
  Update interpolate_na in dataset.py (pydata#7974)
  improved docstring of to_netcdf (issue pydata#7127) (pydata#7947)
  Expose "Coordinates" as part of Xarray's public API (pydata#7368)
  Core team member guide (pydata#7999)
  join together duplicate entries in the text `repr` (pydata#7225)
  Update copyright year in README (pydata#8007)
  Allow opening datasets with nD dimenson coordinate variables. (pydata#7989)
  Move whats-new entry
  [pre-commit.ci] pre-commit autoupdate (pydata#7997)
  Add documentation on custom indexes (pydata#6975)
  Use variable name in all exceptions raised in `as_variable` (pydata#7995)
  Bump pypa/gh-action-pypi-publish from 1.8.7 to 1.8.8 (pydata#7994)
  New whatsnew section
  Remove future release notes before this release
  Update whats-new.rst for new release (pydata#7993)
  Remove hue_style from plot1d docstring (pydata#7925)
  Add new what's new section (pydata#7986)
  Release summary for v2023.07.0 (pydata#7979)
  ...
@benbovy benbovy deleted the add-index-doc branch August 30, 2023 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants