Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: explicit indexes #2195

Closed
wants to merge 4 commits into from
Closed

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented May 29, 2018

Some utility functions that should be useful for #1603

Still very much a work in progress -- it would be great if someone has time to finish writing any of these in another PR!

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@@ -314,6 +315,71 @@ def __unicode__(self):
return formatting.indexes_repr(self)


def normalize_indexes(indexes, coords, sizes):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(will need =None)

indexes = {} if indexes is None else dict(indexes)

# default indexes
for key in sizes:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow the sizes to be optional, and drawn from the indexes / coords? At the moment they must be supplied in sizes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was imagining calling this function after the Dataset/DataArray is filled in with fully specified xarray.Variable objects, so we would already have determined sizes for all dimensions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, that makes sense

Copy link
Collaborator

@max-sixty max-sixty May 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually (and I now realize from the other PR this is currently a sketch) then we likely want to check the sizes are consistent if they're also supplied in indexes or coords

else:
variable = Variable(dims, data, *args, fastpath=True)

return OrderedDict([(name, variable)])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -314,6 +315,71 @@ def __unicode__(self):
return formatting.indexes_repr(self)


def normalize_indexes(indexes, coords, sizes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to follow this PR up, but not yet sure what this utility function does.

Can you suggest an example usage of this?
What I currently think of is

  • Without MultiIndex
    coords = OrderedDict()
    coords['x'] = Variable(('x', ), [0, 1, 2])
    coords['y'] = Variable(('y', ), ['a', 'b'])
    sizes = {'x': 3, 'y': 2}
    actual = normalize_indexes({}, coords, sizes)
    expected = {'x': coords['x'].to_index(), 'y': coords['y'].to_index()}
  • With MultiIndex
    coords = OrderedDict()
    coords['x'] = Variable(('z', ), [0, 1, 2])
    coords['y'] = Variable(('z', ), ['a', 'b'])
    sizes = {'x': 6}
    actual = normalize_indexes({}, coords, sizes)
    expected = {'x': pd.Index([0, 0, 1, 1, 2, 2]),
                'y': pd.Index(['a', 'b', 'a', 'b', 'a', 'b'])}

Or do we need a MultiIndex for 'z' ?

Copy link
Member Author

@shoyer shoyer Aug 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your "without MultiIndex" example look correct, but I think your "with MultiIndex" case should raise an error, because the dimension "z" has inconsistent sizes.

I was thinking something more like:

    indexes = {
        'x': pandas.Index([0, 0, 1, 1, 2, 2]),
        'y': pandas.Index(['a', 'b', 'a', 'b', 'a', 'b']),
    }
    coords = {
        'x': Variable(('z',), [0, 0, 1, 1, 2, 2]),
        'y': Variable(('z',), ['a', 'b', 'a', 'b', 'a', 'b']),
    }
    sizes = {'z': 6}
    actual = normalize_indexes(indexes, coords, sizes)
    combined_index = pandas.MultiIndex.from_arrays(
        [indexes['x'], indexes['y']], names=['x', 'y'])
    expected = {'x': combined_index, 'y': combined_index, 'z': combined_index}

The resulting object would supporting indexing with any of .sel(x=...), .sel(y=...) or .sel(z=...), all based on combined_index.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@pep8speaks
Copy link

Hello @shoyer! Thanks for updating the PR.

Line 165:80: E501 line too long (85 > 79 characters)

@shoyer shoyer changed the title WIP: utility functions for working with explicit coordinates WIP: explicit indexes Nov 28, 2018
@shoyer
Copy link
Member Author

shoyer commented Nov 28, 2018

I've pushed a few commits with some additional progress here. Here's my current plan:

  1. Automatically create new variables for MultiIndex levels, and store these in Dataset._variables / DataArray._coords. This lets us use the normal variable consistency checks and indexing operations, rather than requiring explicit checks like assert_unique_multiindex_level_names or "virtual variables". This is mostly complete (still under-tested) as of the last commit, but it should be a big win for consistency when using a MultiIndex since the level variables will be built into xarray's data model. The MultiIndex itself will still be an xarray variable (an object array of tuples).
  2. Add explicit indexes, as a map from coordinate names to a pandas.Index or pandas.MultiIndex used for indexing. Note that if a MultiIndex is used, coordinates corresponding to levels will all point to the same MultiIndex, e.g., if MultiIndex foo has levels x and y, indexes will look like {'foo': foo, 'x': foo, 'y': foo}.
  3. Remove/fixup code that assumes that only dimension names are used for indexes. Instead, we look for indexes by looking up coordinate names in .indexes.
  4. Add an xarray specific Index API so we can add our own index classes, e.g., a KDTree. This will happen in a later PR.

@benbovy
Copy link
Member

benbovy commented Mar 21, 2022

I guess we can close this one now that #5692 is merged?

@shoyer shoyer closed this Mar 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants