Adding image driver and restructuring #25

Closed
jsignell wants to merge 6 commits

Member

@jsignell jsignell commented Dec 13, 2018

This PR adds an image reader for xarray and restructures intake-xarray so that all plugins share a common structure. In particular, it adds the ability to return a dataset rather than a dataarray via the merge_dim option.

I think this will help justify the existence of an intake-xarray plugin at all, since it can now be used directly and the other plugins are just helpers for using specific readers.

It is likely that this needs lots of work. I rewrote the example notebook to try to explain some of the behavior.

@martindurant
Member

Are you around the coming week to take me through the proposal here?

@philippjfr

Just tried playing with this, but I can't figure out why intake.open_image doesn't exist for me.

@martindurant
Member

@philippjfr, I believe @jsignell is not yet back from holidays. Do you get an explicit error when you try to import ImageSource?

@philippjfr

philippjfr commented Jan 6, 2019

Thanks for letting me know. No error when I try to import ImageSource directly. At some point I should clearly read some of the internals of the plugin system to understand how the function gets registered.

@martindurant
Member

That I can answer: Intake tries to import any package whose name matches intake_*, and looks for subclasses of DataSource at the top level. For each one found, it registers the class under its name class attribute and generates the corresponding open_* function. They should appear in the top-level registry dict, and the functions that do the importing are in source.discovery.
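
A toy model of the registration step described above may help. This is a sketch only and not Intake's actual code: the DataSource stand-in, the register_sources helper, and the registry/openers dicts are all hypothetical names chosen for illustration.

```python
class DataSource:
    """Stand-in for intake's DataSource base class (toy version)."""
    name = None

    def __init__(self, urlpath, **kwargs):
        self.urlpath = urlpath
        self.kwargs = kwargs


class ImageSource(DataSource):
    """Hypothetical plugin class; 'image' is the name it registers under."""
    name = "image"


def register_sources(namespace, registry, openers):
    """Scan a module-like namespace (a dict) for DataSource subclasses.

    Each subclass is stored in ``registry`` under its ``name`` attribute,
    and an ``open_<name>`` factory is stored in ``openers``.
    """
    for obj in namespace.values():
        is_source = (isinstance(obj, type)
                     and issubclass(obj, DataSource)
                     and obj is not DataSource)
        if is_source and obj.name:
            registry[obj.name] = obj
            # Bind obj as a default argument so each opener keeps its own class.
            openers["open_" + obj.name] = (
                lambda *args, _cls=obj, **kwargs: _cls(*args, **kwargs))


registry, openers = {}, {}
# The dict plays the role of a plugin package's top-level namespace.
register_sources({"ImageSource": ImageSource, "helper": print},
                 registry, openers)
src = openers["open_image"]("images/*.png")
```

In this model, an open function only exists for classes that are visible in the scanned top-level namespace and that set a name.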

@jsignell
Member Author

jsignell commented Jan 9, 2019

@martindurant I forgot to mention in our chat that this PR also adds the ability to return datasets using the kwarg merge_dim rather than only dataarrays. I am not sure if that is an overstep for intake or not, but it does seem like a bit of munging that it would be very handy to be able to encode in a catalog.

@martindurant
Member

I would say that adding extra capabilities that may be useful for some is totally in scope, so long as it doesn't add complexity for those that don't need it.

@jsignell jsignell requested a review from martindurant January 9, 2019 16:29
for k, values in field_values.items() if k != self.merge_dim
}

def _open_files(self, files):
Member Author

A lot of logic is in this method, and I'm not sure how to split it up better. Essentially there are four paths through it: no pattern and merge_dim, no pattern and concat_dim, pattern and concat_dim, or pattern and merge_dim. The first two are pretty straightforward, and then they increase in complexity.
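
The four paths can be laid out as a small dispatch sketch. This is illustrative only and is not the code in this PR: choose_branch is a hypothetical helper, and the rule that exactly one of concat_dim and merge_dim is set is an assumption.

```python
def choose_branch(pattern, concat_dim=None, merge_dim=None):
    """Name the path taken for a given combination of options.

    Assumption (not confirmed by the PR): exactly one of
    concat_dim / merge_dim is set.
    """
    if (concat_dim is None) == (merge_dim is None):
        raise ValueError("set exactly one of concat_dim or merge_dim")
    combine = "merge" if merge_dim is not None else "concat"
    if not pattern:
        # No pattern: nothing is parsed from the file names.
        return "no pattern, " + combine
    # Pattern: fields parsed from the file names are involved as well.
    return "pattern, " + combine
```

For example, choose_branch("", merge_dim="band") names the first path and choose_branch("{year}.tif", concat_dim="year") names the third.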

Member

Doesn't look too hairy. Perhaps could do with a comment on each branch, saying what it does, and a docstring with your comment above

Member Author

added comments

Member

@martindurant martindurant left a comment

Overall, I like the restructure and the new image driver. This is blog-worthy!

I have questions around local versus remote files, and comments/questions elsewhere.

I have not gone through the example notebooks yet.

        self._multireader = multireader or xr.open_mfdataset
        super(XarraySource, self).__init__(metadata=metadata)

    def reader(self, filename, chunks, **kwargs):
Member

This feels like an attribute

Member Author

I was trying to set a default while still enforcing that plugins define the reader and multireader. But maybe this isn't the right way...


try:
    import xarray as xr
Member

Can the import be deferred until it is needed? This module gets imported whenever intake itself is imported.

Member Author

Won't that just make these unavailable, which is what we want?

Member

In the case that xarray is available, it'll make the import of intake slower. In the case it isn't, the module won't load, but the exception will get swallowed.
If deferred, the module would load OK, but when the user tries to access the data, then they'll get the message, saying that they need to install something if they want to use that source.
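
A minimal sketch of the deferred-import pattern, assuming a hypothetical LazySource class. It is not part of intake or intake-xarray, and the wording of the error message is made up for illustration.

```python
import importlib


class LazySource:
    """Toy source that imports its backend only when data is requested."""

    def __init__(self, backend="xarray"):
        # Only the module name is stored; nothing is imported here,
        # so constructing the source stays cheap and cannot fail.
        self.backend = backend

    def _load_backend(self):
        try:
            return importlib.import_module(self.backend)
        except ImportError as err:
            # Surface a helpful message at access time instead of
            # silently dropping the plugin at import time.
            raise ImportError(
                "This source needs the '%s' package; please install it."
                % self.backend
            ) from err

    def read(self):
        module = self._load_backend()
        return module.__name__


# json stands in for an installed backend so the sketch runs anywhere.
source = LazySource("json")
```

With this shape, the module holding the source always loads, and a missing dependency only shows up when the user actually calls read().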

'which takes at least filename, and chunks '
'and returns an xarray object')

def multireader(self, filename, chunks, **kwargs):
Member

This one also feels like an attribute.

    def _open_files(self, files):
        das = [self.reader(f, chunks=self.chunks, **self.kwargs)
               for f in files]
        if not self.pattern:
Member

Perhaps some idea of which of these attributes are mutually exclusive

        elif os.path.isfile(filename):
            filenames = [filename]
        else:
            filenames = [filename]
Member

What is the expected use-case here? We want to allow passing a directory?

Member Author

Remote file - most likely url. I added a comment.

http://docs.dask.org/en/latest/array-api.html#dask.array.image.imread
for possible extra arguments.

NOTE: Although ``skimage.io.imread`` is used by default, any reader function which
Member

Should somewhere give an example of how that works.

Member

(or just remove the capability??)

Member Author

The capability came from dask, but yeah, I think you are right. Given how hard it is to write lambda functions in yaml, it is probably better to just use skimage.io.imread until someone asks to be able to use another reader.

Member

I think if it's something that is importable, it's easy to do: !!python/name:mymodule.process. Could be seen as a future enhancement?
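
PyYAML's !!python/name: tag resolves a dotted path to the Python object it names. The stdlib-only sketch below shows that kind of lookup; resolve_dotted_name is a hypothetical helper, not a PyYAML or intake function, and os.path.basename stands in for a user-supplied reader such as mymodule.process.

```python
import importlib


def resolve_dotted_name(dotted):
    """Return the object named by 'package.module.attr' (hypothetical helper)."""
    module_name, _, attr = dotted.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.attr', got %r" % dotted)
    # Import the module part, then look up the final attribute on it.
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# os.path.basename stands in for an importable reader function.
reader = resolve_dotted_name("os.path.basename")
```

Because the function is referenced by name, the catalog only needs a string, which sidesteps the problem of writing a lambda in yaml.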

Member Author

That seems reasonable.

for the file formats supported and possible extra arguments.

NOTE: When reading from OpenDAP URLs do not set the ``chunks`` option to
use provided default chunking.
Member

There is an explanation under the opendap driver that it handles the auth part. This should be stated explicitly here, along with a note that the other driver may be necessary.

Some examples:
- ``s3://data/*.nc``
- ``http://thredds.ucar.edu/thredds/dodsC/grib/FNMOC/WW3/Global_1p0deg/Best``
- ``https://github.com/pydata/xarray-data/blob/master/air_temperature.nc?raw=true``
Member

So does this handle remote URLs directly or not? I assume if it is opendap, then yes, in which case the thing about needing caching is wrong (and in fact won't work).
The thredds URL gives Unrecognized Request for me.

Member Author

Yes, that is terrible wording. I think I meant that OpenDAP URLs can/should be used directly, and all others with caching. You can't GET thredds URLs like that directly, so the 400 is correct.

@@ -16,7 +16,7 @@ def test_discover(source, cdf_source, zarr_source, dataset):
     r = source.discover()

     assert r['datashape'] is None
-    assert r['dtype'] is None
+    assert r['dtype'] == 'float32'
Member

The whole dataset has a single dtype?

@jsignell
Member Author

After a conversation with Martin, this chunk of work seems unreasonably big, so I am going to make a new PR that just adds an ImagePlugin.

@martindurant
Member

Closed in favour of #28. The refactor may become necessary again at a later stage.

@jsignell jsignell deleted the jsignell/image branch January 24, 2019 17:55