-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding image driver and restructuring #25
Conversation
Are you around the coming week to take me through the proposal here? |
Just tried playing with this, but I can't figure out why |
@philippjfr , I believe @jsignell is not yet back from holidays. Maybe you get an explicit error when trying to import ImageSource? |
Thanks for letting me know. No error when I try to import |
That I can answer: Intake tries to import any package with the name |
@martindurant I forgot to mention in our chat that this PR also adds the ability to return datasets using the kwarg |
I would say that adding extra capabilities that may be useful for some is totally in scope, so long as it doesn't ass complexity for those that don't need it. |
for k, values in field_values.items() if k != self.merge_dim | ||
} | ||
|
||
def _open_files(self, files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of logic is in this method. Not sure how to split it up better though. Essentially there are 4 paths through, no pattern and merge_dim, no pattern and concat_dim, pattern and concat_dim, or pattern and merge_dim. The first two are pretty straightforward and then they increase in complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look too hairy. Perhaps could do with a comment on each branch, saying what it does, and a docstring with your comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I like the restructure and the new image driver. This is blog-worthy!
I have questions around local versus remote files, and comments/questions elsewhere.
I have not gone through the example notebooks yet.
intake_xarray/base.py
Outdated
self._multireader = multireader or xr.open_mfdataset | ||
super(XarraySource, self).__init__(metadata=metadata) | ||
|
||
def reader(self, filename, chunks, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like an attribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to set a default while still enforcing that plugins define the reader and multireader. But maybe this isn't the right way...
|
||
try: | ||
import xarray as xr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the import be deferred until it is needed? This module will be imported upon import intake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't that just make these unavailable which is what we want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case that xarray is available, it'll make the import of intake slower. In the case it isn't, the module won't load, but the exception will get swallowed.
If deferred, the module would load OK, but when the user tries to access the data, then they'll get the message, saying that they need to install something if they want to use that source.
intake_xarray/base.py
Outdated
'which takes at least filename, and chunks ' | ||
'and returns an xarray object') | ||
|
||
def multireader(self, filename, chunks, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also
for k, values in field_values.items() if k != self.merge_dim | ||
} | ||
|
||
def _open_files(self, files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look too hairy. Perhaps could do with a comment on each branch, saying what it does, and a docstring with your comment above
def _open_files(self, files): | ||
das = [self.reader(f, chunks=self.chunks, **self.kwargs) | ||
for f in files] | ||
if not self.pattern: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps some idea of which of these attributes are mutually exclusive
elif os.path.isfile(filename): | ||
filenames = [filename] | ||
else: | ||
filenames = [filename] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the expected use-case here? We want to allow passing a directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remote file - most likely url. I added a comment.
http://docs.dask.org/en/latest/array-api.html#dask.array.image.imread | ||
for possible extra arguments. | ||
|
||
NOTE: Although ``skimage.io.imread`` is used by default, any reader function which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should somewhere give an example of how that works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or just remove the capability??)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The capability came from dask, but yeah I think you are right given how hard it is to write lambda functions in yaml it is probably better to just use skimage.io.imread
until someone asks to be able to use another reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if it's that is importable, it's easy to do: !!python/name:mymodule.process
. Could be seen as a future enhancement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable.
intake_xarray/netcdf.py
Outdated
for the file formats supported and possible extra arguments. | ||
|
||
NOTE: When reading from OpenDAP URLs do not set the ``chunks`` option to | ||
use provided default chunking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an explanation under the opendap driver that it handles the auth part - should say here explicitly and that the other driver may be necessary
Some examples: | ||
- ``s3://data/*.nc`` | ||
- ``http://thredds.ucar.edu/thredds/dodsC/grib/FNMOC/WW3/Global_1p0deg/Best`` | ||
- ``https://github.com/pydata/xarray-data/blob/master/air_temperature.nc?raw=true`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So does this handle remote URLs directly or not? I assume if it is opendap, then yes, in which case the thing about needing caching is wrong (and in fact won't work).
The thredds URL gives Unrecognized Request for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that is terrible wording. I think I meant that OpenDAP urls can/should be used directly and all others with caching. You can't GET thredds urls like that directly. The 400 is correct.
@@ -16,7 +16,7 @@ def test_discover(source, cdf_source, zarr_source, dataset): | |||
r = source.discover() | |||
|
|||
assert r['datashape'] is None | |||
assert r['dtype'] is None | |||
assert r['dtype'] == 'float32' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole dataset has a single dtype?
After conversation with Martin this chunk of work seems unreasonably big. So I am going to make a new PR that just adds an ImagePlugin. |
Closed in favour of #28 . The refactor may become necessary again at a later stage. |
This PR adds an image reader for
xarray
and does a restructure to have a common structure for all plugins inintake-xarray
. In particular this adds the ability to return a dataset rather than a dataarray using themerge_dim
option.I think this will help justify the existence of an intake-xarray plugin at all since it can now be used directly and the other plugins are just helper for using specific readers.
It is likely that this needs lots of work. I rewrote the example notebook to try to explain some of the behavior.