-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow custom dataset names in 'generic_image' reader and fix nodata handling #1560
Allow custom dataset names in 'generic_image' reader and fix nodata handling #1560
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1560 +/- ##
==========================================
+ Coverage 92.62% 92.65% +0.02%
==========================================
Files 258 258
Lines 37847 37929 +82
==========================================
+ Hits 35057 35142 +85
+ Misses 2790 2787 -3
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
def test_GenericImageFileHandler_datasetid(self): | ||
"""Test direct use of the reader.""" | ||
from satpy.readers.generic_image import GenericImageFileHandler | ||
from satpy.readers.generic_image import mask_image_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
F401 'satpy.readers.generic_image.mask_image_data' imported but unused
foo = make_dataid(name='image-custom') | ||
self.assertTrue(reader.file_content) | ||
dataset = reader.get_dataset(foo, None) | ||
self.assertTrue(isinstance(dataset, xr.DataArray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W292 no newline at end of file
Could you explain the motivation behind this change? |
Sorry, should be a draft only. My primary objectiv was to create the DWD world composite with a new trollflow2/satpy chain. I have geotiff files of all satellites on disk and segment gatherer should be used to publish an event when everything is available. Then trollflow2/satpy should use the generic image reader to read all images together into the same scene object. I've created a custom reader.yml there I specified each satellites' image as channel. Afterwards I'll try to create a new compositor to stich the images. To read all images together, this patch is necessary. I can provide the reader.yml when I'm back in office. |
Ah I see, the main problem is:
This doesn't use the key provided in the YAML at all. Maybe it would be better to copy the Just some brainstorming. |
Here is an example of the reader config that I'd like to use to read the images into the same scene (GOES16 + 17 is only the first step): reader:
name: world_generic_image
description: generic image reader
reader: !!python/name:satpy.readers.yaml_reader.FileYAMLReader
sensors: [images]
default_channels: [image_goes16, image_goes17]
datasets:
image_goes16:
name: image_goes16
file_type: graphic_goes16
image_goes17:
name: image_goes17
file_type: graphic_goes17
file_types:
graphic_goes16:
file_reader: !!python/name:satpy.readers.generic_image.GenericImageFileHandler
file_patterns:
- 'goes16_{filename}{start_time:%Y%m%d%H%M}.tif'
- 'goes16_{filename}.tif'
graphic_goes17:
file_reader: !!python/name:satpy.readers.generic_image.GenericImageFileHandler
file_patterns:
- 'goes17_{filename}{start_time:%Y%m%d%H%M}.tif'
- 'goes17_{filename}.tif' |
def read(self):
"""Read the image."""
dataset = rasterio.open(self.finfo['filename'])
...
self.file_content[key['name']] = data ... until I noticed that it's not available in this stage. I've thought about moving the whole reading stuff to |
Yeah hard to say. @pnuu I think wrote this reader originally and mind have ideas for what can and can't be done on |
Sorry, this might not be very helpful in respect to the changes needed to do this purely in Satpy.
This sounds very much like It collected the images similar to segment gatherer, then cut the input images very naïvely at the GEO satellites' mid-point longitudes and put the images together. It required the input images to be in latlon grid, didn't do any blending, and had some other simplifications. My current solution is to rely on putting the images (in what ever projection they are generated in) to WMS and let the client software do the requests so that the images are in correct projection, and let it also handle (possible) blending, arrangement and everything else. |
@pnuu I remember that we discussed about it in the past. Currently we still have a similar solution running. It uses some reading and writing functions provided by good old mpop. It has to be migrated. I hope to use as much satpy/trollflow2 functionality as possible. It should be parametrized as all the other stuff. The core mosaic creation would be done in a new satpy compositor. Geoserver is not an option for our use case. |
So would the use be like this? fnames = ["goes16_true_color_202102241200.tif", "goes17_true_color_202102241200.tif"]
scn = Scene(reader="world_generic_reader", filenames=fnames)
scn.load(["image_goes16", "image_goes17"]) And another question: how are you planning to handle resampling so that the intertwined pixels do not cause weird effects where the images overlap? The plain nearest neighbour resampling will just blindly use the closest pixel to each target location, so the illumination, parallax and (very slightly) calibration will cause artifacts on the overlapping areas. Maybe |
Does trollflow2 work with |
No, not at the moment. Creating a plugin that supports the usage of it should be pretty simple. |
I've added some code to handle the fill_value when reading GeoTIFFs.
@pnuu: What do think? I see my PR more as a fix than an enhancement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple suggestions to make this more flexible...I think.
satpy/readers/generic_image.py
Outdated
@@ -117,4 +117,9 @@ def mask_image_data(data): | |||
for i in range(data.shape[0])]) | |||
data.data = masked_data | |||
data = data.sel(bands=BANDS[data.bands.size - 1]) | |||
elif hasattr(data, 'nodatavals') and data.nodatavals: | |||
data = data.astype(np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have to convert to float here? If we can keep the initial data type and use a fill value dependent on that data type then it may work for more situations. What if you copied the nodatavals to .attrs['_FillValue']
for integer types, but replaced them with np.nan
for floats? This would match what we do in other readers for "category" products.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure what's better option. Unfortunately, NaN is only possible with floats and I have the feeling that most of the compositors only work with NaNs as masks. It seems that only ColormapCompositor cares about "_FillValue". That's why I've done the float conversation as it is already the case if alpha channels are present.
Maybe I should introduce a new reader parameter and let the user decide how to handle nodatavals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just stumbled across the following paragraph in the satpy docs (https://satpy.readthedocs.io/en/latest/dev_guide/custom_reader.html)
Single precision floats (np.float32) is a good compromise, as it has 23 significant bits (mantissa) and can thus represent 16 bit integers exactly, as well as keeping the memory footprint half of a double precision float.
One commonly used method in readers is xarray.DataArray.where() (to mask invalid data) which can be coercing the data to np.float64. To ensure for example that integer data is coerced to np.float32 when xarray.DataArray.where() is used, you can do ..
So converting to floats to mask data with NaN seems a common practice. As noted in the docs, float32 would be better. Conversion to floats is already done few lines above when alpha channels are handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is more specifically talking about integer data that is in the file as integers and needs to be scaled (ex. data * scale_factor + add_offset
) to floats, but also has a fill value. In that case you'd need to do the .where
call to mask out the floats with some other value. In Satpy that is NaN for floating point data and any _FillValue
for integer data. In older versions of numpy (I think newer versions have fixed this) if you do .where
with np.nan
it would cast 32-bit floats to 64-bit floats. Now it knows that you really mean 32-bit float -> 32-bit float NaN. For integers, they would always be cast to 64-bit floats so the note in the documentation is trying to say "there is no reason for that so force 32-bit float conversion".
The note should probably be updated to refer to integer fields that are meant to stay integers like cloud type products or in this case 8-bit unsigned integer RGB images. If an RGB is being read in by Satpy and is provided as floating point data (instead of as an integer) then it should probably be normalized to a 0-1 range as that is the assumption by the trollimage package when saving images.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how to go one here. When I have an image (i.e. 1 band, 8bit int), I do not know without prior knowledge whether the values represent categories or not. Additionally, maybe scaling (according to parameters store in the file itself) is wanted or normalization to [0..1] - or nothing at all. At the moment (without the changes in this PR) no scaling/normalization/calibration is applied. They are returned as stored in the file.
I think the behaviour of the reader should be parametrized. I'm not very familiar with this but looking at other readers, the dataset sections in the reader yaml might be a way. Stuff like calibration, units, fill_value, etc. seems to be specified there.
Maybe we can introduce a parameter like mask_type: floatNan
to define what the reader should do.
I don't want to risk any problems for other users of this reader. Therefore default behaviour would be to do nothing (no type conversion and nan-masking) and the "float-nan-masking" is only active when it's enabled in the reader config.
Are you okay with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things like scale factor and add offset in a geotiff are not standard so I don't think we should add any support for that. Support for that should be added in a separate reader (a custom reader created by the user).
Indeed there are a lot of attributes that are not standarized (geotiff and beyond). I do not really care about the scaling or offsets at the moment (only masks :-) ). But I think that readers in general should provide the possibility to output data that is comparable to data provided by other readers. They should intruduce a level of abstraction from the files. It's not very convenient for users of satpy when you have to adjust/modify compositors when switching between different readers. IMHO readers (that share the same API) should also return similar data structures or at least can be configured to do so.
In other readers, things like units are just metadata to fill in the space of missing information in the file. In those cases, the reader is specific to a particular implementation of a data file (VIIRS SDR in the HDF5 format versus generic HDF5 files) or are representing a standard (CF compliant NetCDF files). This generic image reader should try to use standard geotiff features.
This reader is not only used to read GeoTiff. As far as I see, there are filetypes (png, ..) that this reader can handle. But then it is even more necessary to provide more reader options to specified how to handle this data.
Maybe these kind of reader parametrizations are then highly use case dependent and do not belong to "core" satpy (aka satpy/etc/readers/xxx.yaml
) but the reader implementation in satpy (satpy/readers/xxx.py
) should offer the set of tools (parameters) to control the handling of the input data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO readers (that share the same API) should also return similar data structures or at least can be configured to do so.
Yeah, sure, but the question is what is the point of this pull request and how much work do you want to include in it? I am totally fine with the python code allowing for all sorts of flexibility as long as it is documented and that the defaults work. This is the "generic" image reader so none of this extra functionality should be required to get something out of the reader. Adding options to a static list of datasets in the YAML sounds fine.
I think this reader should fix the dataset naming so it can be overridden in the YAML. If it includes some minimal nodata handling for geotiffs, then great. Anything beyond that should probably go in a separate PR, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original use for which I wrote this reader was to read any image as an image and use it as a background for BackgroundCompositor
and to make it possible to resample them to a specific area. So in a way, I never considered it for anything else than categorical data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sure, but the question is what is the point of this pull request and how much work do you want to include in it? I am totally fine with the python code allowing for all sorts of flexibility as long as it is documented and that the defaults work. This is the "generic" image reader so none of this extra functionality should be required to get something out of the reader. Adding options to a static list of datasets in the YAML sounds fine.
I think this reader should fix the dataset naming so it can be overridden in the YAML. If it includes some minimal nodata handling for geotiffs, then great. Anything beyond that should probably go in a separate PR, right?
My latest commit should address this points. There's now an option to control the handling of nodata. Default is to do nothing with the data but copying the nodata value (first element) to dataset.attrs['_FillValue'].
Beside that you can add the option "nodata_handling" to the readers yaml dataset sections. If it is set to "nan_mask", the data is converted to float32 and masked with np.nan.
Anything beyond that should probably go in a separate PR, right?
Absolutely right. Anything else should be in another PR. But I need this together do get my stuff working and it's hard to handle multiple PRs addressing the same file.
If you have any proposals for better naming of this option - you're welcome :-)
Hope this commit fulfills the demands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original use for which I wrote this reader was to read any image as an image and use it as a background for
BackgroundCompositor
and to make it possible to resample them to a specific area. So in a way, I never considered it for anything else than categorical data.
It's good to have such reader. Everything starts small. I hope that your needs are also covered with this modifications and the code still doesn't get too complex. These are only small changes and it may be more confusing to implement this in a separate reader.
satpy/readers/generic_image.py
Outdated
logger.debug("Reading %s.", key) | ||
return self.file_content[key['name']] | ||
logger.debug("Reading 'image' (ignoring provided key %s).", key) | ||
return self.file_content['image'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about you either:
- Copy
self.file_content['image']
(with.copy()
method) and update the attributes with theinfo
dictionary. Likeimg_copy.attrs.update(info)
. I think this should update thename
and any otherDataID
information that someone might include in the YAML. - Keep the old way of doing
key['name']
but update theread
method to write the file content to the rightfile_content
key.
Option 1 is easiest. Option 2 provides the most flexibility in the future. @pnuu let us know if there is a reason not to move .read
to the get_dataset
call as that also seems like it could help here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wait for @pnuu 's input and I'll change it accordingly. Option 2 seems nicer to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'image'
is there just to make things work with the existing way Satpy is designed to load data with a named dataset. In an image, there's only one dataset and scn.load()
should probably just load the image. But yeah, when loading data with multiple readers that'd not work.
I think option 2 is the better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to remove the read()
completely as it is only used in the ctor and move the reading stuff to get_dataset()
. Then there's also no need to keep an attribute self.file_content
any longer. The "How-To for custom readers" (https://satpy.readthedocs.io/en/stable/dev_guide/custom_reader.html) only demands implementing get_dataset()
.
And I couldn't find any other readers that use GenericImageFileHandler
as base class.
Keeping self.file_content
seems only necessary for caching reasons. But I think there are other mechanisms in satpy that care about this, right?
Okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding moving the functionality: sounds good as long as the code is clean.
Regarding file_content, that is likely just a common practice used from other readers. It also makes writing tests easier, at least in other readers, because complex NetCDF/HDF files can be described by a dictionary and put in file_content
. For this reader, maybe that isn't necessary. As long as the tests can be made to work with whatever you do then 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've decided to lease reading in ctor as it is because otherwise get_area_def()
needs special treatment. The area is set during reading. Although it seems that get_area_def()
is called after get_dataset()
and therefore area should be already available in my tests - it don't know if it's the case in all situations..
The latest commit introduces a class attribute dataset_name
that is set to default image
in the read function. In derived fake readers you can provide own file_content and leave ds_name
empty to work with multiple datasets - if somebody really wants to do this..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI there was a semi-recent change in the base reader that changed the order of when the dataset was loaded and when the area was loaded. It used to be that the area was always loaded first (if I remember correctly), but now it is loaded after.
Could you add a description )to the description field) of what this PR really is about? Looking at the the code, I'm also unsure about the title of the PR. Maybe clarify the title too? |
I've added a description text (sorry for that). But I don't have a better title.. |
DeepCode's analysis on #0f9339 found:
Top issues
👉 View analysis in DeepCode’s Dashboard | Configure the bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good. Thanks for the work. I have a couple suggestions and changes I'd like to talk about that I've made comments about.
satpy/readers/generic_image.py
Outdated
|
||
|
||
def mask_image_data(data): | ||
"""Mask image data if alpha channel is present.""" | ||
def mask_image_data(data, info=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is info
ever not provided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. This function is used in test_generic_image.py
. That's why I fear it is part of the public interface and used by somebody somewhere..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good catch. Unless @pnuu says otherwise, I'd say update the tests to pass {}
for info
and make info
a requirement parameter here.
If you have the time it would be nice to make this function private (_
prefix) and update the tests to not test this function specifically, but the higher level expected output of the reader instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What Dave said. And indeed, this ought to be private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
satpy/readers/generic_image.py
Outdated
try: | ||
nodata_handling = info['nodata_handling'] if info else NODATA_HANDLING_FILLVALUE | ||
except KeyError: | ||
nodata_handling = NODATA_HANDLING_FILLVALUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming info
is never None
as it is always passed, then this block could be simplified to:
try: | |
nodata_handling = info['nodata_handling'] if info else NODATA_HANDLING_FILLVALUE | |
except KeyError: | |
nodata_handling = NODATA_HANDLING_FILLVALUE | |
nodata_handling = info.get('nodata_handling', NODATA_HANDLING_FILLVALUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, would be much nicer. If we consider mask_image_data
as private function and info as not none, I can change this.
fname = os.path.join(self.base_dir, 'test_l_nan_fillvalue.tif') | ||
scn = Scene(reader='generic_image', filenames=[fname]) | ||
scn.load(['image']) | ||
self.assertEqual(scn['image'].shape, (1, self.y_size, self.x_size)) | ||
self.assertEqual(np.sum(scn['image'].data[0][:10, :10].compute()), 0) | ||
|
||
fname = os.path.join(self.base_dir, 'test_l_nan_nofillvalue.tif') | ||
scn = Scene(reader='generic_image', filenames=[fname]) | ||
scn.load(['image']) | ||
self.assertEqual(scn['image'].shape, (1, self.y_size, self.x_size)) | ||
self.assertTrue(np.all(np.isnan(scn['image'].data[0][:10, :10].compute()))) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind separating these out into their own test functions. I think we've generally switched to not combining a ton of test cases into one function whenever possible. It makes it easier to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Co-authored-by: David Hoese <[email protected]>
….com/ch-k/satpy into fix-generic-image-reader-datasetname
satpy/readers/generic_image.py
Outdated
|
||
return data | ||
|
||
def __mask_image_data(self, data, info): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this become a class method? Couldn't/shouldn't this remain a function outside the class? Also, please use a single underscore. Double underscore has special handling in python and isn't needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, too much java in the last months. I'll fix it tomorrow morning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it.
|
||
def __init__(self, filename, filename_info, filetype_info, file_content, **kwargs): | ||
"""Get fake file content from 'get_test_content'.""" | ||
if GenericImageFileHandler is object: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will this import ever return object
? Also, does this class need to be inside the test function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's basically a copy of an other fake reader (from file test_hdf5_utils.py). I supposed it is best practice t odo so (without knowning why..). Of cource I can remove it. I've already noted that some the checkers marked this as issue.
It is an inner class because all the other test routines in this file do their imports in the method itself. I thought, it would be better to do the same here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think now that we are using pytest for tests it isn't as important to have the various imports in the test functions themselves. @mraspaud thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just removed if GenericImageFileHandler is object
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tihnk it's fine without the check
Last request: Could you look at fixing some of the style issues from codebeat and deepcode? It looks like one of the deepcode ones will be unfixable (the one regarding |
Fixed some of the issues. But I don't really know how to make codebeat happy. These two remaining issues are classified as info. I can't imagine that further reducing of ACB or branch nesting makes the code more readable and maintainable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks.
Thanks for making the attempt. And yeah, removing all the issues identified by these tools is just something you get used to and readability is arguable. |
This PR is about two issues. Sorry about mixing it, but it concerns the same file.
At the moment it is not possible to use the GenericImageFileHandler for reader configs that define multiple dataset names. This is an example:
That doesn't work because a hardcoded name
image
(https://github.com/pytroll/satpy/blob/master/satpy/readers/generic_image.py#L85) is used to set the data and the name used in the reader config is used to get the data. Therefore only one dataset with nameimage
is possible. Otherwise the reader complains that the dataset is not available.The fill value of GeoTIFF files is not used to mask the data. With this PR,
fill value
pixels are set to nan as it is already the case when there are alpha channels present.AUTHORS.md
if not there already