-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/test file lock #652
Conversation
use setUpClass and tearDownClass for testfiles
Good catch! This seems to have worked before because the file just stayed open for the entire time an associated script ran. However, I am not sure about the fix. I think it's overly complicated. According to the xarray docs, the best pattern is a # NOT allowed anymore:
hazard = Hazard.from_raster_xarray("dset.nc")
# NEW pattern:
with xr.open_dataset("dset.nc") as file:
hazard = Hazard.from_raster_xarray(file)
# 'file' is closed here |
I agree on having only one type for data. But I wonder whether that type shouldn't rather be the file instead of the |
The only one I know of is in my new river flood module, see https://climada-petals--64.org.readthedocs.build/en/64/glofas_rf.html#executing-the-pipeline 🙈 It's true that we don't necessarily need it if we don't have (m)any use cases. But on the other hand, it could be overly restrictive if one has a dataset and needs to write it to a file to convert it into a Hazard. The other consideration is performance. As mentioned in the Okay, let's follow your approach, but as simple as possible. We define two methods, one which takes the Dataset and one which takes a file path. Both are public, names are up for discussion. The latter one then calls the first and has a reduced signature: @classmethod
def from_raster_xr_file(cls, filepath, **kwargs):
"""See from_raster_xr"""
with xr.open_dataset(filepath) as dset:
return cls.from_raster_xr(dset, **kwargs)
@classmethod
def from_raster_xr(
cls,
data: xr.Dataset,
hazard_type: str,
intensity_unit: str,
intensity: str,
coordinate_vars: Optional[Dict[str, str]],
data_vars: Optional[Dict[str, str]],
crs: str,
rechunk: bool
):
"""Full docstring goes here"""
# Load data Sometimes I miss C++, overloading would be great here 🙃 |
climada/hazard/base.py
Outdated
if not isinstance(data, xr.Dataset): | ||
LOGGER.info("Loading Hazard from file: %s", data) | ||
data: xr.Dataset = xr.open_dataset(data, chunks="auto") | ||
data_is_mine = True | ||
else: | ||
LOGGER.info("Loading Hazard from xarray Dataset") | ||
data_is_mine = False | ||
try: | ||
return cls._from_raster_xarray_dataset(data, hazard_type, | ||
intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk) | ||
finally: | ||
if data_is_mine: | ||
data.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not isinstance(data, xr.Dataset): | |
LOGGER.info("Loading Hazard from file: %s", data) | |
data: xr.Dataset = xr.open_dataset(data, chunks="auto") | |
data_is_mine = True | |
else: | |
LOGGER.info("Loading Hazard from xarray Dataset") | |
data_is_mine = False | |
try: | |
return cls._from_raster_xarray_dataset(data, hazard_type, | |
intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk) | |
finally: | |
if data_is_mine: | |
data.close() | |
if not isinstance(data, xr.Dataset): | |
LOGGER.info("Loading Hazard from file: %s", data) | |
with xr.open_dataset(data, chunks="auto") as dataset: | |
return cls._from_raster_xarray_dataset(dataset, hazard_type, | |
intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk) | |
else: | |
return cls._from_raster_xarray_dataset(data, hazard_type, | |
intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, overloading... 🤔 I'm not a C++ person but wouldn't it in principle boil down to the suggestion above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. This approach has the advantage that it works the same way with both types of inputs. But it has the following disadvantages, in my opinion:
- It defines a "private" classmethod. As classmethods instantiate objects, they should never be marked as "private".
- Although "private", this classmethod needs to define all parameters and contain the docstring, such that implementation and documentation are not separated.
- The documentation has to either be doubled in the public classmethod, or its docs must refer to the private method that actually indicates users not to touch it
I would rather go for the two-method approach. Users then must take the somewhat concious decision to either open a file themselves or let Climada do that, and choose one of the methods accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to the two-method approach, but I can't quite follow the argumentation:
- why should classmethods never be "private" (in this case, "protected" strictly speaking)?
- I wouldn't want to have a docstring for any "protected" method. They're not supposed to be used by anyone but the package developers and hence shouldn't show up in the API reference. But adding a docstring leads to just that: the method is listed even though it's "protected".
- with the two-methods approach the documentation is de facto doubled - so there is nothing to gain even if we did add a docstring to the "protected" method (which we shouldn't imo, s.a.)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why should classmethods never be "private"
A classmethod indicates a specialized "constructor" for an object. "Private" or "protected" means that the method should only be called by the class (or derived classes) itself. Why should a constructor of an object only be called by the object itself? This seems like an anti-pattern to me 🤔
I wouldn't want to have a docstring for any "protected" method.
That's exactly my point. You have the doctring in the "public" method, but all the implementation is actually in the "protected" method, which does not have a docstring. This separates docs from code, which I find dangerous.
with the two-methods approach the documentation is de facto doubled
No, you just link to the other public method. Compare, e.g., xr.load_dataset
and xr.open_dataset
, I would do it exactly like that.
I'm not opposed to the two-method approach
Shall I come up with a proposal? What do you think about the method names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should a constructor of an object only be called by the object itself?
To me it's still a pattern since, actually, it's called by the class and not the object.
This separates docs from code, which I find dangerous.
Despite the danger I prefer stable docs for the exposed interface alone (exposed as in API reference). How the stuff is dealt with internally must not bother the user and developers are welcome to change it without much fuzz and without fear of disruption.
Public documentation of private methods is kind of corrupting the whole idea of privacy. 😉
Besides, there are dozens of protected methods in CLIMADA without a docstring, so this would not come as a novelty.
Having said that - yes, go ahead. 😃
About the names I'm pretty indifferent, however I prefer *_xarray_*
over *_xr_*
and if it's my choice I place xarray
before raster
, like from_xarray_raster_file
. Your call.
Split the existing classmethod into two methods 'from_xarray_raster' and 'from_xarray_raster_file'. One first loads data from a dataset, the second opens a file and then calls the first. * Split classmethods. * Update unit tests. * Update docstrings.
* Use temporary directory for storing data instead of local directory. * Consistently open data files in context managers.
Done!
|
cls.tempdir = TemporaryDirectory() | ||
cls.netcdf_path = Path(cls.tempdir.name) / "default.nc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peanutfun - why the TemporaryDirectory
? It bears the risk of a disk space leak. 🤔
Also I've got a feeling that there is a lot of threading involved with it which makes debugging much harder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, this should be absolutely fine. This is the exact use case for TemporaryDirectory
, because we can place a file into the least intrusive part of the system. The directory is usually created in a folder that is regularly cleaned up by the OS anyway (e.g., /tmp/
on macOS and Linux). I just found out that the directory is automatically deleted when the TemporaryDirectory
object is destroyed (so calling cleanup
is actually superfluous). See the tempfile docs:
On [...] destruction of the temporary directory object, the newly created temporary directory and all its contents are removed from the filesystem.
Additionally, cleanup errors raise exceptions, so one would notice any issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. cleanup
is perhaps even not just superfluous but also futile. At least this is what if felt like when debugging #637: finding out what went wrong and where was kind of tricky.
I doubt that automatic clean up is on by default on Windows. You'll get a hint when you're really out of space but that's it I think.
However, I agree, the upsides (no messing with the installation directory, careless naming of test files) outweight the downsides.
👍
# Conflicts: # CHANGELOG.md
Very nice! 😄 Thanks a lot. |
Changes proposed in this PR:
The
from_raster_xarray
class method opens anxarray.Dataset
from the given file (unless it is already provided asDataset
) but never closes it. This doesn't matter much on Linux as the io already takes care of it, but it matters on Windows where files are locked until Python exits. This PR solves the problem byDataset
in case it was opened at the end of the methodDataset
is used as argumentProbably unrelated to the described problem, the
test_base_xarray
module has been improved bysetUpClass
andtearDownClass
to create and remove test files instead ofsetUp
andtearDown
, which are called for again and again for each executed test.PR Author Checklist
develop
)PR Reviewer Checklist