Feature/test file lock #652

emanuel-schmid · 2023-02-10T14:14:56Z

Changes proposed in this PR:
The from_raster_xarray class method opens an xarray.Dataset from the given file (unless it is already provided as Dataset) but never closes it. This doesn't matter much on Linux as the io already takes care of it, but it matters on Windows where files are locked until Python exits. This PR solves the problem by

closing the Dataset in case it was opened at the end of the method
adjusting test methods for the cases where a Dataset is used as argument

Probably unrelated to the described problem, the test_base_xarray module has been improved by

using setUpClass and tearDownClass to create and remove test files instead of setUp and tearDown, which are called for again and again for each executed test.

PR Author Checklist

PR Reviewer Checklist

use setUpClass and tearDownClass for testfiles

peanutfun · 2023-02-14T09:16:43Z

Good catch! This seems to have worked before because the file just stayed open for the entire time an associated script ran. However, I am not sure about the fix. I think it's overly complicated. According to the xarray docs, the best pattern is a with statement. So I would propose to simply drop the option of passing a file path to from_raster_xarray. Instead, we advertise using the with pattern in the examples:

# NOT allowed anymore:
hazard = Hazard.from_raster_xarray("dset.nc")

# NEW pattern:
with xr.open_dataset("dset.nc") as file:
    hazard = Hazard.from_raster_xarray(file)

# 'file' is closed here

emanuel-schmid · 2023-02-14T11:31:30Z

I agree on having only one type for data. But I wonder whether that type shouldn't rather be the file instead of the Dataset.
Are there use cases where an xarray.Dataset containing Hazard data but not originating from a file is converted to a Hazard object?

peanutfun · 2023-02-14T13:19:04Z

Are there use cases where an xarray.Dataset containing Hazard data but not originating from a file is converted to a Hazard object?

The only one I know of is in my new river flood module, see https://climada-petals--64.org.readthedocs.build/en/64/glofas_rf.html#executing-the-pipeline 🙈 It's true that we don't necessarily need it if we don't have (m)any use cases. But on the other hand, it could be overly restrictive if one has a dataset and needs to write it to a file to convert it into a Hazard.

The other consideration is performance. As mentioned in the rechunk option of Hazard.from_raster_xarray, we can automatically rechunk a given dataset for best performance during the array flattening. But rechunking itself can be expensive and it is only possible once we opened the dataset and know the exact dimension names. Users may know them beforehand and can decide on chunking themselves. However, many users probably are unaware of these issues and they might never be a huge problem 😕

Okay, let's follow your approach, but as simple as possible. We define two methods, one which takes the Dataset and one which takes a file path. Both are public, names are up for discussion. The latter one then calls the first and has a reduced signature:

@classmethod
def from_raster_xr_file(cls, filepath, **kwargs):
    """See from_raster_xr"""
    with xr.open_dataset(filepath) as dset:
        return cls.from_raster_xr(dset, **kwargs)

@classmethod
def from_raster_xr(
    cls,
    data: xr.Dataset,
    hazard_type: str,
    intensity_unit: str,
    intensity: str,
    coordinate_vars: Optional[Dict[str, str]],
    data_vars: Optional[Dict[str, str]],
    crs: str,
    rechunk: bool
):
    """Full docstring goes here"""
    # Load data

Sometimes I miss C++, overloading would be great here 🙃

emanuel-schmid · 2023-02-14T14:55:58Z

climada/hazard/base.py

        if not isinstance(data, xr.Dataset):
            LOGGER.info("Loading Hazard from file: %s", data)
            data: xr.Dataset = xr.open_dataset(data, chunks="auto")
+            data_is_mine = True
        else:
            LOGGER.info("Loading Hazard from xarray Dataset")
+            data_is_mine = False
+        try:
+            return cls._from_raster_xarray_dataset(data, hazard_type,
+                intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk)
+        finally:
+            if data_is_mine:
+                data.close()


Suggested change

if not isinstance(data, xr.Dataset):

LOGGER.info("Loading Hazard from file: %s", data)

data: xr.Dataset = xr.open_dataset(data, chunks="auto")

data_is_mine = True

else:

LOGGER.info("Loading Hazard from xarray Dataset")

data_is_mine = False

try:

return cls._from_raster_xarray_dataset(data, hazard_type,

intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk)

finally:

if data_is_mine:

data.close()

if not isinstance(data, xr.Dataset):

LOGGER.info("Loading Hazard from file: %s", data)

with xr.open_dataset(data, chunks="auto") as dataset:

return cls._from_raster_xarray_dataset(dataset, hazard_type,

intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk)

else:

return cls._from_raster_xarray_dataset(data, hazard_type,

intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk)

Hmm, overloading... 🤔 I'm not a C++ person but wouldn't it in principle boil down to the suggestion above?

Indeed. This approach has the advantage that it works the same way with both types of inputs. But it has the following disadvantages, in my opinion:

It defines a "private" classmethod. As classmethods instantiate objects, they should never be marked as "private".

Although "private", this classmethod needs to define all parameters and contain the docstring, such that implementation and documentation are not separated.

The documentation has to either be doubled in the public classmethod, or its docs must refer to the private method that actually indicates users not to touch it

I would rather go for the two-method approach. Users then must take the somewhat concious decision to either open a file themselves or let Climada do that, and choose one of the methods accordingly.

I'm not opposed to the two-method approach, but I can't quite follow the argumentation:

why should classmethods never be "private" (in this case, "protected" strictly speaking)?

I wouldn't want to have a docstring for any "protected" method. They're not supposed to be used by anyone but the package developers and hence shouldn't show up in the API reference. But adding a docstring leads to just that: the method is listed even though it's "protected".

with the two-methods approach the documentation is de facto doubled - so there is nothing to gain even if we did add a docstring to the "protected" method (which we shouldn't imo, s.a.)?

why should classmethods never be "private"

A classmethod indicates a specialized "constructor" for an object. "Private" or "protected" means that the method should only be called by the class (or derived classes) itself. Why should a constructor of an object only be called by the object itself? This seems like an anti-pattern to me 🤔

I wouldn't want to have a docstring for any "protected" method.

That's exactly my point. You have the doctring in the "public" method, but all the implementation is actually in the "protected" method, which does not have a docstring. This separates docs from code, which I find dangerous.

with the two-methods approach the documentation is de facto doubled

No, you just link to the other public method. Compare, e.g., xr.load_dataset and xr.open_dataset, I would do it exactly like that.

I'm not opposed to the two-method approach

Shall I come up with a proposal? What do you think about the method names?

Why should a constructor of an object only be called by the object itself?

To me it's still a pattern since, actually, it's called by the class and not the object.

This separates docs from code, which I find dangerous.

Despite the danger I prefer stable docs for the exposed interface alone (exposed as in API reference). How the stuff is dealt with internally must not bother the user and developers are welcome to change it without much fuzz and without fear of disruption.
Public documentation of private methods is kind of corrupting the whole idea of privacy. 😉
Besides, there are dozens of protected methods in CLIMADA without a docstring, so this would not come as a novelty.

Having said that - yes, go ahead. 😃

About the names I'm pretty indifferent, however I prefer *_xarray_* over *_xr_* and if it's my choice I place xarray before raster, like from_xarray_raster_file. Your call.

Split the existing classmethod into two methods 'from_xarray_raster' and 'from_xarray_raster_file'. One first loads data from a dataset, the second opens a file and then calls the first. * Split classmethods. * Update unit tests. * Update docstrings.

* Use temporary directory for storing data instead of local directory. * Consistently open data files in context managers.

peanutfun · 2023-02-16T10:54:59Z

Done!

Created two public classmethods: from_xarray_raster and from_xarray_raster_file.
Updated the docstrings, referencing each other.
Updated tests to consistently use context managers for opening files. Added temporary directories instead of writing files next to the source files.

emanuel-schmid · 2023-02-16T16:06:29Z

climada/hazard/test/test_base_xarray.py

+        cls.tempdir = TemporaryDirectory()
+        cls.netcdf_path = Path(cls.tempdir.name) / "default.nc"


@peanutfun - why the TemporaryDirectory? It bears the risk of a disk space leak. 🤔
Also I've got a feeling that there is a lot of threading involved with it which makes debugging much harder.

I don't think so, this should be absolutely fine. This is the exact use case for TemporaryDirectory, because we can place a file into the least intrusive part of the system. The directory is usually created in a folder that is regularly cleaned up by the OS anyway (e.g., /tmp/ on macOS and Linux). I just found out that the directory is automatically deleted when the TemporaryDirectory object is destroyed (so calling cleanup is actually superfluous). See the tempfile docs:

On [...] destruction of the temporary directory object, the newly created temporary directory and all its contents are removed from the filesystem.

Additionally, cleanup errors raise exceptions, so one would notice any issues.

I see. cleanup is perhaps even not just superfluous but also futile. At least this is what if felt like when debugging #637: finding out what went wrong and where was kind of tricky.
I doubt that automatic clean up is on by default on Windows. You'll get a hint when you're really out of space but that's it I think.

However, I agree, the upsides (no messing with the installation directory, careless naming of test files) outweight the downsides.
👍

# Conflicts: # CHANGELOG.md

emanuel-schmid · 2023-02-17T12:37:30Z

Very nice! 😄 Thanks a lot.

emanuel-schmid added 2 commits February 10, 2023 14:59

hazard.from_raster_xarray: close opened dataset at last

4d52ecf

hazard.test_base_xarray: close datasets

7e052b3

use setUpClass and tearDownClass for testfiles

emanuel-schmid requested a review from peanutfun February 10, 2023 14:19

CHANGELOG: added from_raster_xarray

f61f095

emanuel-schmid commented Feb 14, 2023

View reviewed changes

emanuel-schmid and others added 3 commits February 15, 2023 10:39

hazard.Hazard.from_raster_xarray: idiomatic use of with open

1aaf64c

Improve tests for Hazard.from_xarray_raster(_file)

52e3f28

* Use temporary directory for storing data instead of local directory. * Consistently open data files in context managers.

peanutfun added 2 commits February 16, 2023 11:56

Fix linter issue "no-else-raise"

1d964a6

Update CHANGELOG.md

3363c67

peanutfun approved these changes Feb 16, 2023

View reviewed changes

emanuel-schmid commented Feb 16, 2023

View reviewed changes

Merge branch 'develop' into feature/test-file-lock

501e333

# Conflicts: # CHANGELOG.md

emanuel-schmid merged commit 11a707b into develop Feb 17, 2023

emanuel-schmid deleted the feature/test-file-lock branch February 17, 2023 12:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/test file lock #652

Feature/test file lock #652

emanuel-schmid commented Feb 10, 2023 •

edited

Loading

peanutfun commented Feb 14, 2023 •

edited

Loading

emanuel-schmid commented Feb 14, 2023

peanutfun commented Feb 14, 2023 •

edited

Loading

emanuel-schmid Feb 14, 2023

emanuel-schmid Feb 14, 2023

peanutfun Feb 14, 2023

emanuel-schmid Feb 15, 2023

peanutfun Feb 15, 2023

emanuel-schmid Feb 15, 2023 •

edited

Loading

peanutfun commented Feb 16, 2023

emanuel-schmid Feb 16, 2023 •

edited

Loading

peanutfun Feb 16, 2023 •

edited

Loading

emanuel-schmid Feb 17, 2023

emanuel-schmid commented Feb 17, 2023

		cls.tempdir = TemporaryDirectory()
		cls.netcdf_path = Path(cls.tempdir.name) / "default.nc"

Feature/test file lock #652

Feature/test file lock #652

Conversation

emanuel-schmid commented Feb 10, 2023 • edited Loading

PR Author Checklist

PR Reviewer Checklist

peanutfun commented Feb 14, 2023 • edited Loading

emanuel-schmid commented Feb 14, 2023

peanutfun commented Feb 14, 2023 • edited Loading

emanuel-schmid Feb 14, 2023

Choose a reason for hiding this comment

emanuel-schmid Feb 14, 2023

Choose a reason for hiding this comment

peanutfun Feb 14, 2023

Choose a reason for hiding this comment

emanuel-schmid Feb 15, 2023

Choose a reason for hiding this comment

peanutfun Feb 15, 2023

Choose a reason for hiding this comment

emanuel-schmid Feb 15, 2023 • edited Loading

Choose a reason for hiding this comment

peanutfun commented Feb 16, 2023

emanuel-schmid Feb 16, 2023 • edited Loading

Choose a reason for hiding this comment

peanutfun Feb 16, 2023 • edited Loading

Choose a reason for hiding this comment

emanuel-schmid Feb 17, 2023

Choose a reason for hiding this comment

emanuel-schmid commented Feb 17, 2023

emanuel-schmid commented Feb 10, 2023 •

edited

Loading

peanutfun commented Feb 14, 2023 •

edited

Loading

peanutfun commented Feb 14, 2023 •

edited

Loading

emanuel-schmid Feb 15, 2023 •

edited

Loading

emanuel-schmid Feb 16, 2023 •

edited

Loading

peanutfun Feb 16, 2023 •

edited

Loading