Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/test file lock #652

Merged
merged 9 commits into from
Feb 17, 2023
Merged

Feature/test file lock #652

merged 9 commits into from
Feb 17, 2023

Conversation

emanuel-schmid
Copy link
Collaborator

@emanuel-schmid emanuel-schmid commented Feb 10, 2023

Changes proposed in this PR:
The from_raster_xarray class method opens an xarray.Dataset from the given file (unless it is already provided as Dataset) but never closes it. This doesn't matter much on Linux as the io already takes care of it, but it matters on Windows where files are locked until Python exits. This PR solves the problem by

  • closing the Dataset in case it was opened at the end of the method
  • adjusting test methods for the cases where a Dataset is used as argument

Probably unrelated to the described problem, the test_base_xarray module has been improved by

  • using setUpClass and tearDownClass to create and remove test files instead of setUp and tearDown, which are called for again and again for each executed test.

PR Author Checklist

PR Reviewer Checklist

@peanutfun
Copy link
Member

peanutfun commented Feb 14, 2023

Good catch! This seems to have worked before because the file just stayed open for the entire time an associated script ran. However, I am not sure about the fix. I think it's overly complicated. According to the xarray docs, the best pattern is a with statement. So I would propose to simply drop the option of passing a file path to from_raster_xarray. Instead, we advertise using the with pattern in the examples:

# NOT allowed anymore:
hazard = Hazard.from_raster_xarray("dset.nc")

# NEW pattern:
with xr.open_dataset("dset.nc") as file:
    hazard = Hazard.from_raster_xarray(file)

# 'file' is closed here

@emanuel-schmid
Copy link
Collaborator Author

I agree on having only one type for data. But I wonder whether that type shouldn't rather be the file instead of the Dataset.
Are there use cases where an xarray.Dataset containing Hazard data but not originating from a file is converted to a Hazard object?

@peanutfun
Copy link
Member

peanutfun commented Feb 14, 2023

Are there use cases where an xarray.Dataset containing Hazard data but not originating from a file is converted to a Hazard object?

The only one I know of is in my new river flood module, see https://climada-petals--64.org.readthedocs.build/en/64/glofas_rf.html#executing-the-pipeline 🙈 It's true that we don't necessarily need it if we don't have (m)any use cases. But on the other hand, it could be overly restrictive if one has a dataset and needs to write it to a file to convert it into a Hazard.

The other consideration is performance. As mentioned in the rechunk option of Hazard.from_raster_xarray, we can automatically rechunk a given dataset for best performance during the array flattening. But rechunking itself can be expensive and it is only possible once we opened the dataset and know the exact dimension names. Users may know them beforehand and can decide on chunking themselves. However, many users probably are unaware of these issues and they might never be a huge problem 😕

Okay, let's follow your approach, but as simple as possible. We define two methods, one which takes the Dataset and one which takes a file path. Both are public, names are up for discussion. The latter one then calls the first and has a reduced signature:

@classmethod
def from_raster_xr_file(cls, filepath, **kwargs):
    """See from_raster_xr"""
    with xr.open_dataset(filepath) as dset:
        return cls.from_raster_xr(dset, **kwargs)

@classmethod
def from_raster_xr(
    cls,
    data: xr.Dataset,
    hazard_type: str,
    intensity_unit: str,
    intensity: str,
    coordinate_vars: Optional[Dict[str, str]],
    data_vars: Optional[Dict[str, str]],
    crs: str,
    rechunk: bool
):
    """Full docstring goes here"""
    # Load data

Sometimes I miss C++, overloading would be great here 🙃

Comment on lines 651 to 663
if not isinstance(data, xr.Dataset):
LOGGER.info("Loading Hazard from file: %s", data)
data: xr.Dataset = xr.open_dataset(data, chunks="auto")
data_is_mine = True
else:
LOGGER.info("Loading Hazard from xarray Dataset")
data_is_mine = False
try:
return cls._from_raster_xarray_dataset(data, hazard_type,
intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk)
finally:
if data_is_mine:
data.close()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not isinstance(data, xr.Dataset):
LOGGER.info("Loading Hazard from file: %s", data)
data: xr.Dataset = xr.open_dataset(data, chunks="auto")
data_is_mine = True
else:
LOGGER.info("Loading Hazard from xarray Dataset")
data_is_mine = False
try:
return cls._from_raster_xarray_dataset(data, hazard_type,
intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk)
finally:
if data_is_mine:
data.close()
if not isinstance(data, xr.Dataset):
LOGGER.info("Loading Hazard from file: %s", data)
with xr.open_dataset(data, chunks="auto") as dataset:
return cls._from_raster_xarray_dataset(dataset, hazard_type,
intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk)
else:
return cls._from_raster_xarray_dataset(data, hazard_type,
intensity_unit, intensity, coordinate_vars, data_vars, crs, rechunk)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, overloading... 🤔 I'm not a C++ person but wouldn't it in principle boil down to the suggestion above?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. This approach has the advantage that it works the same way with both types of inputs. But it has the following disadvantages, in my opinion:

  • It defines a "private" classmethod. As classmethods instantiate objects, they should never be marked as "private".
  • Although "private", this classmethod needs to define all parameters and contain the docstring, such that implementation and documentation are not separated.
  • The documentation has to either be doubled in the public classmethod, or its docs must refer to the private method that actually indicates users not to touch it

I would rather go for the two-method approach. Users then must take the somewhat concious decision to either open a file themselves or let Climada do that, and choose one of the methods accordingly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to the two-method approach, but I can't quite follow the argumentation:

  • why should classmethods never be "private" (in this case, "protected" strictly speaking)?
  • I wouldn't want to have a docstring for any "protected" method. They're not supposed to be used by anyone but the package developers and hence shouldn't show up in the API reference. But adding a docstring leads to just that: the method is listed even though it's "protected".
  • with the two-methods approach the documentation is de facto doubled - so there is nothing to gain even if we did add a docstring to the "protected" method (which we shouldn't imo, s.a.)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should classmethods never be "private"

A classmethod indicates a specialized "constructor" for an object. "Private" or "protected" means that the method should only be called by the class (or derived classes) itself. Why should a constructor of an object only be called by the object itself? This seems like an anti-pattern to me 🤔

I wouldn't want to have a docstring for any "protected" method.

That's exactly my point. You have the doctring in the "public" method, but all the implementation is actually in the "protected" method, which does not have a docstring. This separates docs from code, which I find dangerous.

with the two-methods approach the documentation is de facto doubled

No, you just link to the other public method. Compare, e.g., xr.load_dataset and xr.open_dataset, I would do it exactly like that.

I'm not opposed to the two-method approach

Shall I come up with a proposal? What do you think about the method names?

Copy link
Collaborator Author

@emanuel-schmid emanuel-schmid Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should a constructor of an object only be called by the object itself?

To me it's still a pattern since, actually, it's called by the class and not the object.

This separates docs from code, which I find dangerous.

Despite the danger I prefer stable docs for the exposed interface alone (exposed as in API reference). How the stuff is dealt with internally must not bother the user and developers are welcome to change it without much fuzz and without fear of disruption.
Public documentation of private methods is kind of corrupting the whole idea of privacy. 😉
Besides, there are dozens of protected methods in CLIMADA without a docstring, so this would not come as a novelty.

Having said that - yes, go ahead. 😃

About the names I'm pretty indifferent, however I prefer *_xarray_* over *_xr_* and if it's my choice I place xarray before raster, like from_xarray_raster_file. Your call.

emanuel-schmid and others added 3 commits February 15, 2023 10:39
Split the existing classmethod into two methods 'from_xarray_raster' and
'from_xarray_raster_file'. One first loads data from a dataset, the
second opens a file and then calls the first.

* Split classmethods.
* Update unit tests.
* Update docstrings.
* Use temporary directory for storing data instead of local directory.
* Consistently open data files in context managers.
@peanutfun
Copy link
Member

Done!

  • Created two public classmethods: from_xarray_raster and from_xarray_raster_file.
  • Updated the docstrings, referencing each other.
  • Updated tests to consistently use context managers for opening files. Added temporary directories instead of writing files next to the source files.

Comment on lines +44 to +45
cls.tempdir = TemporaryDirectory()
cls.netcdf_path = Path(cls.tempdir.name) / "default.nc"
Copy link
Collaborator Author

@emanuel-schmid emanuel-schmid Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peanutfun - why the TemporaryDirectory? It bears the risk of a disk space leak. 🤔
Also I've got a feeling that there is a lot of threading involved with it which makes debugging much harder.

Copy link
Member

@peanutfun peanutfun Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, this should be absolutely fine. This is the exact use case for TemporaryDirectory, because we can place a file into the least intrusive part of the system. The directory is usually created in a folder that is regularly cleaned up by the OS anyway (e.g., /tmp/ on macOS and Linux). I just found out that the directory is automatically deleted when the TemporaryDirectory object is destroyed (so calling cleanup is actually superfluous). See the tempfile docs:

On [...] destruction of the temporary directory object, the newly created temporary directory and all its contents are removed from the filesystem.

Additionally, cleanup errors raise exceptions, so one would notice any issues.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. cleanup is perhaps even not just superfluous but also futile. At least this is what if felt like when debugging #637: finding out what went wrong and where was kind of tricky.
I doubt that automatic clean up is on by default on Windows. You'll get a hint when you're really out of space but that's it I think.

However, I agree, the upsides (no messing with the installation directory, careless naming of test files) outweight the downsides.
👍

@emanuel-schmid emanuel-schmid merged commit 11a707b into develop Feb 17, 2023
@emanuel-schmid
Copy link
Collaborator Author

Very nice! 😄 Thanks a lot.

@emanuel-schmid emanuel-schmid deleted the feature/test-file-lock branch February 17, 2023 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants