Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize concatenation of centroids #989

Open
wants to merge 13 commits into
base: develop
Choose a base branch
from

Conversation

NicolasColombi
Copy link
Collaborator

@NicolasColombi NicolasColombi commented Dec 19, 2024

Changes proposed in this PR:

Refer to #988 for detailed information on the profiling results.

Modify hazard.append() function in climada/hazard/centroids/centr.py so that concatenation of centroids is made only once, when all centroids has already been appended, instead of concatenating every time a centroid is appended (extremely time consuming)

  • Create centroid attribute batch_gdf in hazard.append()
  • Append centroids to batch_gdf
  • Define concatenation function: finalize_append()
  • Call finalize_append() to concatenate at the end of the appending process
  • Clear batch_gdf attribute

This PR fixes #988

Optimize the function TropCyclone.from_tracks from 53 minutes to 4.03 minutes.

PR Author Checklist

PR Reviewer Checklist

@peanutfun
Copy link
Member

peanutfun commented Dec 19, 2024

@NicolasColombi Thanks for coming up with this bugfix! If I understand correctly, the problem is that union iterates over multiple others, and for each of these calls append, which in turn calls the costly pd.concat each time. Indeed, concat creates copies, so you should call it as rarely as possible with as many arguments as possible, see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concat.

You solve the issue by introducing a new attribute and function, which I think is error-prone. I would suggest that append should be updated to support a list of other centroids. union can then call append right away with the list.

   def append(self, *centr):
        for cc in centr:
            if not u_coord.equal_crs(self.crs, cc.crs):
                raise ValueError(
                    f"The given centroids use different CRS: {self.crs}, {cc.crs}. "
                    "The centroids are incompatible and cannot be concatenated."
                )
        self.gdf = pd.concat([self.gdf] + [cc.gdf for cc in centr])

I think this is a better solution because calling append (and hence concat) multiple times seems to be bad practice, and with the new signature we suggest doing otherwise.

@NicolasColombi
Copy link
Collaborator Author

NicolasColombi commented Dec 19, 2024

@NicolasColombi Thanks for coming up with this bugfix! If I understand correctly, the problem is that union iterates over multiple others, and for each of these calls append, which in turn calls the costly pd.concat each time. Indeed, concat creates copies, so you should call it as rarely as possible with as many arguments as possible, see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concat.

You solve the issue by introducing a new attribute and function, which I think is error-prone. I would suggest that append should be updated to support a list of other centroids. union can then call append right away with the list.

   def append(self, *centr):
        for cc in centr:
            if not u_coord.equal_crs(self.crs, cc.crs):
                raise ValueError(
                    f"The given centroids use different CRS: {self.crs}, {cc.crs}. "
                    "The centroids are incompatible and cannot be concatenated."
                )
        self.gdf = pd.concat([self.gdf] + [cc.gdf for cc in centr])

I think this is a better solution because calling append (and hence concat) multiple times seems to be bad practice, and with the new signature we suggest doing otherwise.

Hi @peanutfun, fine by me! With the way you propose we could even get rid completely of union() correct ? As if I understand you right, you would make append() do the job of union() with the constraints of concatenating the already appended list (hence, doing it only once and thus being fast). Or as you said, we make union call append with directly the list, but then, union() wouldn't do anything else except calling append(), or am I missing something ?

@NicolasColombi
Copy link
Collaborator Author

@NicolasColombi Thanks for coming up with this bugfix! If I understand correctly, the problem is that union iterates over multiple others, and for each of these calls append, which in turn calls the costly pd.concat each time. Indeed, concat creates copies, so you should call it as rarely as possible with as many arguments as possible, see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concat.
You solve the issue by introducing a new attribute and function, which I think is error-prone. I would suggest that append should be updated to support a list of other centroids. union can then call append right away with the list.

   def append(self, *centr):
        for cc in centr:
            if not u_coord.equal_crs(self.crs, cc.crs):
                raise ValueError(
                    f"The given centroids use different CRS: {self.crs}, {cc.crs}. "
                    "The centroids are incompatible and cannot be concatenated."
                )
        self.gdf = pd.concat([self.gdf] + [cc.gdf for cc in centr])

I think this is a better solution because calling append (and hence concat) multiple times seems to be bad practice, and with the new signature we suggest doing otherwise.

Hi @peanutfun, fine by me! With the way you propose we could even get rid completely of union() correct ? As if I understand you right, you would make append() do the job of union() with the constraints of concatenating the already appended list (hence, doing it only once and thus being fast). Or as you said, we make union call append with directly the list, but then, union() wouldn't do anything else except calling append(), or am I missing something ?

@peanutfun
Copy link
Member

With the way you propose we could even get rid completely of union() correct ?

Not necessarily. It's a bit unclear, but append and union do different things:

  • append adds more centroids to the current object
  • union creates a copy of the current object, adds more centroids, removes duplicate points, and returns the new object

Arguably, append is a bit superfluous. But I would not want to remove an established function as a reaction to a bug report.

@NicolasColombi NicolasColombi marked this pull request as ready for review December 19, 2024 15:18
@NicolasColombi
Copy link
Collaborator Author

Not necessarily. It's a bit unclear, but append and union do different things:

Yes, that is why I thought that a single function could do both, but as you mentioned it is probably better to keep them separate, one of the reason being that all tests pass and that it shouldn't compromise the rest of the code this way.

Copy link
Member

@peanutfun peanutfun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the PR! It looks much cleaner now. Please fix the remaining small issues and update the PR description.

climada/hazard/centroids/centr.py Outdated Show resolved Hide resolved
"The centroids are incompatible and cannot be concatenated."
)
self.gdf = pd.concat([self.gdf, centr.gdf])
for cc in centr:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the linter issue. cc is too short, maybe use other?

@@ -331,11 +331,16 @@ def from_pnt_bounds(cls, points_bounds, res, crs=DEF_CRS):
}
)

def append(self, centr):
"""Append Centroids
def append(self, *centr):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a dedicated test for append with multiple arguments

CHANGELOG.md Outdated Show resolved Hide resolved
@NicolasColombi
Copy link
Collaborator Author

@peanutfun I added a test to verify that append() can accept multiple arguments, the docs looks good with :py:meth:, the changelog is also updated now as you suggested.

Co-authored-by: Lukas Riedel <[email protected]>
Copy link
Collaborator

@bguillod bguillod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this improvements, it makes a lot of sense and the performance gain is significant indeed! A few small comments but looks good to me overall.

centr : Centroids
Centroids to append. The centroids need to have the same CRS.
centr : list
List of Centroids to append. The centroids need to have the same CRS.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since centr is given as a *args, it does not have to be a list but can be several centroids, no? Formulated differently, can this method be called these two ways with equivalent results or does only one way work?

centr.append(new_centrs1, new_centrs2)
centr.append([new_centrs1, new_centrs2])

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding, the code as currently written would work if you call either centr.append(new_centrs1, new_centrs2) or centr.append(*[new_centrs1, new_centrs2]) but would not work calling centr.append([new_centrs1, new_centrs2]), since it wouldn't unpack the list into single centroid objects before the loop for other in centr which checks the CRS.

@NicolasColombi and @peanutfun, should we maybe consider adapting the method so that it would also work when passing a list without *?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Benoit, I think you are right, I will update the docs strings, it shouldn't be a list.

others : list of Centroids
Centroids contributing to the union.
others : list
List of Centroids contributing to the union.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as above: if it's an *args then it's not a list, no?

Copy link
Collaborator

@sarah-hlsn sarah-hlsn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good to me, thanks a lot!
Just fixed some minor typos in the docstrings.

As @bguillod highlighted, there could be some inconsistency associated with how the centr argument is passed to the method/how it is defined in the docstrings. I think we should either add a fix to support different ways of passing this argument or clarify the docstrings (if my understanding is correct here, happy to defer to your expertise @NicolasColombi & @peanutfun).

climada/hazard/centroids/centr.py Outdated Show resolved Hide resolved
climada/hazard/centroids/centr.py Outdated Show resolved Hide resolved
climada/hazard/centroids/test/test_centr.py Outdated Show resolved Hide resolved
centr : Centroids
Centroids to append. The centroids need to have the same CRS.
centr : list
List of Centroids to append. The centroids need to have the same CRS.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding, the code as currently written would work if you call either centr.append(new_centrs1, new_centrs2) or centr.append(*[new_centrs1, new_centrs2]) but would not work calling centr.append([new_centrs1, new_centrs2]), since it wouldn't unpack the list into single centroid objects before the loop for other in centr which checks the CRS.

@NicolasColombi and @peanutfun, should we maybe consider adapting the method so that it would also work when passing a list without *?

@NicolasColombi
Copy link
Collaborator Author

@sarah-hlsn exactly! I personally wouldn't modify it so that it can accept a list.

@sarah-hlsn
Copy link
Collaborator

@sarah-hlsn exactly! I personally wouldn't modify it so that it can accept a list.

Makes perfect sense, we can just adapt the docstrings as you say :)

NicolasColombi and others added 4 commits December 23, 2024 11:10
Update docstrings to clarify the type of argument of union and append
@NicolasColombi
Copy link
Collaborator Author

May I merge ? 👍 @peanutfun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TropCyclone.from_tracks() takes considerably longer in climada 5.0.0
4 participants