-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add at_reg_id_event #642
add at_reg_id_event #642
Conversation
Co-authored-by: Lukas Riedel <[email protected]>
Co-authored-by: Lukas Riedel <[email protected]>
Co-authored-by: Lukas Riedel <[email protected]>
Co-authored-by: Lukas Riedel <[email protected]>
climada/engine/impact.py
Outdated
elif not isinstance(agg_regions, pd.Series): | ||
agg_regions = pd.Series(agg_regions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make it a pandas series instead of a numpy array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for two reasons:
- I find it cleaner as you know what impact belongs to what region simply by looking at the dataframe
- I like the
pd.Series.unique()
method because it retains the order. Doing the same withnumpy
is more cumbersome, to my knowledge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that this is a misuse of a pandas series.
I find it cleaner as you know what impact belongs to what region simply by looking at the dataframe
That is independent on using numpy arrays or pandas series. But using a pandas series when it is not needed is imho actually less clean.
I like the pd.Series.unique() method because it retains the order. Doing the same with numpy is more cumbersome, to my knowledge.
I would argue that this is suboptimal as an important point is now hidden in the subtle internal working of the pandas.series.unique()
vs the np.unique()
methods.
@peanutfun : what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
on the first point you are right: having a
pd.Series
is not strictly needed to generate apd.DataFrame
-
on the second: I am transforming
agg_regions
intopd.Series
only when the user supplies it as annp.array
orlist
. But the user can supply apd.Series
in the first place. So I don't think it's a misuse, I just wrote the code in a way that it works with apd.Series
.
So I think your doubt is rather: should we use pandas
for things that can be done with numpy
? Not sure I have a clear answer to that, as this would probably apply to any code that make use of pandas
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point is that pd.series
are objects that are made to be used in pandas dataframes, but they are not made to be used a standalone objects. They are essentially numpy arrays, with some more. Thus, instead of making an unclear use of the pd.series
, I suggest to use numpy arrays, and to make the ordering requirement clear (as currently this is a hidden feature).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really see how the order is important because the order in the agg_regions
parameter is determined by the "order" of the exposure points. Also, the order of columns in the final data frame does not appear relevant to me 😅
I would actually argue that the unordered result of pandas.Series.unique
does not need further explanation, whereas I feel we should add a note to the docstring that the unique items will be ordered when using np.unique
. So, I think the easiest solution is to just stick to the current implementation.
@chahank: Following our discussion on the "clean" indexing, I just realized that pandas.Series.unique
returns a numpy array, so everything is fine from that perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, the order does not count. I am now using only numpy
arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that this is a misuse of a pandas series.
Not necessarily. According to the docs a Series
is a "one-dimensional ndarray
with axis labels (including time series)". Sounds useful also outside of a DataFrame
.
🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much agree with @emanuel-schmid here 😃
I'm fine with the current implementation. However, please make sure to only call np.unique
once, and store the result for later use. It is currently executed twice and will be costly for large arrays. I can also take care of that when merging. Are we ready to go ahead?
Co-authored-by: Chahan M. Kropf <[email protected]>
I guess we are ready to merge |
Yes, looks good. 😄 I did some cosmetics in the doc string. If your happy with them @aleeciu, let me know or just commit and merge. |
Co-authored-by: Emanuel Schmid <[email protected]>
Oh and one more thing we need to change the CHANGELOG.md |
Will do and handle the merge! 👌 |
This PR addresses #595