Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using propertynames on GroupedDataFrame #3443

Closed
matthieugomez opened this issue Jun 19, 2024 · 6 comments · Fixed by #3464
Closed

using propertynames on GroupedDataFrame #3443

matthieugomez opened this issue Jun 19, 2024 · 6 comments · Fixed by #3464
Labels
Milestone

Comments

@matthieugomez
Copy link
Contributor

matthieugomez commented Jun 19, 2024

The documentation discusses the fact that one can use either names and propertynames to get the name of variables in DataFrame, where the first function returns a vector of strings and the second one returns a vector symbols. However, for a GroupedDataFrame, propertynames returns (:parent, :cols, :groups, :idx, :starts, :ends, :ngroups, :keymap, :lazy_lock). I found this confusing and to solve this, I think it would be good to define

Base.propertynames(df::GroupedDataFrame) = propertynames(parent(df))
@matthieugomez matthieugomez changed the title propertynames on GroupedDataFrame should return a vector of symbol using propertynames on GroupedDataFrame Jun 22, 2024
@bkamins bkamins added this to the 2.0 milestone Jul 3, 2024
@bkamins
Copy link
Member

bkamins commented Jul 3, 2024

I understand you have an issue about the inconsistency between names and propertynames.

The current design is based on the following reasoning:

  • the names function does not promise anything, it only promises to return names of columns. Therefore it can be defined on GroupedDataFrame.
  • the propertynames on the other hand has a promise. It promises that if you write gdf.x when :x was returned as a property of gdf object you get a meaningful result. The problem is that it is not clear what gdf.x should return if gdf is a GroupedDataFrame having a column :x. We cannot just return parent(gdf).x as it is not the same. For this reason currently propertynames and getproperty was left untouched from the default.

However, for the future we could discuss changing it provided that we find some meaningful value that would be returned by gdf.x.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Jul 16, 2024

Maybe an alternative would be to have a better suggestion than propertynames to return the column names as symbol? Does Symbol.(names()) always return the right answer? If so, I would suggest to use that syntax in the doc instead of propertynames

@bkamins
Copy link
Member

bkamins commented Aug 3, 2024

you can use propertynames(parent(gdf)). I think this is the preferred pattern (as it falls back to propertynames and gives a right context that gdf is a view of its parent. I understand you would put this suggestion in names docstring?

@bkamins bkamins modified the milestones: 2.0, 1.7 Aug 3, 2024
@bkamins bkamins added doc and removed decision labels Aug 3, 2024
@matthieugomez
Copy link
Contributor Author

matthieugomez commented Aug 10, 2024

Yes, that would improve things.

That being said, I think this syntax is quite complicated to remember (and, tbh, that the name propertynames is a bit weird in the context of DataFrames). Would it be possible to create/suggest an alternative syntax? For instance, could not you recommend people to use Symbol.(names) instead of propertynames? If this gives an incorrect result in some situation (which would actually be nice to flag so that people can avoid bugs), how about mimicking CSV.read by adding an optional argument to the function names, such as names(df, Symbol)?

@bkamins
Copy link
Member

bkamins commented Aug 11, 2024

propertynames is a bit weird in the context of DataFrames

This is not a function introduced by DataFrames.jl. It is a Base function that returns names of properties of an object. And column names as symbols are names of properties of DataFrame.


In general the current design is to use strings as column names as people found it easier to work with. The use of Symbol is a bit of a legacy (indeed, it is a bit faster, but the difference is minimal).

We could add Symbol.(names(df)) suggestion.

@matthieugomez
Copy link
Contributor Author

Great — I think suggesting Symbol.(names(df)) is the best option

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants