Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate head and tail in favor of first and last #1607

Merged
merged 4 commits into from
Dec 1, 2018
Merged

Conversation

nalimilan
Copy link
Member

Now that we have decided that data frames are collections of rows (#1514), there is no point in defining separate head and tail functions, which are a legacy from the R API. Do not define one-argument methods defaulting to n=6 since other first and last methods in Base default to n=1, which should return a single row (as a NamedTuple or DataFrameRow).

Now that we have decided that data frames are collections of rows, there is no point in
defining separate head and tail functions, which are a legacy from the R API.
Do not define one-argument methods defaulting to n=6 since other first and last methods
in Base default to n=1, which should return a single row (as a NamedTuple or DataFrameRow).
@bkamins
Copy link
Member

bkamins commented Nov 24, 2018

There are two points to discuss:

Issue 1. Do we really want to get rid of head and tail. I will not oppose removing them, but how much benefit do we get by removing them (probably there is some global logic behind it - similarly to one behind wanting to remove nrow and ncol - was it discussed earlier and what are the main arguments for such an approach).

Issue 2. Actually if we want to do this change we should also define first(::AbstractDataFrame) and last(::AbstractDataFrame) as now the former throws an error and the latter returns last column of the data frame which is confusing.

EDIT especially as in Base first and last take the second argument only for strings and only because it is the only convenient way to get first few characters in a string (because of byte indexing). The confusing thing is that first and last drop a dimension while first(x, n) and last(x, n) do not drop it.

@pdeffebach
Copy link
Contributor

pdeffebach commented Nov 25, 2018

first and last imply iteration to me. df[1] would have to return the first row.

Also note that in R, head is useful because unlike the default show method, it prints all column names and types. Although this is usually illegible. We don't do that now, but maybe head could be show(df, allcols = true) to get more similar behavior.

In stata when you first get a dataset you write summarize, which is more akin to describe. Which is probably preferred imo since you learn more about a dataset. We should tell people to do that instead when they read in data for the first time.

But removing head would mean constantly reminding people that head doesn't exist. There's not much harm in keeping it.

@nalimilan
Copy link
Member Author

head and tail don't really harm, except that they work only for data frames, while Julia already has first and last to get the first/last elements in a collection. I expect the two-argument form to be supported for vectors in the future -- we can open an issue in Julia to discuss that before merging this PR if you want. Also, Base.tail already exists and does something different, and it could be exported at some point.

Good catch about last(df) returning the last column. Better define first(df) and last(df) to return the first/last row, using df[i, :]. I've added a commit to do that. I don't think the inconsistency with df[i] is a problem: df[i] is the inconsistent syntax since we have decided data frames are collections of rows.

But removing head would mean constantly reminding people that head doesn't exist. There's not much harm in keeping it.

That's not generally an argument we have considered as legitimate on its own, as consistency with other languages would make Julia inconsistent with itself. When several options are mostly equivalent, taking the one which is consistent with other languages is a good idea, but when something already exists in Julia duplicating features just to help users of language X makes the API bigger, and what happens when somebody coming from language Y wants to add another name for that function?

@bkamins
Copy link
Member

bkamins commented Nov 25, 2018

I don't think the inconsistency with df[i] is a problem

agreed

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments that I would prefer being resolved before accepting. Also when you make a question/PR to base regarding first/last with n parameter could you please add a reference here so I can track it?

**Arguments**
Get the first row of `df` as a `DataFrameRow`.
"""
Base.first(df::AbstractDataFrame) = df[1, :]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will throw an error on 0-row data frame. Maybe some better error message would be in place. Note that first(df, 1) will work on 0-row data frame. Same with last.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've followed what happens with vectors. I think that makes sense. The error message isn't too bad, it indicates you tried to access row 1 in a 0-row data frame.

**Result**
Get a data frame with the `n` first rows of `df`.
"""
Base.first(df::AbstractDataFrame, n::Integer) = df[1:min(n,nrow(df)), :]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be better to be a bit breaking and return a SubDataFrame? This would be consistent with first(df) which also returns a view now. I do not have very strong preference here over DataFrame so @nalimilan - please decide.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. first(df) returns a view only because df[1, :] does, so I'd say it's consistent to also use df[1:i, :] here, and therefore return a copy.

test/dataframe.jl Outdated Show resolved Hide resolved
@pdeffebach
Copy link
Contributor

That's not generally an argument we have considered as legitimate on its own, as consistency with other languages would make Julia inconsistent with itself.

Yeah that's a good point. It's important not to keep too many legacy APIs for sure. It's also worth nothing that i've never found head useful, and would not miss it.

@bkamins
Copy link
Member

bkamins commented Dec 1, 2018

I think we can merge this

@nalimilan nalimilan merged commit 5151255 into master Dec 1, 2018
@nalimilan nalimilan deleted the nl/firstlast branch December 1, 2018 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants