Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isempty(df) should return true if either dimension == 0. #1231

Merged
merged 2 commits into from
Sep 13, 2017

Conversation

rofinn
Copy link
Member

@rofinn rofinn commented Sep 12, 2017

Fixes #1230.

@coveralls
Copy link

coveralls commented Sep 12, 2017

Coverage Status

Coverage remained the same at 86.815% when pulling 50c7917 on rofinn:rf/isempty-fix into 5dce05f on JuliaData:master.

@@ -254,7 +254,7 @@ end

Base.haskey(df::AbstractDataFrame, key::Any) = haskey(index(df), key)
Base.get(df::AbstractDataFrame, key::Any, default::Any) = haskey(df, key) ? df[key] : default
Base.isempty(df::AbstractDataFrame) = ncol(df) == 0
Base.isempty(df::AbstractDataFrame) = size(df, 2) == 0 || size(df, 1) == 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you actually need to check size(df, 2) == 0? If there are no columns, there can't be any rows, so the second condition will hold already (and I think size already has a special case for that which returns 0).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan Currently, I don't think there is a a way to construct a dataframe where there are rows without columns, but I could imagine several ways that assumption might be broken in the future. I personally like that this is very explicit and makes as few assumption about the dataframe. I'm not aware of any special cases for size, is this a special case for Dataframes specifically?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, base julia does this by checking that the length (# of element) equals 0, but that isn't a reasonable option given the debate around what length(df) should return ( #1200).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the definition of nrows, which is used by size:

nrow(df::DataFrame) = ncol(df) > 0 ? length(df.columns[1])::Int : 0

So you don't need to check the number of columns again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, while I agree that the extra col check in isempty currently isn't needed, I'd like to minimize the number of assumptions being made about the structure of the dataframe moving forward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see an argument for reversing the order though if we want to short circuit on nrows? That would currently bypass the extra check while remaining explicit and less dependent on the underlying structure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, the addition of the extra col check seems fairly minimal compared to checking nrows.

julia> @benchmark size($df, 2) == 0
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.684 ns (0.00% GC)
  median time:      6.690 ns (0.00% GC)
  mean time:        6.896 ns (0.00% GC)
  maximum time:     29.170 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark size($df, 1) == 0
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     39.164 ns (0.00% GC)
  median time:      39.681 ns (0.00% GC)
  mean time:        40.402 ns (0.00% GC)
  maximum time:     68.512 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     991

julia> @benchmark size($df, 1) == 0 || size($df, 2) == 0
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     44.666 ns (0.00% GC)
  median time:      47.513 ns (0.00% GC)
  mean time:        48.973 ns (0.00% GC)
  maximum time:     68.798 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are probably going to be inlined and optimized out anyway, I was just thinking in terms of consistency of the code. Honestly, I don't see how in the future a data frame could be allowed to have no columns and yet contain multiple rows, but if you're afraid of that, why not...

@coveralls
Copy link

coveralls commented Sep 13, 2017

Coverage Status

Coverage increased (+0.3%) to 87.117% when pulling 18cdfd8 on rofinn:rf/isempty-fix into 5dce05f on JuliaData:master.

@nalimilan nalimilan merged commit fb72599 into JuliaData:master Sep 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants