Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supertype of AbstractDataFrame and DataFrameRow #1337

Closed
bkamins opened this issue Dec 31, 2017 · 14 comments
Closed

Supertype of AbstractDataFrame and DataFrameRow #1337

bkamins opened this issue Dec 31, 2017 · 14 comments

Comments

@bkamins
Copy link
Member

bkamins commented Dec 31, 2017

I have also even a question (maybe it was discussed somewhere, but I think the decision could influence the design we are discussing in #1335 so I raise it now):

  • why AbstractDataFrame is not a subtype of AbstractMatrix{Any}?
  • why DataFrameRow is not a subtype of AbstractDataFrame?

In this way many functions that work on AbstractMatrix would work on DataFrames for free.
Additionally if some broadcast related features were implemented like in NamedArrays the result of such operations could remain a DataFrame if that were sensible.
I do not see any negative side effects (but I might be missing something).

The benefit is that now when the user knows that some columns in DataFrame are homogeneous then Array(df[rows, cols]) conversion has to be run to be able to perform the desired operations.
Of course in performance sensitive code it will still be required to do so, because conversion will infer the type of the [rows, cols] section of the DataFrame, but in many cases user wants a simple transformation and performance is not an issue (see eg. https://stackoverflow.com/questions/48037732/how-to-save-julia-for-loop-returns-in-an-array-or-dataframe).

@ararslan
Copy link
Member

I don't know the original motivations, but I can think of a few reasons why AbstractDataFrame shouldn't necessarily subtype AbstractMatrix:

  • We may not want to support all operations defined for matrices, in particular linear algebraic operations. For example, df1'df2 doesn't really make sense.
  • Depending on your point of view, a DataFrame isn't necessarily a 2-dimensional object; one could argue that each column represents another axis, which would make it an N-dimensional array rather than a matrix. (Indeed, the original name for JuliaDB was NDSparse.)

As for why DataFrameRow isn't an AbstractDataFrame. However, should #1335 come to fruition, I think we should be able to do away with DataFrameRow entirely and represent rows as NamedTuples.

@bkamins
Copy link
Member Author

bkamins commented Dec 31, 2017

As for AbstractDataFrame <: AbstractMatrix{Any} I see those points, but e.g. transposition of data frame is a supported operation in other frameworks. Also I do not see a strong reason not to allow multiplication (R defines broadcasted addition and multiplication on data frames). If the operation does make sense it will fail anyway.

I am not very strongly for this change, but I want to put it under the discussion as this is something that e.g. R users would find convenient (of course it is not a definitive reason to allow it - we do not have to replicate R in DataFrames).

In general it boils down to what you have pointed out - do we treat DataFrame as a heterogeneous matrix or sparse N-dimensional array.

As for DataFrameRow - such a NamedTuple would have to consist of 1-element views into columns of an original DataFrame (this is needed as contract of DataFrameRow is that it is a view into a row of the original DataFrame allowing to mutate its contents) and thus would be expensive to create I guess.
And if I understand the original intention behind DataFrameRow is that it were to be maximally lightweight data structure (containing only a reference to original DataFrame and row number).

@nalimilan
Copy link
Member

I think this has been discussed a lot before and the decision to make Julia data frames completely distinct from matrices has been a very conscious one (you should be able to find old issues about it). It's indeed surprising for R users, but in general R data frames are quite awkward to work with, and redesigns like dplyr adopt a completely different approach closer to databases. Anyway it's not a good idea to provide people with an apparently convenient syntax which would be very slow due to type instabilities: that would just be a trap.

Regarding DataFrameRow, I also think that it should be replaced with NamedTuple. Since NamedTuple is immutable, it should be completely free to create (no actual allocation), so that in practice creating a NamedTuple, modifying it and replacing the contents of a row with it should be equivalent to doing the same operation with a DataFrameRow (at least with a type-stable data frame).

@bkamins
Copy link
Member Author

bkamins commented Jan 1, 2018

Thanks for the explanation. Regarding DataFrameRow - do I understand you correctly that such code would become invalid after the change:

x = DataFrame(a=1:10, b=1:10)

for row in eachrow(x)
    row[:a] += row[:b]
end

and now x[:a] is doubled?

@ararslan
Copy link
Member

ararslan commented Jan 1, 2018

That's a very good point. You're correct, if eachrow iterated NamedTuples, then row would be immutable and row[:a] += row[:b] would be an error. That's rather unfortunate...

@nalimilan
Copy link
Member

Ah, right. Maybe we need eachrow to keep returning a DataFrameRow then. It's an interesting design case, and there doesn't seem to be many situations in which one may need to mutate the elements obtained by iterating over a collection. That's the case for matrices, where it makes sense to return a row or column SubArray rather than a copy (see proposals at JuliaLang/julia#14491), but that's also for performance reasons to avoid allocations (which doesn't apply to tuples).

JuliaLang/julia#21912 is related, since it would provide a convenient syntax to create a new NamedTuple as if you had mutated the original one. Then of course the tuple would have to be assigned back to the DataFrame, which is annoying. We would have to provide another syntax for that, such as:

byrow!(df) do row
    row@a += row[:b]
    row
end

But clearly that's not ideal. We could also add this byrow! function and make it use DataFrameRow objects, but change eachrow to use NamedTuple. I guess that depends on how strongly we want to push the view of data frames as collections of NamedTuple. There's also the solution of providing multiple iterator functions, e.g. eachrow and rows with different behaviors.

@bkamins
Copy link
Member Author

bkamins commented Jan 2, 2018

I am OK to change DataFrameRow to NamedTuple - just want to make sure that we understand the consequences.

The reasons I am mostly indifferent is that in general I do not think that eachrow is very useful - as of today it minimally saves typing and you have to remember that you work on a different type than AbstractDataFrame.

The difference is between:

for row in eachrow(df)
    do_something_with(row[:column_name])
end

and

for row in 1:nrow(df)
    do_something_with(df[row, :column_name])
end

Additionally working on DataFrameRow is slow, but it will be equally slow with NamedTuple implementation in most cases (in order to get significant performance benefits this NamedTuple would have to be passed to some internal barrier function as eltype(eachrow(::DataFrame)) will not be inferred by the compiler anyway).

@nalimilan
Copy link
Member

Yes, it looks like we should provide functions taking an anonymous function which would be applied to each row, as that's the only way to get type stability. That's why I recently added filter (#1330). Maybe we should implement map and map!, operating row by row.

DataFrameRow could be useful if we provide a type like TypedDataFrame and advise using it for functions using explicit loops, but it's not clear yet that we want to do that, and anyway for many situations people won't write functions, and will run code in the global scope.

@bkamins
Copy link
Member Author

bkamins commented Jan 2, 2018

I understand that TypeDataFrame would render CompositeDataFrame from DataFramesMeta obsolete - yes?

@nalimilan
Copy link
Member

I suspect it would, but I haven't looked at it in detail yet. It would be interesting to see how it's implemented.

@davidanthoff
Copy link
Contributor

One could also think about a type

struct RowView{T} where {T<:TypedDataFrame}
  df::T
  row::Int
end

that overrides getproperty and setproperty. That should enable things like

for r in rowviews(df)
  println(r.colB)
  r.colA = 34
end

@nalimilan
Copy link
Member

Yes, that's essentially DataFrameRow but for a TypedDataFrame.

@bkamins
Copy link
Member Author

bkamins commented Jan 2, 2018

Yes, the only issue is that with current proposal in #1335 df would be normally a DataFrame not a TypedDataFrame, which means that rowviews(df) would not have type stable return value when df::DataFrame.

@bkamins
Copy link
Member Author

bkamins commented Dec 14, 2018

I am closing it as now it is more or less settled that we keep DataFrameRow and it is distinct than DataFrame.

@bkamins bkamins closed this as completed Dec 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants