Supertype of AbstractDataFrame and DataFrameRow #1337

bkamins · 2017-12-31T21:30:07Z

I also have a question (maybe it was discussed somewhere, but I think the decision could influence the design we are discussing in #1335, so I raise it now):

Why is AbstractDataFrame not a subtype of AbstractMatrix{Any}?
Why is DataFrameRow not a subtype of AbstractDataFrame?

In this way many functions that work on AbstractMatrix would work on DataFrames for free.
Additionally if some broadcast related features were implemented like in NamedArrays the result of such operations could remain a DataFrame if that were sensible.
I do not see any negative side effects (but I might be missing something).

The benefit would be convenience: currently, even when the user knows that some columns in a DataFrame are homogeneous, an Array(df[rows, cols]) conversion has to be run before the desired operations can be performed.
Of course, in performance-sensitive code the conversion will still be required, because it infers the element type of the [rows, cols] section of the DataFrame, but in many cases the user wants a simple transformation and performance is not an issue (see e.g. https://stackoverflow.com/questions/48037732/how-to-save-julia-for-loop-returns-in-an-array-or-dataframe).
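
To make the "for free" point concrete, here is a minimal sketch that uses only Base Julia. TableMatrix is a hypothetical stand-in type, not part of DataFrames: it stores heterogeneous columns and subtypes AbstractMatrix{Any}, so generic matrix functions work once size and getindex are defined.

```julia
# Hypothetical stand-in for a data frame that subtypes AbstractMatrix{Any}.
struct TableMatrix <: AbstractMatrix{Any}
    columns::Vector{AbstractVector}
end

# The two methods required by the AbstractArray interface for a read-only array.
Base.size(t::TableMatrix) = (length(first(t.columns)), length(t.columns))
Base.getindex(t::TableMatrix, i::Int, j::Int) = t.columns[j][i]

t = TableMatrix(AbstractVector[[1, 2, 3], [4.0, 5.0, 6.0]])

total = sum(t)             # generic AbstractArray fallback
flipped = permutedims(t)   # a 2x3 Matrix{Any}
```

The element type is Any, so generic code cannot specialize on the column types. That is the convenience and performance trade-off described in this post.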


ararslan · 2017-12-31T22:02:43Z

I don't know the original motivations, but I can think of a few reasons why AbstractDataFrame shouldn't necessarily subtype AbstractMatrix:

We may not want to support all operations defined for matrices, in particular linear algebraic operations. For example, df1'df2 doesn't really make sense.
Depending on your point of view, a DataFrame isn't necessarily a 2-dimensional object; one could argue that each column represents another axis, which would make it an N-dimensional array rather than a matrix. (Indeed, the original name for JuliaDB was NDSparse.)

As for why DataFrameRow isn't an AbstractDataFrame, I don't know. However, should #1335 come to fruition, I think we should be able to do away with DataFrameRow entirely and represent rows as NamedTuples.

bkamins · 2017-12-31T22:31:26Z

As for AbstractDataFrame <: AbstractMatrix{Any}, I see those points, but e.g. transposition of a data frame is a supported operation in other frameworks. Also, I do not see a strong reason not to allow multiplication (R defines broadcasted addition and multiplication on data frames). If the operation does not make sense, it will fail anyway.

I am not strongly in favor of this change, but I want to put it up for discussion, as this is something that e.g. R users would find convenient (of course that is not a definitive reason to allow it; we do not have to replicate R in DataFrames).

In general it boils down to what you have pointed out - do we treat DataFrame as a heterogeneous matrix or sparse N-dimensional array.

As for DataFrameRow: such a NamedTuple would have to consist of 1-element views into the columns of the original DataFrame (this is needed because the contract of DataFrameRow is that it is a view into a row of the original DataFrame that allows mutating its contents), and thus would be expensive to create, I guess.
And if I understand correctly, the original intention behind DataFrameRow was for it to be a maximally lightweight data structure (containing only a reference to the original DataFrame and a row number).

nalimilan · 2018-01-01T21:56:57Z

I think this has been discussed a lot before and the decision to make Julia data frames completely distinct from matrices has been a very conscious one (you should be able to find old issues about it). It's indeed surprising for R users, but in general R data frames are quite awkward to work with, and redesigns like dplyr adopt a completely different approach closer to databases. Anyway it's not a good idea to provide people with an apparently convenient syntax which would be very slow due to type instabilities: that would just be a trap.

Regarding DataFrameRow, I also think that it should be replaced with NamedTuple. Since NamedTuple is immutable, it should be completely free to create (no actual allocation), so that in practice creating a NamedTuple, modifying it and replacing the contents of a row with it should be equivalent to doing the same operation with a DataFrameRow (at least with a type-stable data frame).

bkamins · 2018-01-01T23:46:01Z

Thanks for the explanation. Regarding DataFrameRow - do I understand you correctly that such code would become invalid after the change:

x = DataFrame(a=1:10, b=1:10)

for row in eachrow(x)
    row[:a] += row[:b]
end

and now x[:a] is doubled?

ararslan · 2018-01-01T23:56:08Z

That's a very good point. You're correct, if eachrow iterated NamedTuples, then row would be immutable and row[:a] += row[:b] would be an error. That's rather unfortunate...
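
A minimal Base-only sketch of the problem, with a plain NamedTuple standing in for one row (the field names follow the example above): in-place assignment throws, and the usual alternative is to build an updated copy with merge.

```julia
row = (a = 1, b = 1)   # stands in for one row of x

# NamedTuples are immutable, so assigning to a field throws an error.
mutation_failed = try
    row.a = row.a + row.b
    false
catch
    true
end

# merge returns a new NamedTuple; the original row is left untouched.
updated = merge(row, (a = row.a + row.b,))
```

The updated copy would still have to be written back to the data frame, which the loop over DataFrameRow objects does implicitly.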

nalimilan · 2018-01-02T11:15:19Z

Ah, right. Maybe we need eachrow to keep returning a DataFrameRow then. It's an interesting design case, and there don't seem to be many situations in which one may need to mutate the elements obtained by iterating over a collection. That's the case for matrices, where it makes sense to return a row or column SubArray rather than a copy (see proposals at JuliaLang/julia#14491), but that's also for performance reasons, to avoid allocations (which doesn't apply to tuples).

JuliaLang/julia#21912 is related, since it would provide a convenient syntax to create a new NamedTuple as if you had mutated the original one. Then of course the tuple would have to be assigned back to the DataFrame, which is annoying. We would have to provide another syntax for that, such as:

byrow!(df) do row
    row@a += row[:b]
    row
end

But clearly that's not ideal. We could also add this byrow! function and make it use DataFrameRow objects, but change eachrow to use NamedTuple. I guess that depends on how strongly we want to push the view of data frames as collections of NamedTuple. There's also the solution of providing multiple iterator functions, e.g. eachrow and rows with different behaviors.
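
For illustration only, the write-back pattern can be sketched in Base Julia with a NamedTuple of column vectors standing in for the data frame. This byrow! is a hypothetical helper written for this sketch, not an existing DataFrames function, and it uses merge instead of the proposed row@a syntax.

```julia
# Hypothetical helper: apply f to each row (as a NamedTuple of scalars)
# and write the returned NamedTuple back into the columns.
function byrow!(f, columns::NamedTuple)
    for i in 1:length(first(columns))
        row = map(col -> col[i], columns)   # NamedTuple holding row i
        newrow = f(row)
        for name in keys(columns)
            columns[name][i] = newrow[name]
        end
    end
    return columns
end

cols = (a = collect(1:10), b = collect(1:10))

byrow!(cols) do row
    merge(row, (a = row.a + row.b,))   # "modified" copy of the row
end
```

After the call, cols.a holds the doubled values while cols.b is unchanged. The function argument also acts as a barrier, so f is compiled for the concrete row type.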

bkamins · 2018-01-02T16:30:56Z

I am OK to change DataFrameRow to NamedTuple - just want to make sure that we understand the consequences.

The reason I am mostly indifferent is that in general I do not think that eachrow is very useful: as of today it minimally saves typing, and you have to remember that you are working on a different type than AbstractDataFrame.

The difference is between:

for row in eachrow(df)
    do_something_with(row[:column_name])
end

and

for row in 1:nrow(df)
    do_something_with(df[row, :column_name])
end

Additionally, working on DataFrameRow is slow, but it will be equally slow with a NamedTuple implementation in most cases (in order to get significant performance benefits this NamedTuple would have to be passed to some internal barrier function, as eltype(eachrow(::DataFrame)) will not be inferred by the compiler anyway).
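
A minimal sketch of the barrier-function idea in Base Julia, assuming the columns live in a container whose element type the compiler cannot infer (a Vector{Any} here, standing in for a DataFrame). The names columns and sum_products are illustrative only.

```julia
# Columns stored in an untyped container: the compiler cannot infer
# the type of columns[1] or columns[2] at the call site.
columns = Any[collect(1:10), collect(1.0:10.0)]

# Barrier function: dispatch happens once per call, and the loop body
# is compiled for the concrete vector types that were passed in.
function sum_products(a::AbstractVector, b::AbstractVector)
    total = zero(promote_type(eltype(a), eltype(b)))
    for i in eachindex(a, b)
        total += a[i] * b[i]
    end
    return total
end

result = sum_products(columns[1], columns[2])
```

Only the single call to sum_products is dynamically dispatched; the loop inside it is type stable.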

nalimilan · 2018-01-02T16:36:05Z

Yes, it looks like we should provide functions taking an anonymous function which would be applied to each row, as that's the only way to get type stability. That's why I recently added filter (#1330). Maybe we should implement map and map!, operating row by row.

DataFrameRow could be useful if we provide a type like TypedDataFrame and advise using it for functions using explicit loops, but it's not clear yet that we want to do that, and anyway for many situations people won't write functions, and will run code in the global scope.

bkamins · 2018-01-02T16:41:24Z

I understand that TypedDataFrame would render CompositeDataFrame from DataFramesMeta obsolete - yes?

nalimilan · 2018-01-02T16:48:42Z

I suspect it would, but I haven't looked at it in detail yet. It would be interesting to see how it's implemented.

davidanthoff · 2018-01-02T17:27:59Z

One could also think about a type

struct RowView{T<:TypedDataFrame}
  df::T
  row::Int
end

that overrides getproperty and setproperty!. That should enable things like

for r in rowviews(df)
  println(r.colB)
  r.colA = 34
end
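
A self-contained sketch of this idea in Base Julia. Since TypedDataFrame does not exist, a NamedTuple of column vectors is assumed as a stand-in; rowviews is a hypothetical helper, and the column names are taken from the loop above.

```julia
# Row view over a NamedTuple of column vectors (stand-in for TypedDataFrame).
struct RowView{T<:NamedTuple}
    columns::T
    row::Int
end

# getfield is needed inside these methods because getproperty is overridden.
function Base.getproperty(r::RowView, name::Symbol)
    return getfield(r, :columns)[name][getfield(r, :row)]
end

function Base.setproperty!(r::RowView, name::Symbol, value)
    getfield(r, :columns)[name][getfield(r, :row)] = value
    return value
end

# Hypothetical iterator yielding one RowView per row.
rowviews(columns::NamedTuple) =
    (RowView(columns, i) for i in 1:length(first(columns)))

cols = (colA = [1, 2, 3], colB = [10.0, 20.0, 30.0])

for r in rowviews(cols)
    r.colA = 34   # writes through to the underlying column
end
```

Because the column types are part of the RowView type parameter, r.colB can be inferred when the column name is a compile-time constant, which is what makes a typed container attractive here.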

nalimilan · 2018-01-02T17:32:59Z

Yes, that's essentially DataFrameRow but for a TypedDataFrame.

bkamins · 2018-01-02T20:48:55Z

Yes, the only issue is that with the current proposal in #1335, df would normally be a DataFrame, not a TypedDataFrame, which means that rowviews(df) would not have a type-stable return value when df::DataFrame.

bkamins · 2018-12-14T09:52:13Z

I am closing this, as it is now more or less settled that we keep DataFrameRow and that it is distinct from DataFrame.

nalimilan mentioned this issue Jan 2, 2018

Base DataFrame on NamedTuple #1335

Closed

nalimilan mentioned this issue Apr 19, 2018

Scalar indexing by row should return a DataFrameRow #1400

Closed

bkamins closed this as completed Dec 14, 2018
