-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve flatten (slightly breaking) #2767
Comments
As proposed in #2766 we could:
|
I am much more concerned about supporting |
Yeah - that is why I have listed all options to consider 😄. I also think @sprmnt21 - thank you for raising this issue as it really is relevant. |
Maybe I think errors on scalars (for example symbols) are unexpected: since numbers (the most common scalars) work, I would expect that every value is treated either as scalar or as collection, and errors would only occur when there is a "real" issue like We could have either a blacklist ( |
I think we should take no methods for Do we know if this problem came about "naturally" or was it trying to work through the docs? because the post on discourse certainly seems a bit contrived. |
an example of origin, I don't know how "natural", of this situation is the dataframe originating from the following expressions
which was an attempt to resolve this problem but, I also believe, that it is quite artificial and unlikely. |
Good point about taking no I also haven't seen a real-life case... I guess flattening a heterogeneous column is uncommon. So maybe not worth adding a |
Following the hint of Base.broadcastable, which uses
is flattened to three items function flatten_depth1(df::AbstractDataFrame,
cols::Union{ColumnIndex, MultiColumnIndex})
_check_consistency(df)
idxcols = index(df)[cols]
isempty(idxcols) && return copy(df)
col1 = first(idxcols)
# lengths = length.(df[!, col1])
flen(x) = try length(axes(x)[1]) catch; 1 end
lengths = flen.(df[!, col1])
if length(idxcols) > 1
for col in idxcols[2:end]
v = df[!, col]
if any(x -> flen(x[1]) != x[2], zip(v, lengths))
r = findfirst(x -> x != 0, length.(v) .- lengths)
colnames = _names(df)
throw(ArgumentError("Lengths of iterables stored in columns :$(colnames[col1]) " *
"and :$(colnames[col]) are not the same in row $r"))
end
end
sort!(idxcols)
end
new_df = similar(df[!, Not(cols)], sum(lengths))
for name in _names(new_df)
repeat_lengths!(new_df[!, name], df[!, name], lengths)
end
# length(idxcols) > 1 && sort!(idxcols)
for col in idxcols
# col_to_flatten = df[!, col]
# flattened_col = col_to_flatten isa AbstractVector{<:AbstractVector} ?
# reduce(vcat, [i for i in j, j in df[!, col]]) :
# collect(Iterators.flatten(col_to_flatten))
# [["aa", ["bb"]], "cc", ("dd", "ee")] flattened to ["aa", ["bb"], "cc", "dd", "ee"]
flattened_col = mapreduce(vcat, df[!, col], lengths) do i,l
l > 1 ? [j for j in i] : ifelse(isa(i, AbstractVector), i, [i])
end
insertcols!(new_df, col, _names(df)[col] => flattened_col)
end
return new_df
end |
I have thought about it and would add What do you think? CC @nalimilan |
I kind of regret we don't treat strings as scalars, but now that we do I agree that allowing to pass types to treat as scalars via an argument is a reasonable solution. |
Users find it confusing that strings are iterable, see https://discourse.julialang.org/t/flatten-in-case-column-contains-string-and-array/61284/5. They are not iterable in broadcasting. On the other hand we should not follow the full semantics of
Base.broadcastable
as it will then eg. disallowDict
s (this was the original reason why I went for iterator interface).I think the way to go is to create a list of special cases. Here is a collection of standard "suspects that I think we should not flatten by default:
AbstractString
Pair
Markdown.MD
the question is if we want them to be left "as is" or throw an error. If we leave them "as is" then probably we should also consider what other types we want to leave "as is" (they currently error):
Symbol
Missing
Nothing
(I have listed only the most common cases)
The general question in the first place is: do we find it useful to complicate the API of
flatten
. My fist instinct is to throw an error on the cases we do not want to allow being flattened as this will be simpler.The text was updated successfully, but these errors were encountered: