Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with Tables.rowtable when entries contain a vector of strings #167

Closed
ericphanson opened this issue Apr 6, 2021 · 8 comments · Fixed by #182
Closed

Issue with Tables.rowtable when entries contain a vector of strings #167

ericphanson opened this issue Apr 6, 2021 · 8 comments · Fixed by #182

Comments

@ericphanson
Copy link
Member

with Arrow.jl 1.3,

julia> using Arrow, Tables

julia> table = (; col1 = [["a"], ["a"]])
(col1 = [["a"], ["a"]],)

julia> Arrow.write("test.arrow", table)
"test.arrow"

julia> table2 = Arrow.Table("test.arrow")
Arrow.Table: (col1 = [["a"], ["a"]],)

julia> Tables.rowtable(table2)
ERROR: TypeError: in new, expected Vector{String}, got a value of type SubArray{String, 1, Arrow.List{String, Int32, Vector{UInt8}}, Tuple{UnitRange{Int64}}, true}
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:46 [inlined]
 [2] _iterate(rows::Tables.NamedTupleIterator{Tables.Schema{(:col1,), Tuple{Vector{String}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}}, st::Tuple{})
   @ Tables ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:37
 [3] iterate
   @ ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:58 [inlined]
 [4] iterate
   @ ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:57 [inlined]
 [5] copyto!(dest::Vector{NamedTuple{(:col1,), Tuple{Vector{String}}}}, src::Tables.NamedTupleIterator{Tables.Schema{(:col1,), Tuple{Vector{String}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
   @ Base ./abstractarray.jl:843
 [6] _collect
   @ ./array.jl:608 [inlined]
 [7] collect(itr::Tables.NamedTupleIterator{Tables.Schema{(:col1,), Tuple{Vector{String}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
   @ Base ./array.jl:602
 [8] rowtable(itr::Arrow.Table)
   @ Tables ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:99
 [9] top-level scope
   @ REPL[27]:1
@quinnj
Copy link
Member

quinnj commented Apr 7, 2021

Ah yes, I thought this would probably end up biting us at some point. We're kind of lying with our schema for the table vs. what we actually return. To avoid allocating a full array on each getindex, it currently just constructs a view into the underlying arrow data. We'll have to think of what the best approach is going forward.

@ericphanson
Copy link
Member Author

If I'm understanding your comment right, that's not something new, right? But my example (and the non-minimal one too) works fine on Arrow 1.2, so it seems like a new issue here.

@quinnj
Copy link
Member

quinnj commented Apr 14, 2021

Yeah, I think we maybe used to allocate the full array, and it was changed to return a view; the code is here if you want to take a look/play around. I would approach this by first checking if we should just change the Tables.schema that is returned for Arrow.Table; if the schema matched what getindex produced, Tables.rowtable wouldn't get angry at us. Secondly/alternatively, I'd look at changing getindex: either just allocating the full Vector when indexing, or maybe trying to do something more efficient like returning an Arrow.Primitive array for bitstypes, and Arrow.List for strings.

@ericphanson
Copy link
Member Author

Ok, thanks for the pointer!

Just as a note to myself / other Beacon users, this also hits Onda.Samples (ref beacon-biosignals/Onda.jl#68):

julia> segments_table = DataFrame(Arrow.Table("blah.arrow"))
ERROR: MethodError: Cannot `convert` an object of type 
  Onda.Samples{Base.ReshapedArray{Float64, 2, SubArray{Float64, 1, Arrow.Primitive{Float64, Vector{Float64}}, Tuple{UnitRange{Int64}}, true}, Tuple{}},Onda.SamplesInfo{String, Vector{String}, String, Float64, Float64, Int16, Float64}} to an object of type 
  Onda.Samples{Matrix{Float64},Onda.SamplesInfo{String, Vector{String}, String, Float64, Float64, var"#s157", Float64} where var"#s157"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}}
Closest candidates are:
  convert(::Type{T}, ::T) where T at essentials.jl:205

@quinnj
Copy link
Member

quinnj commented Apr 22, 2021

So the more I've thought about it, the more I think we should change the getindex method on Arrow.List to just return a full Vector instead of a view. The user originally saved a vector and they probably want a full vector out, so we should respect that. We can potentially keep the view there and just make sure that it gets converted to a full Vector on the StructTypes.construct call (I'm pretty sure we have a method right now that passes a view through as a full vector). I can get to this pretty soon, unless someone else wants to take a stab at it.

@quinnj
Copy link
Member

quinnj commented Apr 22, 2021

Ok, haven't added tests yet, but I think this is all that's needed. @ericphanson, would you mind checking your use-case? If you're so inclined, provide a test I can add to my PR?

#182

quinnj added a commit that referenced this issue Apr 23, 2021
quinnj added a commit that referenced this issue Apr 23, 2021
* Ensure requested List type is requested on List getindex

Fixes #167. Not tested yet.

* add test
@ericphanson
Copy link
Member Author

Thanks! I can confirm with Arrow#main my original use-case works as does the Onda.Samples one. I think the MWE from my original post could serve as a test-case too, although it looks like you added one to the PR already. Let me know if you'd like any others and I can try to make one.

@ericphanson
Copy link
Member Author

I was trying to come up with a MWE for #199 and came across

julia> col = [(; b=1), missing]
2-element Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}:
 (b = 1,)
 missing

julia> table = [(;a = col)]
1-element Vector{NamedTuple{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}}:
 (a = [(b = 1,), missing],)

julia> Tables.rowtable(Arrow.Table(Arrow.tobuffer(table)))
ERROR: TypeError: in new, expected Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}, got a value of type SubArray{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}, 1, Arrow.Struct{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}, Tuple{Arrow.Primitive{Int64, Vector{Int64}}}}, Tuple{UnitRange{Int64}}, true}
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:46 [inlined]
 [2] _iterate(rows::Tables.NamedTupleIterator{Tables.Schema{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}}, st::Tuple{})
   @ Tables ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:37
 [3] iterate
   @ ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:58 [inlined]
 [4] iterate
   @ ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:57 [inlined]
 [5] copyto!(dest::Vector{NamedTuple{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}}, src::Tables.NamedTupleIterator{Tables.Schema{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
   @ Base ./abstractarray.jl:843
 [6] _collect
   @ ./array.jl:608 [inlined]
 [7] collect(itr::Tables.NamedTupleIterator{Tables.Schema{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
   @ Base ./array.jl:602
 [8] rowtable(itr::Arrow.Table)
   @ Tables ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:99
 [9] top-level scope
   @ REPL[41]:1

Here, I didn't mean to do table = [(; a= col)] and had meant to do just table = (; a = col) (which does work). However, I think the version that errors here should still work?

I thought it kind of looked like this issue again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants