Scrub CSV.File types for Tables.Schema (#811)
Fixes #808. The issue here is that we use an internal `PooledString`
type while parsing to signal a column is currently being pooled. Once
we're done parsing, however, there's no value in keeping `PooledString`
around, and indeed, `Tables.rowtable` even gets confused because it's
expecting a `PooledString` object but we always return `String` objects
when indexing string columns, pooled or not. The fix here is to scrub
these `PooledString`-typed columns so that the `Tables.Schema` on
`CSV.File` is correct. Let's see if CI points out any problems with this approach.
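
The mismatch described above can be illustrated with a small standalone sketch. It does not use CSV.jl or Tables.jl: the `PooledString` struct below is a local stand-in for the internal marker type, and `declared_type` and `column` are hypothetical names chosen for illustration only.

```julia
# Local stand-in for the internal marker type; it carries no data.
struct PooledString <: AbstractString end

# Hypothetical toy column: the schema still advertises the parsing-time
# marker type, but the values handed back on indexing are plain Strings.
declared_type = PooledString
column = ["a", "b", "a"]

# The advertised type and the actual element type disagree, which is the
# kind of inconsistency a schema consumer can trip over.
mismatch = !(eltype(column) <: declared_type)
value_is_declared = column[1] isa declared_type
```

Here `mismatch` is `true` and `value_is_declared` is `false`, since a `String` is never a `PooledString`.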
quinnj authored Feb 25, 2021
1 parent 425f2de commit c386238
Showing 2 changed files with 6 additions and 0 deletions.
2 changes: 2 additions & 0 deletions src/file.jl
@@ -331,6 +331,8 @@ function File(h::Header;
        if types[i] === Union{}
            types[i] = Missing
            columns[i] = MissingVector(finalrows)
        elseif schematype(types[i]) !== types[i]
            types[i] = schematype(types[i])
        end
    end
end
4 changes: 4 additions & 0 deletions src/utils.jl
@@ -7,6 +7,10 @@ with the output array type being a `PooledArray`.
"""
struct PooledString <: AbstractString end

schematype(::Type{T}) where {T} = T
schematype(::Type{PooledString}) = String
schematype(::Type{Union{Missing, PooledString}}) = Union{Missing, String}

# PointerString is an internal-only type for efficiently tracking string data + length
# all strings indexed from a column/row will always be a full String
# specifically, it allows avoiding materializing full Strings for pooled string columns while parsing
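
As a rough sketch of how the two hunks fit together, the snippet below repeats the `schematype` methods from the diff and applies them to a toy vector of column types. It is standalone and does not load CSV.jl: `PooledString` is redefined locally as a stand-in, and the `types` vector is made-up example data, not the real parser state.

```julia
# Local stand-in for the internal marker type from src/utils.jl.
struct PooledString <: AbstractString end

# Same three methods as in the diff: identity by default, with the
# pooled marker types mapped to their String equivalents.
schematype(::Type{T}) where {T} = T
schematype(::Type{PooledString}) = String
schematype(::Type{Union{Missing, PooledString}}) = Union{Missing, String}

# Hypothetical column types as they might look right after parsing.
types = Type[Int64, PooledString, Union{Missing, PooledString}, Float64]

# Scrub step, mirroring the new `elseif` branch in src/file.jl:
# only entries whose schema type differs are rewritten.
for i in eachindex(types)
    if schematype(types[i]) !== types[i]
        types[i] = schematype(types[i])
    end
end
```

After the loop, the pooled entries have become `String` and `Union{Missing, String}`, while `Int64` and `Float64` are left untouched by the generic fallback method.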