Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CodeUnits input broken with CSV.jl 0.9 #894

Closed
knuesel opened this issue Sep 10, 2021 · 4 comments · Fixed by #905
Closed

CodeUnits input broken with CSV.jl 0.9 #894

knuesel opened this issue Sep 10, 2021 · 4 comments · Fixed by #905

Comments

@knuesel
Copy link

knuesel commented Sep 10, 2021

The following used to work in CSV.jl 0.8:

julia> CSV.File(codeunits("A,B\n1,2\n"), header=["x", "y"], skipto=2)
1-element CSV.File{false}:
 CSV.Row: (x = 1, y = 2)

with 0.9.1 I get

ERROR: MethodError: no method matching CSV.File(::Base.CodeUnits{UInt8, String}; header=["x", "y"], skipto=2)
Closest candidates are:
  CSV.File(::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any) at /home/ksj1/.julia/packages/CSV/22psr/src/file.jl:105 got unsupported keyword arguments "header", "skipto"
  CSV.File(::CSV.Context) at /home/ksj1/.julia/packages/CSV/22psr/src/file.jl:222 got unsupported keyword arguments "header", "skipto"
  CSV.File(::CSV.Context, ::Bool) at /home/ksj1/.julia/packages/CSV/22psr/src/file.jl:222 got unsupported keyword arguments "header", "skipto"
  ...

(It works when replacing codeunits with IOBuffer.)

Maybe CodeUnits as input is not really supported? The documentation for input mentions support for Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} but not AbstractVector{UInt8}. On the other hand, the documentation for CSV.Rows mentions support for

AbstractVector{UInt8} like a byte buffer or codeunits(string)

@quinnj
Copy link
Member

quinnj commented Sep 15, 2021

Yeah, sorry for the change/hassle here. For context, this happened to work previously, but wasn't officially supported, and actually what was happening, I believe, is that a copy of the input CodeUnits was being made to convert it to a Vector{UInt8}.

We can maybe still support this; one idea is to call unsafe_wrap(Array, pointer(source.s), length(source)), which would at least work as long as the string underneath the CodeUnits type supported pointer (which most do that I'm aware of). I'd only be a little worried because we're going around the GC's back there; so the user would be in charge of keeping the original string "reachable", otherwise our input buffer could get stolen from underneath us. We do already keep a reference to any temporary file that is generated when automatically decompressing gzip files, so maybe we could do something similar where we'll keep our own reference to the input CodeUnits to ensure it stays valid for the life of CSV.File at least. Hmm, but even then, if someone requested stringtype=PosLenString, then those wouldn't be guaranteed safe. We'd at least have to check if the user passed a CodeUnits as the input and stringtype=PosLenString and then throw an argument error that passing both isn't allowed.

quinnj added a commit that referenced this issue Sep 15, 2021
Fixes #894. I believe this used to work because we made a copy of the
input `CodeUnits` object, which is fine, because we got a
`Vector{UInt8}` out of it, but not ideal since we made a copy. After
some research, I found that when you call `IOBuffer(str)`, it uses this
nifty little trick of calling `unsafe_wrap(Vector{UInt8}, str)` which is
then passed to `IOBuffer` as a way to convert a string to a
`Vector{UInt8}` without copying. We can utilize the same trick to
efficiently treat a string as a `Vector{UInt8}` while the Julia
internals takes care of tracking the true owner of the data as the
original string for us (thus avoiding any GC issues if we were to
naively call `unsafe_wrap(Array, pointer(str))` ourselves).
@quinnj
Copy link
Member

quinnj commented Sep 15, 2021

Ok, I think I figured out a clean way for us to support this: #905

@knuesel
Copy link
Author

knuesel commented Sep 15, 2021

Thanks for addressing this! Note that for me it's fine to use IOBuffer instead of codeunits. I was just worried this was an unintended breaking change.

If the intention is really to work on a byte stream, maybe it's best to leave codeunits unsupported (and fix the documentation of CSV.Rows)? Because for UTF-16 for example codeunits doesn't return a list of UInt8...

quinnj added a commit that referenced this issue Sep 15, 2021
Fixes #894. I believe this used to work because we made a copy of the
input `CodeUnits` object, which is fine, because we got a
`Vector{UInt8}` out of it, but not ideal since we made a copy. After
some research, I found that when you call `IOBuffer(str)`, it uses this
nifty little trick of calling `unsafe_wrap(Vector{UInt8}, str)` which is
then passed to `IOBuffer` as a way to convert a string to a
`Vector{UInt8}` without copying. We can utilize the same trick to
efficiently treat a string as a `Vector{UInt8}` while the Julia
internals takes care of tracking the true owner of the data as the
original string for us (thus avoiding any GC issues if we were to
naively call `unsafe_wrap(Array, pointer(str))` ourselves).
@quinnj
Copy link
Member

quinnj commented Sep 16, 2021

The support that was added was specifically Base.CodeUnits{UInt8, String}, so it's not any codeunits, but specifically utf8 strings from Base.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants