optimization: use sendfile in create and extract #33

Open
StefanKarpinski opened this issue Apr 29, 2020 · 6 comments

Comments

@StefanKarpinski
Member

It would be faster to use sendfile or equivalent for the data transfer part of tarball creation and extraction instead of a user-space buffered read/write loop. Relevant code that should be optimized:

https://github.com/JuliaIO/Tar.jl/blob/b8bd833254b48428f1ce0bf4/src/create.jl#L225-L231
https://github.com/JuliaIO/Tar.jl/blob/b8bd833254b48428f1ce0bf4/src/extract.jl#L288-L293
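To make the contrast concrete, here is a rough sketch in Python (not Tar.jl's code) of the two approaches: a user-space read/write loop versus letting the kernel move the bytes with `sendfile`. The function names are made up for illustration, and the `sendfile` variant assumes Linux, where the destination may be a regular file; on other platforms the call may be unavailable or restricted to sockets.

```python
import os

def copy_buffered(src_path, dst_path, buf_size=512):
    """User-space loop: every chunk is copied into a bytes object and back out."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(buf_size)
            if not chunk:
                break
            dst.write(chunk)

def copy_sendfile(src_path, dst_path):
    """Kernel-side copy: the data never enters this process's buffers.

    Assumes a platform (e.g. Linux) where sendfile accepts a regular
    file as the destination descriptor.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        while remaining > 0:
            # sendfile may transfer fewer bytes than requested, so loop.
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, remaining)
            if sent == 0:
                break
            offset += sent
            remaining -= sent
```

Both functions produce the same output file; the difference is only in how many times the data crosses the kernel/user boundary.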

@Keno
Member

Keno commented May 1, 2020

Hah, I was complaining about this to Elliot a few days ago, since we're using Tar.jl for the rr traces and the tar'ing up step is too slow. For some numbers, on my benchmark Tar.jl uses 60% of one core in addition to 100% of gzip. Regular tar uses about 6%, which probably suggests that whatever buffer size Tar.jl currently uses is too small. As you mentioned, the correct thing to do is to splice the file directly into the output I/O stream to have Tar.jl CPU utilization be 0.

@Keno
Member

Keno commented May 1, 2020

And indeed using a faster compressor, like zstd, this becomes a bottleneck with Tar.jl taking 28s vs about 2s for regular tar.

@StefanKarpinski
Member Author

Using a bigger buffer would be pretty easy—currently it's 512 bytes, which is very small. But it feels very unnecessary to use a buffer here at all. Do we have an API that exposes sendfile? The other issue is when Tar.jl is used with TranscodingStreams and CodecZlib, in which case the destination (for create) or source (for extract) is not a real file handle anyway and what we'd want ideally is a way to have TranscodingStreams send the data directly to the output stream.
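For a sense of scale, the following Python sketch (illustrative only, not Tar.jl's code) counts how many read/write iterations a copy loop needs for a given buffer size. The 64 KiB figure is an arbitrary comparison point chosen here, not a value taken from Tar.jl.

```python
import io

def count_copy_iterations(total_bytes, buf_size):
    """Copy total_bytes through a buffer of buf_size and count loop iterations."""
    src = io.BytesIO(bytes(total_bytes))
    dst = io.BytesIO()
    iterations = 0
    while True:
        chunk = src.read(buf_size)
        if not chunk:
            break
        dst.write(chunk)
        iterations += 1
    return iterations

one_mib = 1 << 20
print(count_copy_iterations(one_mib, 512))        # 2048 iterations
print(count_copy_iterations(one_mib, 64 * 1024))  # 16 iterations
```

With real files each iteration costs at least one read and one write call, so a 512-byte buffer pays that overhead 2048 times per MiB of data.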

@StefanKarpinski
Member Author

I would also be ok with not using TranscodingStreams in performance-sensitive situations, creating JLLs for gzip and co instead (for portability) and then using sendfile to send data to/from the external gzip process without needing to pass through Julia's user space at all.

@giordano
Member

giordano commented May 1, 2020

> creating JLLs for gzip

I think I have it in a branch already, I just never opened the PR

@StefanKarpinski
Member Author

Having those as external programs via JLL would be nice in any case because doing compression/decompression via a pipe is often both efficient and convenient.
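As a sketch of the pipe approach, the Python below (illustrative, not Tar.jl or JLL code) streams an uncompressed tar archive into the standard input of an external compressor process, which writes the compressed result to a file. The function name is hypothetical, and it assumes a `gzip` executable is on `PATH`.

```python
import os
import shutil
import subprocess
import tarfile

def tar_through_pipe(paths, out_path, compressor=("gzip", "-c")):
    """Write a tar stream into an external compressor via a pipe.

    Assumes the compressor reads stdin and writes stdout (as `gzip -c` does).
    """
    with open(out_path, "wb") as out:
        proc = subprocess.Popen(list(compressor), stdin=subprocess.PIPE, stdout=out)
        try:
            # "w|" is tarfile's uncompressed streaming mode: no seeking needed.
            with tarfile.open(fileobj=proc.stdin, mode="w|") as tar:
                for path in paths:
                    tar.add(path, arcname=os.path.basename(path))
        finally:
            proc.stdin.close()  # signal end of input to the compressor
            returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"compressor exited with status {returncode}")
```

Swapping in another compressor is a matter of changing the `compressor` argument, provided the chosen program follows the same stdin-to-stdout convention.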

Keno added a commit to JuliaLang/BugReporting.jl that referenced this issue May 1, 2020
Seems to be about 6x faster than gzip, but now bottlenecked on
JuliaIO/Tar.jl#33.
Keno added a commit that referenced this issue May 1, 2020
Makes creating a tarball and compressing it with `zstdmt` about 6x
faster (30s vs 5s). Raw `tar` is still about 20% faster, but we'd
probably need #33 to make up the difference.
Keno added a commit that referenced this issue May 3, 2020
* Increase default buffer size

Makes creating a tarball and compressing it with `zstdmt` about 6x
faster (30s vs 5s). Raw `tar` is still about 20% faster, but we'd
probably need #33 to make up the difference.

* Buffer for extract also

* 1.3 compat