optimization: use sendfile in create and extract #33

StefanKarpinski · 2020-04-29T15:38:26Z

It would be faster to use sendfile or equivalent for the data transfer part of tarball creation and extraction instead of a user-space buffered read/write loop. Relevant code that should be optimized:

https://github.com/JuliaIO/Tar.jl/blob/b8bd833254b48428f1ce0bf4/src/create.jl#L225-L231
https://github.com/JuliaIO/Tar.jl/blob/b8bd833254b48428f1ce0bf4/src/extract.jl#L288-L293

Keno · 2020-05-01T01:15:09Z

Hah, I was complaining about this to Elliot a few days ago, since we're using Tar.jl for the rr traces and the tar'ing up step is too slow. For some numbers, on my benchmark Tar.jl uses 60% of one core in addition to 100% of gzip. Regular tar uses about 6%, which probably suggests that whatever buffer size Tar.jl currently uses is too small. As you mentioned, the correct thing to do is to splice the file directly into the output I/O stream to have Tar.jl CPU utilization be 0.

Keno · 2020-05-01T01:23:46Z

And indeed, with a faster compressor like zstd, this becomes the bottleneck: Tar.jl takes 28s vs about 2s for regular tar.

StefanKarpinski · 2020-05-01T13:19:56Z

Using a bigger buffer would be pretty easy—currently it's 512 bytes, which is very small. But it feels very unnecessary to use a buffer here at all. Do we have an API that exposes sendfile? The other issue is when Tar.jl is used with TranscodingStreams and CodecZlib, in which case the destination (for create) or source (for extract) is not a real file handle anyway and what we'd want ideally is a way to have TranscodingStreams send the data directly to the output stream.
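For illustration, the "bigger buffer" option could look roughly like the sketch below. It is not Tar.jl's actual code: `copy_data`, its `buf_size` keyword, and the 2 MiB default are hypothetical names and values chosen for this example. It still copies through user space, so it only reduces the per-call overhead and does not replace a sendfile-style transfer.

```julia
# Hypothetical helper (not part of Tar.jl): copy exactly `nbytes` bytes from
# `src` to `dst` through one reusable buffer of `buf_size` bytes.
function copy_data(dst::IO, src::IO, nbytes::Integer; buf_size::Integer = 2 * 1024 * 1024)
    buf = Vector{UInt8}(undef, buf_size)
    remaining = Int(nbytes)
    while remaining > 0
        # readbytes! returns the number of bytes actually read into `buf`
        n = readbytes!(src, buf, min(remaining, buf_size))
        n == 0 && throw(EOFError())   # source ended before `nbytes` were read
        write(dst, view(buf, 1:n))    # write only the bytes that were read
        remaining -= n
    end
    return Int(nbytes)
end

# Usage: copy 10_000 random bytes between in-memory streams with a 4 KiB buffer.
src = IOBuffer(rand(UInt8, 10_000))
dst = IOBuffer()
copy_data(dst, src, 10_000; buf_size = 4096)
```

With a 512-byte buffer the same loop makes about 4096 times as many read/write calls as with a 2 MiB buffer, which is the overhead the larger buffer is meant to remove.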

StefanKarpinski · 2020-05-01T13:22:24Z

I would also be ok with not using TranscodingStreams in performance-sensitive situations, creating JLLs for gzip and co instead (for portability) and then using sendfile to send data to/from the external gzip process without needing to pass through Julia's user space at all.
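A minimal sketch of the "external gzip process" idea, assuming a `gzip` executable is available on `PATH` (a JLL would supply its own binary instead). `gzip_to_file` is a hypothetical helper, not an existing API. It shows only the pipe part: the data is still written from Julia into the pipe, so this is not the zero-copy sendfile path described above.

```julia
# Hypothetical helper: run `write_payload(io)` with `io` connected to the
# stdin of an external `gzip` process whose compressed output is written
# to `out_path`. Assumes `gzip` is on PATH.
function gzip_to_file(write_payload::Function, out_path::AbstractString)
    # `open(f, cmd, "w")` gives `f` a writable handle to the process's stdin,
    # then waits for the process and throws if it exits with an error.
    open(pipeline(`gzip -c`, stdout = out_path), "w") do io
        write_payload(io)
    end
    return out_path
end

# Usage: compress a small payload. With Tar.jl loaded, the payload could
# instead be something like `io -> Tar.create("some_dir", io)`, assuming
# `Tar.create` is given an `IO` destination.
out = gzip_to_file(tempname() * ".gz") do io
    write(io, "hello tar")
end
```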

giordano · 2020-05-01T13:25:13Z

> creating JLLs for gzip

I think I have it in a branch already, I just never opened the PR

StefanKarpinski · 2020-05-01T13:40:36Z

Having those as external programs via JLL would be nice in any case because doing compression/decompression via pipe is often both efficient and convenient.

* Increase default buffer size

  Makes creating a tarball and compressing it with `zstdmt` about 6x faster (30s vs 5s). Raw `tar` is still about 20% faster, but we'd probably need #33 to make up the difference.

* Buffer for extract also
* 1.3 compat

StefanKarpinski mentioned this issue Apr 29, 2020

use Tar.jl to create and extract tarballs JuliaPackaging/PkgServer.jl#29

Merged

Keno added a commit to JuliaLang/BugReporting.jl that referenced this issue May 1, 2020

Switch compression to zstdmt

e160dbd

Seems to be about 6x faster than gzip, but now bottlenecked on JuliaIO/Tar.jl#33.

Keno mentioned this issue May 1, 2020

Switch compression to zstdmt JuliaLang/BugReporting.jl#10

Merged

Keno added a commit that referenced this issue May 1, 2020

Increase default buffer size

aa4892f

Makes creating a tarball and compressing it with `zstdmt` about 6x faster (30s vs 5s). Raw `tar` is still about 20% faster, but we'd probably need #33 to make up the difference.

Keno mentioned this issue May 1, 2020

Increase default buffer size #34

Merged
