
Compression Fairing in contrib crate #550

Closed
wants to merge 41 commits

Conversation

Martin1887
Contributor

Hello.

This pull request adds compression support for all responses via the optional 'compression' feature, as requested in #528.

The feature provides brotli and gzip compression with the following behavior:

  • Gzip and brotli are used only when the request has an 'Accept-Encoding' header containing gzip or brotli, respectively.
  • When both gzip and brotli are supported, brotli is used.
  • If the response has a 'Content-Encoding: gzip' header, then gzip is used instead of brotli.
  • If the response has no 'Content-Encoding' header, or its value is br, brotli is used.
  • Responses with a 'Content-Type' of 'image/*' are never compressed.
  • Responses with a 'Content-Type' of 'font/*' and brotli support are compressed in brotli's font mode.
  • Responses with a 'Content-Type' of 'text/*' and brotli support are compressed in brotli's text mode.
  • For all other content types, brotli's generic mode is used.
  • Brotli quality is set to 2 for fast compression with a compression ratio similar to gzip's (see the sketch after this list).
  • Tests have been added to cover this functionality, and all of them pass.
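
A minimal sketch of the brotli setup behind those last points, using the same brotli crate calls that appear in the review snippets further down (imports elided, as in those snippets; the mode value follows the Content-Type rules above):

let mut params = brotli::enc::BrotliEncoderInitParams();
params.quality = 2; // fast, with a compression ratio similar to gzip
// BROTLI_MODE_TEXT for text/*, BROTLI_MODE_FONT for font/*,
// BROTLI_MODE_GENERIC for everything else:
params.mode = BrotliEncoderMode::BROTLI_MODE_TEXT;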

Regards.

for enc in headers.into_iter()
    .filter(|x| x.name() == "Content-Encoding" && x.value() != to_delete) {
    let header = enc.clone();
    println!("{:?}", header);
Contributor

Remove this debugging print?

@Martin1887
Contributor Author

Martin1887 commented Jan 28, 2018

Sorry for the println!; I have removed it.

Thanks for catching it, @messense!

Regards.

@Martin1887
Contributor Author

The build failure seems to be caused by an error compiling TestOpts in compiletest_rs on Travis; it is not related to the last commit or to this pull request:

error[E0560]: struct `test::TestOpts` has no field named `quiet`
  --> /home/travis/.cargo/registry/src/github.aaakk.us.kg-1ecc6299db9ec823/compiletest_rs-0.3.5/src/lib.rs:97:9
   |
97 |         quiet: config.quiet,
   |         ^^^^^^ `test::TestOpts` does not have this field
   |
   = note: available fields are: `list`, `filter`, `filter_exact`, `run_ignored`, `run_tests` ... and 8 others
error: aborting due to previous error

Regards.

@Martin1887
Contributor Author

The build has been fixed.

Regards.

@jebrosen
Collaborator

First, I love the idea of a fairing for compression support and would probably use one if it was merged.

I haven't looked at the whole PR comprehensively, but there is at least one very serious bug: the fairing uses a single AtomicBool each to track whether or not brotli and/or gzip are supported. If two requests are sent (e.g. one accepting gzip and one accepting only identity), the behavior will depend on the timing of the requests and responses -- a response might be sent with gzip encoding when only identity was supported by that client.

Reading the Accept-Encoding header in on_response and eliminating on_request entirely should be both simpler and more correct.
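
A minimal sketch of that approach, assuming nothing beyond the request's own header (no shared state between requests):

fn on_response(&self, request: &Request, response: &mut Response) {
    // Decide per-response from this request's Accept-Encoding.
    let accepts_gzip = request.headers()
        .get("Accept-Encoding")
        .flat_map(|value| value.split(','))
        .any(|e| e.trim() == "gzip");
    if accepts_gzip {
        // compress this response only
    }
}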

On a separate note, some projects might want a choice of whether or not to support brotli, gzip, or other compression methods, either at compile time with feature selection or at runtime with different configs.

@Martin1887
Contributor Author

Hello @jebrosen and thanks for your comment.

Firstly, I don't understand how I missed that the request is available in the on_response function. My first idea was to use only on_response, but then I found that Accept-Encoding is only in the request, so I added the on_request function without noticing that the request is also available in on_response 😖... my fault. I expect to fix it today.

I also thought it would be useful to add features to choose the compression method, but then I found that brotli is not supported in all browsers and is only used over encrypted (https) connections, so gzip must always be supported. Currently, the only useful option would therefore be disabling brotli in order to always use gzip. Since the compression method is chosen in cascade (if brotli is supported use brotli, else if gzip is supported use gzip, etc.), this adds too much complexity to the fairing, so I decided it was not worth it. As a workaround, you can force gzip in your responses by adding the Content-Encoding header, but then you are still downloading and including the brotli code in your executable without using it. However, I will study the features option while fixing the on_response function.

Thanks and regards.

@Martin1887
Contributor Author

Martin1887 commented Feb 25, 2018

Hello.

The Accept-Encoding headers are now checked in the on_response function.

I have found an easy way to provide separate brotli_compression and gzip_compression features, and they have been implemented too. The compression feature remains available to enable all compression methods.

Tests pass, and I have checked it in a project of my own with the compression, gzip_compression and brotli_compression features.

Thanks and regards.

Member

@SergioBenitez SergioBenitez left a comment

An early, incomplete review.

In general, the code is far too complicated for what it's doing, and it won't be accepted unless it's significantly simplified. There are also significantly more allocations than necessary. I don't think we need any non-compression-related allocation here. I started to indicate ideas for simplifying the code, but unfortunately I've run out of time. Still, I hope it's helpful.

Finally, the code appears to be checking that the response includes a Content-Encoding header to actually perform the compression. This seems pretty unintuitive. The user already attached the fairing, so clearly they want compression. The fairing should just perform the compression whenever the client allows it via the Accept-Encoding header. If necessary, you can add configuration to the fairing so that the user can select which additional content-types they don't want the fairing to perform compression on.
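
For illustration only (this struct is not in the PR; the type and field names are invented), such a configuration might look like:

/// Hypothetical fairing configuration.
pub struct Compression {
    /// Top-level media types (e.g. "image") that are never compressed.
    exempt_types: Vec<String>,
}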

}

fn on_response(&self, request: &Request, response: &mut Response) {
    let accept_headers: Vec<&str> = request.headers().get("Accept-Encoding").collect();
Member

This line and the series of booleans below are unnecessary.

The first thing this should do is check whether this is an image. The ContentType header is cached in the request, so you can do this really easily and correctly with:

// We don't compress images, so bail early.
if request.content_type().top() == "image" {
    return;
}

Then we simply want to check whether br or gzip are accepted content encodings and perform gzip or brotli compression if so, preferring brotli if it's accepted. This can be done more simply as well:

let accepted_encodings = request.headers()
    .get("Accept-Encoding")
    .flat_map(|e| e.split(","));

#[cfg(feature = "brotli_compression")]
if accepted_encodings.any(|e| e == "br") {
    // Do brotli compression and return.
}

#[cfg(feature = "gzip_compression")]
if accepted_encodings.any(|e| e == "gzip") {
    // Do GZIP compression and return.
}

let mut compressed = Cursor::new(Vec::<u8>::new());
let mut params = brotli::enc::BrotliEncoderInitParams();
params.quality = 2;
let content_type = headers
Member

What is this trying to do? You can get the raw content type string with request.headers().get_one("Content-Type"). But I don't think you need it. Just use request.content_type().

Contributor Author

Here I collect and join in case the request has several Content-Type headers and some of them are text/*. Once more, this corner case could perhaps be omitted.

.get("Content-Type")
.collect::<Vec<&str>>()
.join(", ");
if content_type.contains("text/") {
Member

Use top().

@Martin1887
Contributor Author

Hello @SergioBenitez.

My idea was that the user could disable compression for specific requests (usually for performance) by forcing a Content-Encoding header, but it is true that this is a corner case that may not be worth it and could even cause misunderstandings. Therefore, I will remove it.

@Martin1887
Contributor Author

Hello.

The code has been simplified, and the compression-algorithm selection via the response's Content-Encoding header has been removed.

Nevertheless, some of your suggestions could not be implemented exactly as proposed, because the content_type() method returns an Option (an if let or match is needed to check the Some value) and cfg attributes are not allowed on if expressions (an if cfg!(...) plus a #[cfg]-attributed code block is needed instead).
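
Concretely, the pattern that does compile, as it appears in the snippets quoted later in this review, is:

if cfg!(feature = "brotli_compression")
    && Compression::accepts_encoding(request, &["br", "brotli"])
{
    #[cfg(feature = "brotli_compression")]
    {
        // brotli compression goes here; the inner #[cfg] block keeps
        // this code out of the build when the feature is disabled
    }
}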

Regards.

@Martin1887
Contributor Author

I have not used pull request change requests before, but I think that @SergioBenitez should approve the changes once he has reviewed them, before the pull request is merged. Or should I do anything more?

I think that all requested changes have been implemented and the pull request is ready to be merged.

Regards.

Member

@SergioBenitez SergioBenitez left a comment

At present, the entire response body is read into a vector and then compressed. This is unacceptable as response bodies are streams; there's no way to prevent a self-DOS here. That must be changed for this to be merged.

.flat_map(|e| e.trim().split(","))
.any(|e| e.trim() == "br" || e.trim() == "brotli")
{
#[cfg(feature = "brotli_compression")]
Member

This is unnecessary. The entire branch will be false if the feature isn't enabled, so this will be compiled out anyway.

&& request
.headers()
.get("Accept-Encoding")
.flat_map(|e| e.trim().split(","))
Member

Do we actually need all of these trims?

Contributor Author

@Martin1887 Martin1887 Apr 9, 2018

I use trim in case any browser sets the header with spaces (for instance: "Accept-Encoding: deflate, gzip, br"). However, since the elements are trimmed after the split, the trim in the flat_map can be safely removed; I will do it.

In Firefox and Chrome this does not happen, but I think it makes the fairing more robust.

// the corresponding feature is enabled
if cfg!(feature = "brotli_compression")
&& request
.headers()
Member

This check should be factored out into a function or method. Then, this whole thing becomes:

if cfg!(feature = "brotli_compression") && has_encoding(&request, &["br", "brotli"]) {
    ....
}

This applies to the else if below as well.
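
One possible body for that helper (not taken from the PR itself; just the obvious scan of the comma-separated header values):

fn has_encoding(request: &Request, encodings: &[&str]) -> bool {
    // True if any Accept-Encoding value matches one of `encodings`.
    request.headers()
        .get("Accept-Encoding")
        .flat_map(|accept| accept.split(','))
        .any(|e| encodings.contains(&e.trim()))
}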

{
#[cfg(feature = "brotli_compression")]
{
let body = response.body_bytes();
Member

We're reading the entire body here. This is really, really bad. What if it's a 1GiB file, or an endless stream? The compression has to operate on a stream.

{
#[cfg(feature = "gzip_compression")]
{
let body = response.body_bytes();
Member

Same issue here.

Contributor Author

I am making changes here, but I am not sure if they are fine.

I'm trying to use the response.body() function to get the streamed body, but the only way I've found to get at the Read inside is using body.into_bytes() and then passing the result as a parameter to the corresponding compression function.

Is this ok?

Doing this has also forced me to use an auxiliary boolean so that the mutable reference to response is used only once in the same block:

        let mut success = false;
        let mut compressed = Cursor::new(Vec::<u8>::new());
        {
            let plain = response.body();
            if let Some(plain) = plain {
                let body_bytes = &plain.into_bytes();
                if let Some(body_bytes) = body_bytes {
                    success = GzEncoder::new(&mut compressed, flate2::Compression::default())
                        .write_all(&body_bytes)
                        .is_ok();
                }
            }
        }
        if success {
            Compression::set_body_and_header(response, compressed, "gzip");
        }

Is this acceptable?

Contributor Author

I don't understand how the streaming interface works.

If I set a sized body on the response, everything works:

fn set_body_and_header<'h, B>(response: &mut Response<'h>, body: B, header: &'h str)
where
    B: Read + Seek + 'h,
{
    response.remove_header("Content-Encoding");
    response.adjoin_header(Header::new("Content-Encoding", header));
    response.set_sized_body(body);
}

However, replacing the set_sized_body call in this function with set_streamed_body leaves the response body empty, despite passing the same bytes as body to the function.

Is there a special function to call when the streaming has finished?

Does the chunk_size parameter of set_chunked_body have special requirements?

What am I doing wrong?

Sorry if my question is stupid.

Member

@SergioBenitez SergioBenitez Apr 16, 2018

Your question isn't stupid at all! I really appreciate all your work here. :)

You can get the actual body with response.body().into_inner() or response.take_body().into_inner(). The values returned from these methods implement Read. The compression libraries should have some kind of wrapper over any Read type that returns another Read type that, when read, compresses on the fly. This is what we need. Once you have that type, use response.set_streamed_body(compression_wrapper) to set the new body on the response.
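
Putting those pieces together, a hedged sketch of that flow with the brotli crate (buffer size and parameters illustrative):

// Take the current body, wrap its reader in an on-the-fly compressor,
// and set the compressed stream back as the response body.
if let Some(body) = response.take_body() {
    let reader = body.into_inner();
    let compressed = brotli::CompressorReader::new(reader, 4096, 2, 20);
    response.set_streamed_body(compressed);
}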

Contributor Author

Yes, I was doing that: in my Fairing the compression libraries read the body as a Cursor<Vec<u8>> and write the compressed bytes into another Cursor<Vec<u8>>.

I have done some tests and I have found that response.set_streamed_body works fine using a repeat().take() as in the documentation example. Nevertheless, when a Cursor<Vec<u8>> is passed to this function, the response body is empty.

This is the code and the results I used to test it:

Setting a Cursor fails:

    response.remove_header("Content-Encoding");
    response.adjoin_header(Header::new("Content-Encoding", header));
    response.set_streamed_body(body);
    println!("{:?}", repeat(87).take(5));
    // println!("{:?}", body);
    println!("{:?}", response.body().unwrap().into_bytes());

Output:

Take { inner: Repeat { .. }, limit: 5 }
Some([])

Setting a repeat().take() works; the Cursor has 16 bytes.

    response.remove_header("Content-Encoding");
    response.adjoin_header(Header::new("Content-Encoding", header));
    response.set_streamed_body(repeat(87).take(5));
    println!("{:?}", repeat(87).take(5));
    println!("{:?}", body);
    println!("{:?}", response.body().unwrap().into_bytes());

Output:

   Take { inner: Repeat { .. }, limit: 5 }
   Cursor { inner: [139, 5, 128, 72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33, 3], pos: 16 }
   Some([87, 87, 87, 87, 87])

As you can see, the response body has bytes when a repeat().take() is passed to set_streamed_body, but when a Cursor<Vec<u8>> is passed, the body is empty. Using set_sized_body with the same Cursor<Vec<u8>> works fine.

I don't know if you can figure out where the problem is.

Thanks.

Contributor Author

I have found the error: it was my misunderstanding of the Cursor struct. Seeking the body to the start before calling set_streamed_body makes everything work.
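
For reference, the fix amounts to rewinding the cursor before handing it to set_streamed_body:

use std::io::{Seek, SeekFrom};

// The cursor's position is at the end after the compressor writes into
// it, so reading it back yields nothing until it is rewound.
body.seek(SeekFrom::Start(0)).unwrap();
response.set_streamed_body(body);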

All tests pass, and I have tested it in a project of my own with JS and CSS files, getting response times between 0 and 2 ms 😊.

I think the pull request is now ready to be merged; I await your comments.

@Martin1887
Contributor Author

@SergioBenitez, have you had any chance to review the last changes?

params.mode = BrotliEncoderMode::BROTLI_MODE_FONT;
}
}
if plain.into_inner().read_to_end(&mut plain_buffer).is_ok() {
Collaborator

read_to_end is the problematic part of this: what if the response was streaming a 3GB file from disk? Without compression, it will be streamed to the client piece by piece keeping only a bit in memory at a time. With compression, it now has to read all 3 GB into memory, compress it (keeping both uncompressed and compressed versions in memory for a short time), then send the compressed version to the client.

A safe approach would involve something like this (modulo parameters and error handling):

let reader = response.take_body().into_inner();
let new_body = brotli::CompressorReader::new(reader, buffer_size, q, lgwin);
Compression::set_body_and_header(response, new_body, "br");

(The same reasoning applies to gzip below)

Contributor Author

I will try your suggestion when #646 is resolved and I can compile Rocket.

However, the flate2 API only supports writing &[u8], so I can't find any way of converting the Read into a &[u8] without read_to_end or similar (I guess take has the same problem).

Thanks for your help.

Collaborator

flate2::read::GzEncoder looks like it would work for this purpose (instead of flate2::write::GzEncoder)
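
A sketch of that suggestion, mirroring the brotli case above (assuming the body is taken the same way):

use flate2::read::GzEncoder;
use flate2::Compression;

// Wrap the plain body's reader so bytes are compressed as the client
// pulls them, instead of buffering the whole compressed output first.
let reader = response.take_body().unwrap().into_inner();
response.set_streamed_body(GzEncoder::new(reader, Compression::default()));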

Contributor Author

I understand that I could then use read to stream the compressed bytes into a buffer of my desired size, no?

It seems weird to me that read writes compressed data xD; that is why I didn't find this API before.

Thanks again.

Contributor Author

@Martin1887 Martin1887 Jun 4, 2018

I have finally made the changes so that responses are compressed in a streaming manner.

I have also done some refactoring of the code, removing unnecessary variables and making it more readable.

@jebrosen, @SergioBenitez, I kindly await your reviews and comments.

@SergioBenitez
Member

@jebrosen Could you check this out?

@SergioBenitez
Member

@Martin1887 Please rebase.

Collaborator

@jebrosen jebrosen left a comment

The plain body is no longer being read into a buffer, but the compressed body still is. That has to be fixed for the same reasons.

if cfg!(feature = "brotli_compression")
&& Compression::accepts_encoding(request, &["br", "brotli"])
{
let mut compressed = Cursor::new(Vec::<u8>::new());
Collaborator

This Vec<u8> to temporarily hold compressed data is still an issue. The fairing should not do any reading/writing of bytes or buffer management itself on either the plain or compressed side; it should instead be left up to the Response/Body machinery to stream the response.

params.mode = BrotliEncoderMode::BROTLI_MODE_FONT;
}
}
brotli::BrotliCompress(&mut plain.into_inner(), &mut compressed, &params)
Collaborator

@jebrosen jebrosen Jun 29, 2018

The use of brotli should read similarly to this code below to stream the response. I don't know what the appropriate values of q or lgwin might be, but the brotli documentation has some guidance.

let body = brotli::CompressorReader::new(plain.into_inner(), buffer_size, q, lgwin);
Compression::set_body_and_header(response, body, "br");

Contributor Author

q is the quality; I am using 2 in this crate to get really fast compression with a compression ratio similar to gzip's.

lgwin is the window size, whose default in the crate is 20. Recommended values are 20 and 22.
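
So, with those values, the earlier suggestion would presumably read (buffer size illustrative):

let body = brotli::CompressorReader::new(plain.into_inner(), 4096, 2, 20);
Compression::set_body_and_header(response, body, "br");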

flate2::Compression::default(),
);
if match response.body() {
Some(plain) => gz.write(&plain.into_bytes().unwrap()).is_ok(),
Collaborator

Same as the brotli case above, a flate2::read::GzEncoder should be used as the body instead of writing into a buffer.

#[cfg(feature = "json")]
#[cfg_attr(feature = "json", macro_use)]
#[doc(hidden)]
pub mod json;

#[cfg(feature = "json")]
pub use json::{Json, SerdeError, JsonValue};
pub use json::{Json, JsonValue, SerdeError};
Collaborator

Was this done automatically by an editor/IDE? It's not related to the compression feature, so it could confuse someone going through the commit history later. Same with the templates:: imports below.

Contributor Author

Yes, RLS did it. I will undo those changes.

Contributor Author

Would a commit undoing the changes be enough, or is there a git command to discard the changes made to that file in a specific commit (I have not found one)?

Collaborator

For now, a commit to undo them would be enough; the history can be cleaned up later.


fn set_body_and_header<'h, B>(response: &mut Response<'h>, mut body: B, header: &'h str)
where
    B: Read + Seek + 'h,
Collaborator

@jebrosen jebrosen Jun 29, 2018

This function should not use Seek, to align with the other changes.
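
Presumably the signature then becomes:

fn set_body_and_header<'h, B>(response: &mut Response<'h>, body: B, header: &'h str)
where
    B: Read + 'h,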

mdritchie and others added 12 commits July 4, 2018 20:04
… low performance hit, set compression mode to TEXT or FONT regarding the Content-Type of the response, images now are not compressed because it is not useful, fixed error when Accept-Encoding was a only header with all values in the same header
…n features, in addition to compression feature for all compression methods
…n features, in addition to compression feature for all compression methods
…on algorithm choose using the Content-Encoding header of the response
…ompression instead of read and write the whole response at once
jebrosen and others added 17 commits September 11, 2018 15:13
Adds the following to the list of known media types:

  * audio/webm (WEBA)
  * image/tiff (TIFF)
  * audio/aac (AAC)
  * text/calendar (Calendar)
  * video/mpeg (MPEG)
  * application/x-tar (TAR)
  * application/gzip (GZIP)
  * video/quicktime (MOV)
  * video/mp4 (MP4)

Also adds the following to the list of known extensions:

  * .weba (WEBA)
  * .ogv (OGG)
  * .mp4 (MP4)
  * .mpeg4 (MP4)
  * .aac (AAC)
  * .ics (Calendar)
  * .bin (Binary)
  * .mpg (MPEG)
  * .mpeg (MPEG)
  * .tar (TAR)
  * .gz (GZIP)
  * .tif (TIFF)
  * .tiff (TIFF)
  * .mov (MOV)
The new 'JsonError' type allows users to inspect the raw string that
failed to parse as a given JSON value.

Resolves rwf2#772.
The new 'FromData' trait allows an implementor to instruct the caller to
maintain state on its stack and later pass a borrow for processing.
Among other things, it greatly simplifies the 'Form' type, removing a
use of unsafe, and allows references in deserialized data guards.
@Martin1887
Contributor Author

My pull request has been merged, and the new crates.io release will be published next week.

Then I will commit and push my changes after testing them.

@Martin1887
Contributor Author

I have committed the last changes and rebased onto master.

All Compression tests pass, and I have also tested the library in another, independent project, using my working copy as the crate path.

I hope that this pull request is finally ready to be merged 🙂.

@jebrosen
Collaborator

Please rebase or reorder commits again against the latest master if you can; many unrelated commits are showing up in the diff and it's hard to filter through the relevant changes.

I only noticed one problem with the latest implementation: any response that already has a Content-Encoding header will be re-compressed. Compression should only happen if there is not already a Content-Encoding header, or if the header's value is identity.
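
A sketch of that guard (the helper name is hypothetical):

// Skip responses that already carry a real encoding; "identity" means
// no transformation was applied, so those may still be compressed.
fn already_encoded(response: &Response) -> bool {
    response.headers()
        .get("Content-Encoding")
        .any(|enc| enc.trim() != "identity")
}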

Everything else looks reasonable to me, just nitpicks and stylistic changes.

@Martin1887
Contributor Author

I have added the Content-Encoding header filter (only responses without a Content-Encoding header, or with a Content-Encoding of identity and/or chunked, are compressed) and tests for this new behavior.

Sorry, but I am not very skillful with git. All commits in this pull request contain 'Brotli' (the first ones) or '[CompressionFairing]' in their messages; I think this can ease the merge work. Anyway, if you know a good tutorial about reordering commits, rebase, cherry-pick, squash, etc., I would appreciate it.

Thanks.

@jebrosen
Collaborator

Sorry, but I am not very skillful with git. All commits in this pull request contain 'Brotli' (the first ones) or '[CompressionFairing]' in their messages; I think this can ease the merge work. Anyway, if you know a good tutorial about reordering commits, rebase, cherry-pick, squash, etc., I would appreciate it.

I'm not sure exactly what happened, but I did try rebasing it myself and things got pretty hairy. Would you be able to make a new branch off of the upstream master branch, copy and paste in all of your changes as a single commit, and make a new pull request from that branch? My guess is that adding these commits directly to your master branch instead of a separate feature branch made the rebasing and merging harder later.

@Martin1887
Contributor Author

@jebrosen, pull request #781 has been created with only one commit differing from the upstream master branch.

Regards.

@SergioBenitez
Member

Closing in favor of #781.
