Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for ETag and Last-Modified headers for summary and summary.sig #2205

Merged

Conversation

pwithnall
Copy link
Member

See the commit messages. This adds support (in the soup and curl backends) for the Last-Modified/ETag/If-None-Match/If-Modified-Since HTTP headers and hooks them up for requests for summary and summary.sig so they don’t have to be re-downloaded if unchanged from last time. In particular, this will shave 590B off each response for summary.sig in the cached (common) case.

This PR is related to #2167 in that it’s a network efficiency improvement, but can be reviewed/merged independently.

@pwithnall
Copy link
Member Author

One possible improvement to think about after this (if it gets merged) is that the separate HTTP requests to get summary.sig and then (if it’s changed) summary could be combined: previously summary.sig was essentially acting as the ETag for summary, but once HTTP ETags are supported properly, then perhaps that could be rethought. If a client has an ETag for its cached summary, it could request summary from the server immediately (without requesting summary.sig first) and rely on the ETag it sends to make sure it only downloads the full summary file when it’s actually changed.

That would mean a shift to using inline signatures to protect summary, though (otherwise you still need to download summary.sig separately). I can’t remember all of the tradeoffs to know whether that’s helpful, unhelpful, or indifferent.

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall idea makes sense to me. Though worth noting there was a suggestion from @alexlarsson recently to have deltas for summary files as well which I think can affect the discussion here. E.g. if we were to support summary deltas, then presumably they'd be in some form which makes querying for "new summary wrt what I have locally" cheap, which intersects with what's going on here.

Though support for cache semantics makes sense on its own too and is compatible with existing OSTree webservers today.

* Wed, 21 Oct 2015 07:28:00 GMT
*/
static GDateTime *
parse_rfc2616_date_time (const char *buf,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah, is there really no helper function in any of glib, libsoup, or libcurl for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s soup_date_new_from_string(), but it’s in libsoup and this code is in the libcurl backend, so I assume libsoup is not an option there. Have I misunderstood?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No you're right, I missed that this is specific to libcurl and not in common with both.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another big thing to keep in mind is that when updating host systems at least ostree usually runs as unrestricted root (I'd eventually like to fix this) and things like the string parsing we're doing here are an attack surface and need careful auditing.

To be clear did you write this code from scratch or is it inspired by anything else?

Let's at least add some unit tests for this?

We've already copied in some libsoup code for URIs, maybe we could copy in the libsoup code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear did you write this code from scratch or is it inspired by anything else?

It’s written from scratch.

Let's at least add some unit tests for this?

I’ll have a go.

We've already copied in some libsoup code for URIs, maybe we could copy in the libsoup code?

Maybe, but it was rewritten recently, so is in basically the same position as this code (except more general and has support for multiple RFC date formats tied together in the same code). The older version of the code was tied to SoupDate and also supported multiple date formats in one code path.

I think having a minimal in-house implementation makes sense, given that we only need to support the Last-Modified header, which only uses one specific HTTP date format (RFC 2616). If we were supporting more headers with other formats then maybe things would be different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I’ve split the RFC 2616 date parsing code out into a separate file so that it can be built into a test program, and added a test program which gives 100% coverage of lines (and, I think, branches — although the lcov setup in ostree doesn’t enable branch coverage). How’s that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the added tests!

Comment on lines 2722 to 2724
g_autoptr(GBytes) etag_bytes = NULL;

etag_bytes = glnx_fgetxattr_bytes (fd, "user.etag", NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: (here and elsewhere), coding style is declaration and initialization at the same time (or e.g. in the case of fd above, declaration would be right before the openat).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So declarations not necessarily at the start of code blocks? I thought there were problems with doing that with autocleanup, where the autocleanup would happen but the initialisation might not (if not reached by the time the block is exited)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as you initialize it to the "null" value, it's fine. When mixing it with the goto out pattern it gets tricky, but nowadays compilers will yell at you if you try to do something funky like jumping over an initialization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Tweaked a couple of declarations to move them to where they’re used, but most declarations in the PR are assigned by pointer so can’t be changed.

src/libostree/ostree-repo-pull.c Show resolved Hide resolved
Comment on lines +75 to +76
char *summary_etag;
guint64 summary_last_modified; /* seconds since the epoch */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How necessary do you feel is ETag support? Feels like Last-Modified should be enough for our purposes, and it's nice that we don't have to dabble in user xattrs. But I'm not an expert in that field -- are there webservers out there which support ETag but not Last-Modified? (I'd expect the opposite.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ETag support is actually more important than Last-Modified support because it doesn’t have race conditions around updates/checks in the same second, and doesn’t have issues with clock synchronisation. Both require some level of support on the server (typically, aiui, servers are configured to serve everything as if it’s not cacheable, as that’s the safest option). I would expect that all web servers would support both headers, though. They’ve been around a long time.

If xattr support isn’t available then ETags won’t work, but should degrade nicely, so that shouldn’t cause a problem.

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have a basic test for this where we check that we avoid pulling the summary. Something like just doing another ostree pull and sanity-checking that the inode for the summary didn't change? (Since we always create a new file when caching.) Hmm, though on the ostree-trivial-httpd side, briefly looking at SoupServer, it seems like we'd have to implement that logic ourselves like this test here.

* Wed, 21 Oct 2015 07:28:00 GMT
*/
static GDateTime *
parse_rfc2616_date_time (const char *buf,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No you're right, I missed that this is specific to libcurl and not in common with both.

Comment on lines 2722 to 2724
g_autoptr(GBytes) etag_bytes = NULL;

etag_bytes = glnx_fgetxattr_bytes (fd, "user.etag", NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as you initialize it to the "null" value, it's fine. When mixing it with the goto out pattern it gets tricky, but nowadays compilers will yell at you if you try to do something funky like jumping over an initialization.

src/libostree/ostree-repo-pull.c Show resolved Hide resolved
const char *end_ptr = NULL;
gint saved_errno = 0;

g_return_val_if_fail (n_digits == 2 || n_digits == 4, FALSE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by the role n_digits plays here. Doesn't look like it's actually used outside of this precondition check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, good catch. It was used before I refactored the function a bit, and then I forgot to plumb it back in. Will update.

Comment on lines 345 to 353
if (!is_file && response == 304)
{
/* Version on the server is unchanged from the version we have
* cached locally; report this as an out-argument, a zero-length
* response buffer, and no error. */
req->out_not_modified = TRUE;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: should we check here that at least one of if_none_match or if_modified_since was provided? That way, if somehow a HTTP server returns 304 even though we didn't provide either of these, we still error out as before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, will tweak things to do that (and in the libsoup implementation too).

@pwithnall
Copy link
Member Author

Would be nice to have a basic test for this where we check that we avoid pulling the summary

Working on that now; will hopefully have something ready tomorrow.

@pwithnall
Copy link
Member Author

Updated. I’ve fixed a couple of bugs and added a unit test (although see its commit message: it doesn’t actually fail with the ETag/Last-Modified changes reverted). All my changes are as fixups for ease of review, and can be squashed later.

strncasecmp (buffer, "ETag: ", strlen ("ETag: ")) == 0)
{
g_clear_pointer (&req->out_etag, g_free);
req->out_etag = g_strstrip (g_strdup (buffer + strlen ("ETag: ")));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g_strstrip() is dangerous here because if the buffer has leading whitespace we lose track of the real malloc pointer. I think you just want g_strchomp()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? g_strstrip(x) is just g_strchug(g_strchomp(x)), and g_strchug() doesn’t increment the input pointer, it uses memmove() to shift the string contents down over any leading whitespace. The pointer is unchanged.

* Wed, 21 Oct 2015 07:28:00 GMT
*/
static GDateTime *
parse_rfc2616_date_time (const char *buf,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another big thing to keep in mind is that when updating host systems at least ostree usually runs as unrestricted root (I'd eventually like to fix this) and things like the string parsing we're doing here are an attack surface and need careful auditing.

To be clear did you write this code from scratch or is it inspired by anything else?

Let's at least add some unit tests for this?

We've already copied in some libsoup code for URIs, maybe we could copy in the libsoup code?

src/libostree/ostree-repo-pull.c Outdated Show resolved Hide resolved
src/libostree/ostree-fetcher-soup.c Outdated Show resolved Hide resolved
src/libostree/ostree-fetcher-curl.c Show resolved Hide resolved
src/libostree/ostree-fetcher-curl.c Outdated Show resolved Hide resolved
src/libostree/ostree-fetcher-curl.c Show resolved Hide resolved
src/libostree/ostree-fetcher-curl.c Outdated Show resolved Hide resolved
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great overall! Really just minor things.

Could you squash the commits? We don't use the Homu bot here anymore which means that there is no autosquashing.

src/libostree/ostree-date-utils.c Show resolved Hide resolved
src/libostree/ostree-fetcher-curl.c Show resolved Hide resolved
src/libostree/ostree-fetcher-curl.c Outdated Show resolved Hide resolved
src/libostree/ostree-fetcher-curl.c Outdated Show resolved Hide resolved
src/libostree/ostree-fetcher-curl.c Show resolved Hide resolved
tests/test-rfc2616-dates.c Show resolved Hide resolved
Add support in the soup and curl fetchers to send the `If-None-Match`
and `If-Modified-Since` request headers, and pass on the `ETag` and
`Last-Modified` response headers.

This currently introduces no functional changes, but once call sites
provide the appropriate integration, this will allow HTTP caching to
happen with requests (typically with metadata requests, where the data
is not immutable due to being content-addressed). That should reduce
bandwidth requirements.

Signed-off-by: Philip Withnall <[email protected]>
As `summary` and `summary.sig` aren’t immutable, HTTP requests to
download them can be optimised by sending the `If-None-Match` and
`If-Modified-Since` headers to avoid unnecessarily re-downloading them
if they haven’t changed since last being checked.

Hook them up to the new support for that in the fetcher.

The `ETag` and `Last-Modified` for each file in the cache are stored as
the `user.etag` xattr and the mtime, respectively. For flatpak, for
example, this affects the cached files in
`~/.local/share/flatpak/repo/tmp/cache/summaries`.

If xattrs aren’t supported, or if the server doesn’t support the caching
headers, the pull behaviour is unchanged from before.

Signed-off-by: Philip Withnall <[email protected]>
This test would have actually passed before the summary file caching
changes (in the previous few commits) were added, as the `summary.sig`
essentially acted as the ETag for the summary file, and itself wasn’t
updated on disk if it didn’t change when querying the server.

Actually testing that the HTTP caching headers are working to reduce
HTTP traffic would require test hooks into the pull code or the
trivial-httpd server, neither of which I have the time to add at the
moment.

Signed-off-by: Philip Withnall <[email protected]>
This is basic support for the
Last-Modified/ETag/If-Modified-Since/If-None-Match headers. It’s not
high performance, and doesn’t support all of the related caching
features (like the If-Match header, etc.).

Signed-off-by: Philip Withnall <[email protected]>
This makes it testable, and increases its test coverage too 100% of
lines, as measured by `make coverage`.

Signed-off-by: Philip Withnall <[email protected]>
@pwithnall pwithnall force-pushed the etags-and-last-modified branch from 4aafdc1 to 0974a7f Compare October 22, 2020 20:05
@pwithnall
Copy link
Member Author

All review comments addressed (or commented on) and all the fixups squashed.

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

* Wed, 21 Oct 2015 07:28:00 GMT
*/
static GDateTime *
parse_rfc2616_date_time (const char *buf,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the added tests!

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon, pwithnall

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,jlebon,pwithnall]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Copy link
Member

Looks like Travis is stuck, hmm. Not sure how to retrigger that one without another force push. Let's give it a bit and see if it comes back, otherwise we can override.

@openshift-merge-robot openshift-merge-robot merged commit f8f6464 into ostreedev:master Oct 22, 2020
@pwithnall pwithnall deleted the etags-and-last-modified branch October 22, 2020 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants