Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Vary sampling rate in image metadata cache, and back off on HTTP 429 #1354

Merged
merged 4 commits into from
Sep 18, 2018

Conversation

squaremo
Copy link
Member

Vary the sampling rate

To drive the refresh of cache data, we expire entries after a
configurable duration (default of 1 hour).

Because many image tags never change what they point to, this means
the image cache does thousands of needless fetches every hour (or
whatever the configured duration is). On the other hand, if you set it
to much longer (say 24h), any changes to a tag won't be noticed for
that much longer.

This commit gives each manifest its own, adaptive schedule for being
refreshed. It does this by doubling the period when a freshly fetched
manifest does not differ from the existing entry, and halving it when
it does differ. (The refresh period is clipped to [minRefresh, maxRefresh])

A tag that doesn't change will end up being polled infrequently. A tag
that changes occasionally will alternate between a number of values,
depending on how regular the changes are.

There are surely more accurate ways of arriving at a good estimate for
the polling frequency based on the past samples; this one has the
advantage of being very simple, and requiring little state to be kept.

Back off on HTTP 429

Image registries tend to have rate limiting, and it is hard to know in
advance what the practical limits are -- and they will often
change. Nonetheless, we have to back off when we get a HTTP 429,
otherwise the registry could e.g., decide to blacklist our IP.

Therefore:

  • decrease the rate limit for a registry domain if we get HTTP 429

    • we only reliably see this in the round-tripper, but that's where
      the rate limiting is anyway
    • this goes through the RateLimiters struct so that we can log it
  • we also want to edge back towards the configured rate when we have
    succeeded

    • but we only know if we succeeded in the calling code, so it has
      to be wired through

@squaremo squaremo force-pushed the rewrite/image-fetch-backoff branch from a4084ae to 23448bb Compare September 12, 2018 15:07
To drive the refresh of cache data, we expire entries after a
configurable duration (default of 1 hour).

Because many image tags never change what they point to, this means
the image cache does thousands of needless fetches every hour (or
whatever the configured duration is). On the other hand, if you set it
to much longer (say 24h), any changes to a tag won't be noticed for
that much longer.

This commit gives each manifest its own, adaptive schedule for being
refreshed. It does this by doubling the period when a freshly fetched
manifest does not differ from the existing entry, and halving it when
it does differ. (The refresh period is clipped to `[minRefresh,
maxRefresh]`)

A tag that doesn't change will end up being polled infrequently. A tag
that changes occasionally will alternate between a number of values,
depending on how regular the changes are.

There are surely more accurate ways of arriving at a good estimate for
the polling frequency based on the past samples; this one has the
advantage of being very simple, and requiring little state to be kept.

Note that the expiry of the keys is now distinct from the deadline for
being refreshed. These were previously conflated, with the refresh
"timed" so it would fetch new metadata just before an entry was due to
expire. The expiry is set so that there is a generous grace period
after the deadline for a new value to be fetched and written.
Image registries tend to have rate limiting, and it is hard to know in
advance what the practical limits are -- and they will often
change. Nonetheless, we have to back off when we get a HTTP 429,
otherwise the registry could e.g., decide to blacklist our IP.

Therefore:

 - decrease the rate limit for a registry domain if we get HTTP 429
   - we only reliably see this in the round-tripper, but that's where
     the rate limiting is anyway
   - this goes through the RateLimiters struct so that we can log it

 - we also want to edge back towards the configured rate when we have
   succeeded
   - but we only know if we succeeded in the calling code, so it has
     to be wired through
To see whether the image cache refresh is doing what you expect, it's
useful for it to log what it's doing as it does it (and why it does
things). This commit adds some trace messages, switched on with the
`--registry-trace` flag, and tidies up some of the other logging.
@squaremo squaremo force-pushed the rewrite/image-fetch-backoff branch from 249a30a to 1c3a901 Compare September 12, 2018 15:17
@squaremo squaremo requested a review from rndstr September 14, 2018 09:46
Copy link
Contributor

@rndstr rndstr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants