Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Negotiate cache_expiration_time #26

Merged
merged 15 commits into from
Sep 20, 2024
Merged

Conversation

Alkarex
Copy link
Member

@Alkarex Alkarex commented Sep 15, 2024

Negotiate the cache expiration time based on the HTTP response headers.
Return the cache duration time in number of seconds since the Unix Epoch, accounting for:

  • Cache-Control: max-age minus Age, bounded by $cache_duration_min and $cache_duration_max
  • Cache-Control: must-revalidate will set $cache_duration to $cache_duration_min
  • Cache-Control: no-cache will return time() + $cache_duration_min
  • Cache-Control: no-store will return time() + $cache_duration_min - 3 (to distinguish from no-cache if needed)
  • Expires like Cache-Control: max-age but only if it is absent

Parameters:

  • $cache_duration Desired cache duration in seconds, potentially overridden by HTTP response headers
  • $cache_duration_min Minimal cache duration (in seconds), overriding HTTP response headers Cache-Control and Expires
  • $cache_duration_max Maximal cache duration (in seconds), overriding HTTP response headers Cache-Control and Expires

- `Cache-Control: max-age` minus `Age`, extendable by `$simplepie_cache_duration`
- `Cache-Control: must-revalidate` will prevent `$simplepie_cache_duration` from extending past the `max-age`
- `Cache-Control: no-cache` will return the current time
- `Cache-Control: no-store` will return `0`
- `Expires` but only if `Cache-Control: max-age` is absent
@Alkarex
Copy link
Member Author

Alkarex commented Sep 15, 2024

@Frenzie , @marienfressinaud, @Art4 and others: discussion welcome.

At the moment SimplePie is not HTTP compliant when it comes to HTTP requests and caching. This PR addresses some of it.

But I have some doubts. On my local tests, I see plenty of feeds responding with Cache-Control: no-cache, no-store, max-age=0, must-revalidate or a variation of that. no-store being the strictest.

For instance, our own GitHub releases feed returns cache-control: max-age=0, private, must-revalidate (it does support conditional requests though, luckily).

So on FreshRSS instances with many users, there is a good likelihood of having several users subscribing to the same feed(s) (for instance this GitHub release feed), and preventing the cache from being used by the following users during a refresh cron job would be a waste of resources - but potentially good for servers which care about download statistics.

We could consider updating our heuristics based on e.g. Cache-Control: private. There are fewer feeds with this directive but still a number in my own test set, such as https://www.journalduhacker.net/rss (returns Cache-Control: max-age=0, private, must-revalidate) or https://www.blogger.com/feeds/8698702854482141883/posts/default (returns cache-control: private, max-age=0, must-revalidate, no-transform)

Thoughts?

@Frenzie
Copy link
Member

Frenzie commented Sep 15, 2024

At least on GH I would suspect it's closer to a mistake, intended to make sure the comments are up to date on a page like this one.

But you can certainly think of scenarios where it would be intentional. Perhaps something like a server minimum cache time constant at 5 or 10 minutes by default?

@Alkarex
Copy link
Member Author

Alkarex commented Sep 15, 2024

Perhaps something like a server minimum cache time constant at 5 or 10 minutes by default

I have introduced $cache_duration_min: Minimal cache duration (in seconds), overriding HTTP response headers Cache-Control: max-age and Expires, but without effect on Cache-Control: no-store and Cache-Control: no-cache and Cache-Control: must-revalidate

Its default is 60s but can be set higher. But not obeying Cache-Control: no-store or Cache-Control: no-cache or Cache-Control: must-revalidate would not be HTTP compliant.

Alkarex added a commit to Alkarex/FreshRSS that referenced this pull request Sep 15, 2024
@Alkarex Alkarex closed this Sep 15, 2024
@Alkarex Alkarex deleted the negociate_cache_expiration_time branch September 15, 2024 22:25
@Alkarex Alkarex restored the negociate_cache_expiration_time branch September 15, 2024 22:31
@Alkarex Alkarex reopened this Sep 15, 2024
@Alkarex Alkarex changed the title Negociate_cache_expiration_time Negotiate cache_expiration_time Sep 15, 2024
@Alkarex
Copy link
Member Author

Alkarex commented Sep 16, 2024

Changed logic, to account for the many servers disabling cache all-together. Simpler, and probably more pragmatic

@Frenzie
Copy link
Member

Frenzie commented Sep 16, 2024

You don't think 24h sounds a bit short as a default maximum if the site is explicitly asking for it? :-)

@Alkarex
Copy link
Member Author

Alkarex commented Sep 16, 2024

You don't think 24h sounds a bit short as a default maximum if the site is explicitly asking for it? :-)

Maybe, but I am balancing that with users, who will believe that the refresh is broken...

P.S.: And I can also see some feeds returning high values, which are most likely broken.
For instance 30d for https://gts-net.dk/feed/ during the conditional response (I need to debug that one a bit more).

@Frenzie
Copy link
Member

Frenzie commented Sep 16, 2024

Right, I'm not saying to just listen to whatever the feed might say, just that a few days still seems reasonable. But the point about users makes sense.

@Alkarex
Copy link
Member Author

Alkarex commented Sep 18, 2024

For some unknown reason, I observe some strange feeds (e.g. https://gts-net.dk/feed/ ), which are returning a much higher max-age (30d) for a 304 response than for the initial 200 response (1h).

$ curl -v -A 'FreshRSS/1.25.0-dev (Linux; https://freshrss.org)' -H "Accept: application/atom+xml, application/rss+xml, application/rdf+xml;q=0.9, application/xml;q=0.8, text/xml;q=0.8" 'https://gts-net.dk/feed/'
> GET /feed/ HTTP/2
> Host: gts-net.dk
> User-Agent: FreshRSS/1.25.0-dev (Linux; https://freshrss.org)
> Accept: application/atom+xml, application/rss+xml, application/rdf+xml;q=0.9, application/xml;q=0.8, text/xml;q=0.8
>
< HTTP/2 200
< date: Wed, 18 Sep 2024 10:38:12 GMT
< content-type: application/rss+xml; charset=UTF-8
< server: Apache
< link: <https://gts-net.dk/wp-json/>; rel="https://api.w.org/", <https://gts-net.dk/>; rel="canonical"
< x-tec-api-version: v1
< x-tec-api-root: https://gts-net.dk/wp-json/tribe/events/v1/
< x-tec-api-origin: https://gts-net.dk
< last-modified: Wed, 18 Sep 2024 10:24:19 GMT
< cache-control: max-age=3600
< expires: Wed, 18 Sep 2024 11:38:12 GMT
< vary: Accept-Encoding
< x-content-type-options: nosniff
< simplycom-server: Apache
< simplycom-server: nginx
< alt-svc: h3=":443"; ma=86400, h3-29=":443"
<
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
        xmlns:content="http://purl.org/rss/1.0/modules/content/"
        xmlns:wfw="http://wellformedweb.org/CommentAPI/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:atom="http://www.w3.org/2005/Atom"
        xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
        xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
        >
<!-- ... -->
</rss>

$ curl -v -A 'FreshRSS/1.25.0-dev (Linux; https://freshrss.org)' -H "Accept: application/atom+xml, application/rss+xml, application/rdf+xml;q=0.9, application/xml;q=0.8, text/xml;q=0.8" -H 'If-Modified-Since: Wed, 18 Sep 2024 10:24:19 GMT' 'https://gts-net.dk/feed/'
> GET /feed/ HTTP/2
> Host: gts-net.dk
> User-Agent: FreshRSS/1.25.0-dev (Linux; https://freshrss.org)
> Accept: application/atom+xml, application/rss+xml, application/rdf+xml;q=0.9, application/xml;q=0.8, text/xml;q=0.8
> If-Modified-Since: Wed, 18 Sep 2024 10:24:19 GMT
>
< HTTP/2 304
< date: Wed, 18 Sep 2024 10:41:23 GMT
< server: Apache
< last-modified: Wed, 18 Sep 2024 10:24:19 GMT
< cache-control: max-age=2592000
< expires: Fri, 18 Oct 2024 10:41:23 GMT
< x-content-type-options: nosniff
< simplycom-server: Apache
< vary: Accept-Encoding
< simplycom-server: nginx
<

@Frenzie
Copy link
Member

Frenzie commented Sep 18, 2024

Oof. :-/

@Alkarex
Copy link
Member Author

Alkarex commented Sep 20, 2024

Another feed with same problem: https://www.smappee.com/blog/feed/

< HTTP/2 200
< server: nginx
< date: Fri, 20 Sep 2024 13:46:47 GMT
< content-type: application/rss+xml; charset=UTF-8
< last-modified: Thu, 19 Sep 2024 14:26:27 GMT
< cache-control: max-age=3600
< expires: Fri, 20 Sep 2024 14:46:45 GMT
< HTTP/2 304
< server: nginx
< date: Fri, 20 Sep 2024 13:47:32 GMT
< last-modified: Thu, 19 Sep 2024 14:26:27 GMT
< cache-control: max-age=2592000
< expires: Sun, 20 Oct 2024 13:47:31 GMT

@Alkarex
Copy link
Member Author

Alkarex commented Sep 20, 2024

Well, when nginx is in the equation, it is often the culprit...
An explanation h5bp/server-configs-nginx#230

Problem:

  1. Expiration time is currently determined by content-type.

  2. Content-type is not set for 304 responses.

  3. This means 304 responses gets the default expiration time, which could be very different from the original.

@Alkarex
Copy link
Member Author

Alkarex commented Sep 20, 2024

Workaround implemented: Allow servers to return a shorter cache duration for 304 responses, but not longer than the value in the 200 response

@Alkarex
Copy link
Member Author

Alkarex commented Sep 20, 2024

Seems to work quite fine

@Alkarex Alkarex merged commit 7090eed into freshrss Sep 20, 2024
18 checks passed
@Alkarex Alkarex deleted the negociate_cache_expiration_time branch September 20, 2024 21:19
Alkarex added a commit to FreshRSS/FreshRSS that referenced this pull request Sep 20, 2024
* SimplePie support for HTTP cache policies
Discussion in FreshRSS/simplepie#26

* Bump SimplePie commit

* Typos

* Typos

* Simpler logic

* Explicitly disable cache for non-GET flows

* Bump SimplePie commit

* Bump SimplePie commit

* Bump SimplePie commit

* Bump SimplePie commit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants