Consider shared caching #22
I hope this is a reasonable place to comment. (If not, please tell me where to go.) I've been working on content addressing systems for several years. I understand that content addresses, which are "locationless," are inherently in conflict with the same-origin policy, which is location-based. An additional/alternate solution is for a list of acceptable hashes to be published by the server at a well-known location; for example, the user agent could request something like /.well-known/sri-list. This does add some complexity both for user agents and for site admins. On the other hand, the security implications are well understood, and it wouldn't require new permission logic. Thanks for your work on SRI. |
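As a rough illustration of that idea (the path comes from the discussion below; the JSON shape is invented for this sketch and is not part of any specification):

// hypothetical response to GET https://example.com/.well-known/sri-list
{
  "hashes": [
    "sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC",
    "sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU="
  ]
}

The user agent could fetch and cache this list per origin, and consult it before satisfying a cross-origin cache hit on that origin's behalf.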
An interesting idea (although I know many folks who are vehemently against well-known location solutions, but I won't pretend to fully grasp why). If implemented, though, it would still require a round trip to get .well-known/sri-list, right? Which seems to lose a lot of the benefit of these acting as libraries. Another suggestion, which I think I heard somewhere, is: if the page includes a CSP, only use an x-origin cache for an integrity attribute resource if the CSP includes the integrity value in the script-hash whitelist. I think this would address @mozfreddyb's concerns listed in Synzvato/decentraleyes#26, but I haven't thought too hard about it. On the other hand, it also starts to look really weird and complicated :-/ Also, these solutions don't address timing attacks with x-origin caches. Although, as a side note, someone recently pointed out to me that history timing attacks in this case are probably not too concerning from a security perspective since it's a "one-shot" timing attack. That is, the resource is definitively loaded after the attack happens, so you can't attempt the timing again, and that makes the timing attack much more difficult to pull off, since timing attacks usually rely on repeated measurement. |
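Concretely, that blessing idea might look something like this. The hash-source syntax is standard CSP, but using it to gate the cross-origin cache is only the suggestion being discussed here, and the hash value is reused from elsewhere in this thread purely for illustration:

Content-Security-Policy: script-src 'self' 'sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC'

<!-- hypothetical rule: eligible for the shared x-origin cache only because its
     integrity value also appears as a hash source in the CSP above -->
<script src="https://cdn.example.com/library.min.js"
        integrity="sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC"
        crossorigin="anonymous"></script>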
Using a script-hash whitelist in the HTTP headers (as part of CSP or separately) is better for a small number of hashes, since it doesn't require an extra round trip. Using a well-known list is better for a large number of hashes, since it can be cached for a long time. I agree that well-known locations are ugly. Although it works for /robots.txt and /favicon.ico, there is a high cost for introducing new ones. The privacy problem is worse than timing attacks: if you control the server, you can tell that no request is ever made. This seems insurmountable for cross-origin caching. Perhaps the gulf between hashes and locations is too large to span. For true content-addressing systems (like what I'm working on), my preference is to treat all hashes as a single origin (so they can't reference or be referenced by location-based resources). Thanks for your quick reply! |
I'd be slightly more interested in blessing the hashes for cross-origin caches by mentioning them in the CSP. The idea to separate hashed resources into their own origin is interesting, but I don't feel comfortable drilling holes that deep into the existing weirdness of origins. |
To be clear, giving hashes their own origin only makes sense if you are loading top-level resources by hash. In that case, you can give access to all other hashes, but prohibit access to ordinary URLs. But that is a long way off for any web browsers and far from the scope of SRI. |
For the record, @hillbrad wrote a great document outlining the privacy and security risks of shared caching: https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html |
That document doesn't appear to consider an opt-in approach. While this would reduce the number of people who do it, it could be quite useful: <script src=jquery.js integrity="..." public/> This tag should only be put on scripts for which timing is not an issue. Of course, deciding what is public is now the responsibility of the website. However, since the benefit would be negligible for anything that is website-specific, this might be pretty clear. For example, loading a script specific to my site has a single URL anyways, so I may as well not put public; otherwise malicious sites can figure out who has been to my site recently even though I don't get any benefit from the content-addressed cache. However, if I am including jQuery there will be a benefit, because there are many different copies on the internet, and at the same time it means that knowing whether a user has jQuery in their cache is much less identifying. That being said, if FF had a way to turn this on now I would enable it; I don't see the privacy hit as large, and the performance would be nice to have. |
If I want to use the presence of my script in a shared cache to track you
illicitly, I will deliberately set the public flag, even if the content
isn't actually public.
|
On 21/12/16 01:07, Brad Hill wrote:
If I want to use the presence of my script in a shared cache to track you
illicitly, I will deliberately set the public flag, even if the content
isn't actually public.
If you want to track me and you control both origins you want to track me from, you can just use the same URL and get a cookie, which is better tracking and works today.

The concern is a third-party site having a script with the same hash as, for example, a script on Facebook; then they can tell if you have been to Facebook "recently". However, since Facebook hosts the script, they won't set it as "public", and so it won't be a problem.
I don't understand what threat you are trying to protect against.
|
A "public" flag seems like a good solution to me. It seems to encapsulate both the benefits and the drawbacks of shared caching. It says, "yes, you can share files publicly, but that means anyone can see them." That said, if it's opt-in, there's the question of how many sites would actually use it, and whether it's worth the trouble. Especially if it has to be set in HTML, rather than say by CDNs automatically. Maybe it would work better as an HTTP header? |
Setting it in the HTML doesn't seem to be a big problem. If large CDN providers include this in their example script/style tags, then sites will copy and paste support for it. A similar approach is currently being used for SRI, and although adoption is not as fast as I'd like, usage will slowly grow. Sites that are also looking for those extra performance boosts would be keen to implement it. |
The idea of a public header (or even another key in Cache-Control) could work as well. At the end of the day, I have no major objections to either option, though. |
@kevincox Yes, I was suspecting that. The Cache-Control security concerns (cache poisoning, accidentally caching sensitive information) are prevented by hashing. The only remaining security consideration is information leaks. I'm not opposed to using an HTML attribute instead, but I think it's good to reuse existing mechanisms when they fit. Caching has traditionally been controlled via HTTP, not HTML. There are a few other ways to break this down:
I think that framing it in terms of "which method is easier for non-expert webmasters to deploy?" is likely to lead to a suboptimal solution. Yes, some people don't know how to set HTTP headers, and some hosts don't let users set them, but in that case they are already stuck with limited caching options. Unless we're going to expose all of Cache-Control in HTML, they will hit that limitation anyway. |
@btrask A website highly concerned about privacy and loading a library from a third-party CDN can only trust what is in its own page source; it can't rely on what the CDN sends in its HTTP headers. |
@brillout: Yes, good point. Using a mechanism not in the page source defeats the purpose, when the page source is the only trusted information. Thanks for the tip! |
@metromoxie Are we missing any pieces? The two concerns are: (1) privacy (the shared cache can leak the user's browsing history via timing attacks), and (2) CSP (the shared cache could be used to bypass a page's CSP).

Solution to privacy: we can make the shared cache an opt-in option via an HTML attribute. I'd say that's enough. (But if we want more protection, browsers could add a resource to the shared cache only when many domains use that resource, as described in https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html#solution and w3c/webappsec#504 (comment).)

Solution to CSP: the UA should treat scripts with the shared cache enabled as inline scripts (as described here: w3c/webappsec#504 (comment)).

It would be super exciting to be able to use a bunch of web components using different frontend frameworks behind the web component curtain: a date picker using Angular, an infinite scroll using React, and a video player using Vue. This is currently prohibitive KB-wise, but a shared cache would allow it. And with WebAssembly the sizes of libraries will get bigger, increasing the need for such a shared cache.

@nomeata Funny to see you on this thread; the world is small |
An opt-in privacy leak isn't a great feature to have. |
How about opt-in + a resource is added to the shared cache only after the resource has been loaded by several domains? |
I don't think that really helps as the attacker can purchase two domains quite easily.
|
Yes, it can't be just two; the threshold would have to be noticeably higher. |
CSP has (is getting?) a nonce-based approach. IIUC the concern with CSP is that an attacker would be able to inject a script that loaded an outdated/insecure library through the cache, thus bypassing controls based on origin. However, requiring nonces for SRI-based caching seems to solve this issue, as the attacker wouldn't know the nonce; it also creates a performance incentive for websites to move to nonces, which are more secure than domain whitelists for the same reason[1]. I think it's possible that we could solve the privacy problem by requiring a certain number of domains to reference the script; it'd be really useful to have some metrics from browser telemetry here. For example, if we determined that enough users encounter a reference to jQuery on more than 100 domains, we could load things from an SRI cache only once they had been encountered in 100+ distinct top-level document domains (i.e. domains the user explicitly browsed to, not ones merely loaded in a frame or something). The idea being that, because of the top-level document requirement, the attacker would have to socially engineer the user into visiting 100 domains, which would be very, very difficult. However, if telemetry told us that 100 is too high a number and it's actually more like 20 for a particular jQuery version, that'd be a different story. [1]: consider e.g. being able to load an insecure Angular version from the Google CDN because the site loaded jQuery from the Google CDN |
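A sketch of how the nonce variant could look. Nonce sources are standard CSP; consulting the shared cache only for nonce-carrying, integrity-annotated scripts is the suggestion above, not implemented behaviour, and the integrity value is illustrative:

Content-Security-Policy: script-src 'nonce-rAnd0mPerResponse'

<!-- hypothetical rule: the UA consults the SRI shared cache only because this tag
     carries both a valid nonce and integrity metadata -->
<script src="https://cdn.example.com/library.min.js"
        nonce="rAnd0mPerResponse"
        integrity="sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC"
        crossorigin="anonymous"></script>

An injected tag without the per-response nonce could then neither execute nor probe the cache.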
For some domains that file could be too large and change too often. Consider Tumblr's image hosting (##.media.tumblr.com), where each of the domain names hosts billions of files and the list changes every second. How about something similar to HTTP ETags, but with a client-specified hash algorithm? If the hash is correct you only get a response affirming as much instead of the entire file, which the browser can cache. It doesn't save you the round trip, but it saves you the data. |
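A sketch of such an exchange. "If-Hash" is an invented request header used purely for illustration (nothing like it is standardized), and the digest value and host are placeholders:

GET /images/photo.jpg HTTP/1.1
Host: images.example.com
If-Hash: sha-256=47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=    <-- hypothetical header

HTTP/1.1 304 Not Modified

The 304 tells the browser that the body it already holds (identified by the digest it sent) is still correct, so only headers cross the wire; as noted above, the round trip itself remains.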
From a privacy perspective, could we make it so that the resource is loaded from each origin at least once (if for no other reason than to verify that the SRI hash is valid)? The browser could still cache only one instance of it (and re-use whatever compilation cache etc. it deems relevant), storing that information once, and with various weightings the file may persist in cache for longer. This removes some of the benefit that user agents could get from a "first load" perspective, but solves the privacy issue and keeps some of the other benefits. As a side note, this could actually be implemented without the use of SRI hashes: if the browser links together identical files based on contents (e.g. stored against a hash), then it could perform this kind of optimisation irrespective of whether the website declares SRI hashes. |
@MatthewSteeples which benefits remain? If the browser only downloads but skips compilation, the privacy problems resurface via timing attacks. |
@ArneBab While theoretically possible, we're talking about a one-shot attempt to time how long it took the browser to compile something. You couldn't do repeated measurements to benchmark the speed of the device, or know what else was happening at the same time, so I'm not sure how reliable the numbers would be unless you're targeting a significantly large JS file. Would the same be true for CSS files? If it's still too much of a privacy risk, you could still have the battery benefit by just sleeping for however long the compilation took last time. |
@MatthewSteeples they could provide other files with intentional changes to benchmark the browser during the access, and sleeping can be detected, because it can speed up the other compiles. So you don’t really win much in exchange for giving up the benefits of not accessing the site at all. For CSS files this is true, too. As an example you can take this page with minimal resources which shows significant parse-time in Firefox. But it would be possible to provide real privacy with a browser-provided whitelist and canonical URLs. That keeps the benefit of already having the file locally most of the time. So the core question is: if you download (and compile, because otherwise this is detectable), even though you have the file locally, which benefits remain? Are there benefits that remain? |
A shared cache definitely brings a lot of advantages (faster sites, less data usage for the user, less network usage for the ISPs, browsers could cache the compiled/interpreted files, etc.). From what I read in this thread, the main pushback is the privacy concern that a specific user could be tracked by checking whether they have a specific file cached or not, meaning that we can know whether the user visited a site (or the same site) before that included the same file. The solutions I see for the privacy concerns:
I think that the shared cache is a lot better from a privacy point of view than including the resources from a 3rd party domain. So, although it allows some sort of tracking, it is still a step forward from just having all the websites linking to the same file on a CDN. |
You're misreading. The main pushback is the security concern. The privacy concern already exists for CDNs, and browsers are fighting it; Safari calls its mitigation a "partitioned cache". I'm afraid this will never be. |
@hillbrad wrote a great document outlining the privacy and security risks of shared caching: https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html |
(@annevk asked me to unlock the conversation. I'm not too hopeful about seeing new information in this 5-year-old thread.) |
I don't think there is a fundamental problem here that makes this impossible. For example, if browsers were to cache popular libraries such as React and Vue, then this wouldn't pose any problems, correct? If we can find a technique ensuring that only popular library code is cached (instead of unique app code), then we solve the problem, right? (I'm assuming that Subresource Integrity Addressable Caching covers all known issues.) Could we maybe reopen this ticket? I'd argue that as long as we don't find a fundamental blocker, a cross-origin shared cache is still open for consideration. The benefits would be huge; it seems well worth digging further. |
The other comment just before my last one has a newish - imho fundamental - blocker. Browsers are already partitioning their cache per first-level site (eventually more granular, maybe per origin or per frame tree). This issue just turned 7 years old. I'll leave it closed because nobody has managed to come up with an idea since. New issues are cheap. I'm still happy to discuss new and specific proposals - I just currently do not believe they exist. |
Do you mean https://terjanq.github.io/Bug-Bounty/Google/cache-attack-06jd2d2mz2r0/index.html? In other words, the consequences are worse than initially thought: it's not only a privacy concern, but also a security concern. For example, it enables attackers to use brute-force attacks to guess private data such as a password saved in Google Keep (because Google Keep loads different assets depending on whether Google Keep's search returns 0 results, as explained in "VI. Google Keep > Vulnerable resource"). I also think it's a fundamental blocker for small assets such as individual images of an icon library. That said, I can still see it being possible to have a shared cache for widespread assets such as React or Angular. Just for the sake of argument, and regardless of feasibility: if websites could declare dependencies in a global manner (e.g. "Google Search" being able to say "I depend on Angular"), then AFAICT this doesn't pose any problems. Another, more interesting example: "Google Keep" could declare that it uses the font "Inter". The interesting thing here is that this doesn't suffer from the security issue I described above, because the dependency is declared globally instead of page-by-page. As for privacy, it's paramount that only widespread assets (e.g. React, Angular, Inter, ...) can be shared-cached; no code unique to a website/page should ever be shared-cached. All in all, I can't see any problems with the high-level goal of enabling websites to globally declare dependencies on widespread assets. Or am I missing something? While it's challenging to find a concrete technique for implementing this high-level goal (e.g. I'm not sure how a "website" can "globally declare" its "dependency" on an "asset"), I think there is still hope. Thanks for the discussion; I'm glad if we can bring everyone interested in this onto the same page. |
This has been investigated in depth by both Google and Mozilla. You're welcome to try again, but the bar at this point is indeed a concrete proposal. This was essentially only found to work if you bundle the libraries with the browser, which creates all kinds of ecosystem problems. |
@mozfreddyb I’m sorry, I’m confused. From latest to earliest of your comments, there is the one I’m replying to, then a procedural one, then one that links to Brad Hill’s writeup (which @brillout already mentioned, although without addressing the cross-origin laundering issue—the mention of |
Fair enough, I agree that I haven't been super clear throughout this thread and that I did not re-read through all of it for every comment I submitted.
By "new-ish" I meant "not captured in Brad Hill's doc". Does that answer your question? -- @annevk said:
Indeed, the bar is "come up with a proposal that addresses all of these issues" AND either avoids bundling a library into the browser, or bundles it but then addresses the (imho) significant ecosystem issues that Anne mentioned. For those new to this, I also found Alex Russell's blog post Cache and Prizes a decent summary of the concerns with bundling. To be extra clear, it's not my intention to gatekeep or block any progress here. I just want to share what we've discussed and considered, because we thought about it for a very long time. All in all, it would be nice if it were solved well. For new, concrete proposals, please open a new issue. |
I'm glad to hear that there is still interest.
Sounds good. Challenge accepted! I'll be mulling over all of this in the coming days. I have a couple of design ideas already. I'll report back. Thanks for giving me the opportunity to (maybe) make a dent here. I'd be honored.
Yes, I agree. As an "underdog OSS developer" (I'm the author of vite-plugin-ssr), I'm particularly attached to fostering innovation.
That's an inherent drawback of any shared cache but, considering other aspects, I actually see a shared cache as a net win for innovation. (I'll elaborate if I manage to find a viable design.)
Yes, it's on my radar as well. While I doubt we can do much about the flat distribution aspect (since
Yes. I strongly believe a shared cache shouldn't be included in the initial browser download. (Or it should include very few things that have, say, a 99% chance of being downloaded within the first few websites the user visits.) A shared cache should grow organically; otherwise it becomes a governance mess. |
The following design (tentatively) addresses all timing attack concerns (regarding both privacy and security). If the feedback is positive, we can create a proper RFC in a new GitHub ticket.

The Assets Manifest

A website declares the shared assets it uses in a manifest and references them by name in its HTML:

// https://my-domain.com/assets.json
{
  "assets": {
    "react": {
      "src": "https://some-cdn.com/react/18.2.0",
      "integrity": "sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC",
      "type": "module"
    }
  }
}

<html>
  <head>
    <script name="react"></script>
  </head>
</html>

Having shared assets defined in a per-domain fashion solves the problem of timing attacks determining the user's activity on a given website. (Assets defined in assets.json apply to the whole domain rather than to individual pages.)

Protect user privacy

In order to protect users from timing attacks retrieving the user's browsing history, the shared cache should behave as in the following example. While it may seem surprising at first, the guarantee that a resource is added to the shared cache only after the browser knows it's used by two distinct domains is actually enough to protect users from privacy attacks. Let me elaborate. Let's consider the following example:

This means that, at this point, all 4 assets (Angular, React, Open Sans font, Inter font) are in the shared cache, which seems like a glaring privacy leak. But it's actually not. While it's true that the combination React + Inter uniquely identifies… To be on the safe side, I still think that the browser should be slightly more conservative and add assets to the shared cache only after the user has visited…

Increase shared cache efficiency

For a library… Edge platforms such as Cloudflare Workers make it relatively easy to implement a performant…

Conclusion

I expect questions (especially around privacy) that I'm happy to answer. If, after discussing this, we come to the conclusion that it's worthwhile to pursue this direction, we can create a proper RFC. I'm very excited about the impact we may achieve, especially for low-end devices and low-end networks. I'm very much looking forward to (critical) feedback. |
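For illustration, under the two-distinct-domains rule sketched above, React would become eligible for the shared cache only once a second, unrelated domain publishes a manifest entry with the same integrity value (the domain name here is made up):

// https://another-domain.com/assets.json (hypothetical second site)
{
  "assets": {
    "react": {
      "src": "https://some-cdn.com/react/18.2.0",
      "integrity": "sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC",
      "type": "module"
    }
  }
}

The shared-cache entry would presumably be keyed by the integrity hash, so the browser only needs to notice that two different first-party domains declared the same hash before it starts serving the asset from the shared cache.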
@brillout I’m not sure I like seeing that bound to a domain, but I do see two upsides:
|
If a set of companies/domains (for example alibaba.com, tencent.com, and so on) uses a library like antdesign-custom.js that is uniquely used by only them, we could maybe uniquely identify whether a user has visited that specific set of domains. |
In that example, I can, across 2 domains (perhaps in an iframe or redirection chain), do a timing attack to work out whether facebook was visited (the first domain fully loads Open Sans, the second may or may not), which then discloses whether google or discord was visited, in turn from the prior timing attack on React + Inter. |
@Summertime @arjunindia Yes, we should take protective measures against this. An example of a very aggressive strategy:

This is a very conservative strategy and I don't think we need to go that far, but it shows that it's within the realm of the possible to address the issue. I'd even argue that such a strategy can be made so effective that we can skip the whole… I chose the… The motivation of this RFC seed is to move the problem from the realm of "very unlikely to ever happen" towards the realm of "possible to implement and worth further investigation".

I like that idea, although I'm thinking maybe we should discuss it in a separate "RFC extension". I'm inclined to keep the conversation focused on the RFC's core propositions and then, at some later point, if we can establish confidence around the RFC, extend the scope of the conversation. |
A weakness here is resources correlating with user interests (a clear-cut example being peer-to-peer libraries). What if a large site starts pinning a version? What if websites around a topic start pinning a version? What's the diversity of requests?

Even when you can't pinpoint a site with certainty, that doesn't mean it doesn't leak information. How do you ensure that any entropy gained from the set of resources in the cache is lost among the noise? How do you do this for all light, medium, and heavy users of the web, regardless of the individual?
Required reading: Differential Privacy primer by minutephysics and the US Census Bureau: https://www.youtube.com/watch?v=pT19VwBAqKA |
There's still the possibility of not automating it, but instead having a central list of assets provided by the browser, similar to the Decentraleyes extension, or maybe just using it: https://decentraleyes.org/ This sidesteps all the privacy issues and still brings large parts of the benefits. The only part it doesn't provide is automatic inclusion of new versions, so it disincentivizes updates of common libraries. |
@AviKav Yes, the shared cache may indeed leak information about the user if it contains a resource that is (almost) only used by one specific kind of website, e.g. some JavaScript library for peer-to-peer websites such as The Pirate Bay, or some font used primarily on websites for a certain age group (e.g. a Pokemon font). Even though it's possible to take further protective measures (e.g. the shared-cache CDN clearly communicating that topic-specific libraries shouldn't be included, the CDN requiring websites to specify their topic, for instance via schema.org, and removing resources from the CDN if a resource correlates with a topic, collecting further statistics about which resources are used on which websites, etc.), I think at this point we would be putting too many requirements on the shared-cache CDN. Ideally a shared-cache CDN shouldn't be too complex. Instead of adding further requirements to shared-cache CDNs, we can tackle the privacy problem from the other side: how can we reduce the opportunities for a malicious entity to mount timing attacks against the shared cache? For example, we could require the browser to use the shared cache only upon a manual user action: if the user navigates to a website by moving her physical mouse over a link or by tapping her touch screen, then and only then are resources loaded from the shared cache. This drastically reduces the opportunity for timing attacks. I'll elaborate more on this later. @ArneBab The idea of whitelisting resources has already been brought up and has been disregarded so far (I believe rightfully so). |
We've had a lot of discussions about using SRI for shared caching (see https://lists.w3.org/Archives/Public/public-webappsec/2015May/0095.html for example). An explicit issue was filed at w3c/webappsec#504 suggesting a sharedcache attribute to imply that shared caching is OK. We should consider leveraging SRI for more aggressive caching.