Client-Hints exposes fingerprint values to additional parties and logging sensitive locations #767
Hi @snyderp. Be aware that people participate in the IETF as individuals, not organisations; if people in PING (an aside for other readers: this is the W3C Privacy Interest Group) want to engage here, they'll need to do so individually. My understanding is that the current thinking around CH considers the requirement for a site to request a specific CH (in the form of the Accept-CH header) as effectively "converting" passive collection to active collection; a researcher, for example, can measure how many sites are collecting such metrics, and a browser can alert the user when such collection is taking place. There was a substantial amount of discussion around this, including in #215, #372 and elsewhere. Are you questioning that, or are you only focusing on information leakage into logs? Regarding the latter, can you illustrate how that would happen, and what the impact would be?
@mnot thank you for your reply.
Researchers and clients already do exactly the things you mentioned; CH doesn't make preventing / measuring / notifying of FP & tracking any easier. I'm similarly not aware of any plan (and don't see any related comment in CH-tagged issues) to deprecate / remove the existing methods for retrieving the same information, so the responsibility / burden on privacy-focused parties is strictly increased.
I am aware of this conversation; I referenced a commit mentioned in #215 above, for example :). But in general, I'm not sure I understand the connection between the above issues and the concern here, which is "values in headers get treated categorically differently, and persisted longer, than variables in JS" :)
Neither issue addresses the concerns regarding logging (and more broadly, that putting FP sensitive values in headers increases the risk of long term privacy leaks / tracking).
I can go on :) But I hope this helps explain / motivate the concern further.
I wish you would have raised any of those concerns in the previous meetings we had with the PING (e.g. in the F2F breakout session).
The common alternative to CH is origins inspecting those values in JS, and injecting them into URL query parameters. That practice seems significantly more "loggable", making Client Hints a clear win on that front.
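For readers less familiar with that practice, here is a minimal sketch of the status-quo pattern being described (the beacon endpoint and parameter names are made up for illustration):

```js
// Status quo: read fingerprintable values via JS APIs, then leak them into
// a URL query string, where they land in ordinary access logs.
const params = new URLSearchParams({
  dpr: String(window.devicePixelRatio),
  vw: String(window.innerWidth),
  mem: String(navigator.deviceMemory || ''),                     // Chrome-only API
  ect: String((navigator.connection || {}).effectiveType || ''), // Chrome-only API
});
// The values now travel in the request line, visible to anything that logs URLs.
navigator.sendBeacon('https://analytics.example/collect?' + params.toString());
```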
Client Hints are not "passive collection". Getting the hints requires an opt-in, which, as @mnot said, makes leakage and abuse of that hint data something that can be tracked, monitored, and acted against (by researchers, extensions or user agents).
We are in the process of defining third-party delegation in the Fetch and HTML specifications. Perhaps the PING should focus on reviewing that, rather than the IETF draft language, which at least originally was destined for a broader audience, and therefore tends to be more vague.
Given that those concerns were not raised before in previous encounters with the PING or other privacy-minded folks, not a whole lot.
Since you're the one claiming that this is a significant issue, do you have data to suggest it is a problem?
I don't think they can see what different services do in their backends with information provided to them by default. For example, with
It does, by turning information (e.g. the UA string or the
If you're not familiar with it, feel free to ask :) @mikewest and I have presented such plans at the F2F TPAC breakout session, which I believe you attended. This is not mentioned in the IETF draft, as it's a feature that will use the CH infrastructure, and as such, is defined elsewhere.
Trackers rarely inspect fingerprinting data using JS APIs and then keep it in the device's memory. They typically send it to their servers, often as URL query parameters, which are arguably an order of magnitude more likely to be persisted in logs than HTTP headers.
Since Client-Hints are exposed only over secure connections, I suppose you meant "TLS terminators" above? CDNs do get access to that data, which enables them to perform their duties and use it for content negotiation purposes. They also have full access to the content as TLS terminators, so arguably could also have access to that data by running arbitrary JS on their customers' sites. As CDNs are fully trusted by their customers not to do that unless the customer asks them to, I don't think they are the threat model here. Also, anecdotally, I've never heard of a CDN that keeps around logs of all their customers' request and response headers. If they did, cookies would likely present a bigger concern than fingerprinting bit leakage. But if you have examples to the contrary, I'd love to hear them. As for other TLS terminators (e.g. MITM proxies), they can similarly inject arbitrary JS and leak arbitrary data. I doubt CH increases that attack surface.
I'm not convinced that this is in fact the case. And again, acquiring the data in JS does not equate to keeping that data in JS. It is leaked over the network, typically as URL params.
Any specific examples of HTTP logs that contain all header values? I don't see Apache logs doing that, at least not by default. Similarly for Nginx, I don't see any headers there. I do see the URL though.
Squid proxy does have some logging options to record the HTTP headers in full or in part. At least some installations use those as their regular log format, or did so not long ago. That said, having the CH values as opt-in instead of always present is a clear privacy improvement even for these installations. I do not agree with the argument that CH makes logging privacy/security issues worse.
@yoavweiss, @yadij: If I understand correctly, the client/user do not have any control over CH opt-in. If the server sends an Accept-CH header requesting DPR, ECT and DeviceMemory, the client has no way to omit all or specific values like DeviceMemory.
AIUI the opt-in has to be mutual. The server opts in by indicating the CH and values it wants to see. The client's opt-in is its choice whether to actually send that detail as requested. For clients who literally cannot send a datum, there is no way they could have sent it regardless of whether CH is supported or not. So a server relying on its presence with a non-nil value is a server design bug.
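To make the mutual opt-in concrete, here is a sketch of the exchange (header names per the drafts; values and paths are illustrative):

```http
# 1. The client's first request carries no hints.
GET / HTTP/1.1
Host: example.com

# 2. The server opts in, naming the hints it wants.
HTTP/1.1 200 OK
Accept-CH: DPR, Viewport-Width, Device-Memory

# 3. The client may honor that request on subsequent requests -- or not.
GET /style.css HTTP/1.1
Host: example.com
DPR: 2
Viewport-Width: 412
Device-Memory: 4
```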
To respond to some of the issues mentioned above, @yoavweiss @yadij:

Re: CH (does / does not) make logging privacy issues worse: I'm having trouble understanding the claim here. I am going to see if I can restate the thinking behind the "CH aids FP prevention / detection" argument. If I am off base here, please kindly correct me; I'm making an honest effort to get on the same page. The arguments are:
1) Clients can say no more easily
2) Logging concerns are a push
3) Deprecation of current JS endpoints
Assuming this is a correct understanding of the argument, my concerns are the following
Also, FWIW, for examples of logs that collect all header data: the mod_log_forensic module, on the website you linked to, is one example. Systems like Snort and Bro want full HTTP header logs, etc. More to the point though, the expectation is that as more trackable information winds up in these headers, more parties will become interested in them. Such is the way of the web ¯\_(ツ)_/¯
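It is also worth noting how little configuration header logging takes. A sketch for Nginx (goes inside the http block; the hint headers are this thread's examples, exposed via Nginx's standard $http_<name> variables):

```nginx
# Capture selected request headers alongside the usual access-log fields.
log_format with_hints '$remote_addr "$request" $status '
                      '"$http_dpr" "$http_viewport_width" "$http_device_memory"';
access_log /var/log/nginx/access.log with_hints;
```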
@snyderp it seems like you're arguing that CH should be a net improvement for privacy, as opposed to current practice. While that's a goal that most everyone here shares, it isn't an explicit criterion that we've been applying to the spec; rather, the bar that I think we've held (perhaps implicitly) is that it's no worse than current practice in any meaningful way. Is that the case? If so, it'd probably be best to have a discussion explicitly about the goals, so we can determine consensus and then consequences.
Howdy @mnot: I am not arguing that CH should be a net improvement for privacy. My position is that CH would result in something worse than current practice, for multiple reasons:
Thanks for the clarification. I think you need to qualify (1). As discussed, CH is currently only sent over TLS, and so from the perspective of the web, is communication between two parties. It's true that some sites contract out some of their server capabilities to others (CDNs, cloud hosts, data centres, etc.), but those are generally still considered first-party interactions, because the entity in question is acting on behalf of the owner of the server, and that relationship is typically overseen by a legal agreement, as well as various rights and responsibilities in different jurisdictions. Also, "third party" is generally used to refer to off-site services with a different origin (e.g., ads), so it's a bit confusing to use it here.

Terminology aside, there is some precedent for the argument you're making in (2), but generally when the community has been concerned about sensitive data being logged, it's been because it appears in URLs, which were (originally) designed to be written down, logged, etc. Extending that argument to include headers in an encrypted connection is new. That's not a roadblock, but it would help immensely to establish and get agreement upon the underlying principle.

For (3), I think the basis of the decision is likely to be whether the added functionality is worth it -- including the privacy improvements we're seeing discussed in Mike West's proposal.

Are you likely to be coming to the IETF meeting in Prague? That's probably where we're going to try to move this (and similar issues) forward.
@mnot Thank you for the reply. I'm not sure I follow what you're asking for by "qualifying" the concern in (1), though. I'm happy to change terminology if it eases the conversation, but the concern seems pretty straightforward. User X wants to visit website Y, who sits behind CDN (or similar) Z. Currently, if Z wants to extract these fingerprinting-sensitive values from X's conversation with Y, it requires some out-of-band communication with Y (access to Y's DB, or something like that, assuming Y is storing them). In a CH world, Z gets well-structured access to those values, in a way it doesn't currently have. That's a loss of privacy to X that does not currently exist. Re (2), again, happy to contribute anything I can to the conversation here, but not sure what else can be added beyond the original concern. Is it helpful to frame it as:
(3) The concern here is not about implementation complexity. Just to try and reach some point of agreement, so that we can discuss further from there: setting aside the UA aspect, and dealing only with viewport height/width, DPR, ECT, DeviceMemory, etc., can we agree that CH is strictly a harm for privacy? In the best case (the user agent declines the server's request for the headers, and the site falls back to JS-based value extraction) it's a push, and in all other cases it's a loss (fingerprinting values are in more locations than they were before, and accessible by more parties).

If we don't agree on the above, I would greatly appreciate it if you could clarify by way of the following: what is a scenario where a) the server / website wants access to fingerprintable values, and b) the client doesn't want to yield them, in which CH improves privacy? (Keeping in mind that a site that currently gets them through JS will ask for them in CH, and if it doesn't get them, will continue to ask for them in JS.) If I could better understand a case that fits the above scenario, it might help me understand where y'all are coming from, and where I may be misunderstanding things :)

Unfortunately, I will not be able to join y'all in Prague this year.
CDNs are delegates of the origin and are not considered part of the threat model. A rogue CDN (similar to a compromised server) can do far more damage than reading out fingerprintable information.
We certainly can not agree on that! (And repeating it multiple times will not suddenly make it true.) We went to great lengths to make sure the current mechanism does not allow passive fingerprinting, and its use can be treated by UAs similarly to the equivalent JS APIs. Even if we just discuss viewport-width, DPR, ECT and DeviceMemory, the fact that the information is communicated in a convenient and standard way does not mean that it's easier to exploit than the JS APIs. OTOH, it makes it easier to never keep it in, or to scrub it from, server/CDN logs (e.g. compared to the same information hidden behind URL parameter conventions).
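A sketch of that scrubbing point, assuming a hypothetical log pipeline; the argument is only that CH names form a fixed, standardised list, while FP data hidden in URL params has no fixed shape:

```js
// Hypothetical log-scrubbing step: Client Hints have standardised header
// names, so dropping them before persisting a log entry is a fixed-list
// filter. The equivalent scrub for URL-borne FP data would need per-site
// knowledge of each site's query-string conventions.
const CLIENT_HINTS = ['dpr', 'viewport-width', 'device-memory', 'ect', 'rtt', 'downlink'];

function scrubHints(requestHeaders) {
  return Object.fromEntries(
    Object.entries(requestHeaders)
      .filter(([name]) => !CLIENT_HINTS.includes(name.toLowerCase()))
  );
}

// e.g. scrubHints({ 'DPR': '2', 'User-Agent': '...' }) => { 'User-Agent': '...' }
```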
In your scenario above, how do Client Hints make things worse? If the information is exposed through JS, it is exposed. Making it available through another channel doesn't add any new fingerprintable data that attackers can abuse.
Keep in mind, there is remote participation.
@mnot I would be very happy to participate in this conversation remotely when it's happening in Prague. Any information you could link to / share about how to participate would be greatly appreciated. I am new to the mysterious world of IETF :)
@snyderp welcome :) General resources listed here: Meeting details here: Best thing to do is to register for remote participation (free), and watch the agenda (linked from the meeting details above) for links to the audio and video feeds (should be added when the agenda is final). Jabber is also a primary communication channel during the meeting; [email protected].
This is addressing the concern by defining it away. If they're not part of the threat model, then they certainly ought to be. They're distinct parties, with potentially distinct (if any) commitments to the visitor, etc. To say a CDN could do something worse is beside the point; sure, they could. But that is unrelated to whether it is a good idea to provide them with the data they need to easily fingerprint users! The relevant framework here is an honest-and-curious vs malicious distinction.
Just to lower the temperature in the room: I appreciate and don't mean to denigrate how much effort you've put into this. I'm sure it's a lot of work, and I'm sincerely grateful folks like you are working hard to find ways to improve the web. :) But it doesn't change the privacy harm in CH as it stands…
More examples:
I understand and take your point that there are cases where a CDN or similar could recover some of these values from some domains by guessing at URL patterns, but again, there is a quantity-to-the-point-of-quality difference between CDNs using some manually curated set of per-domain, per-application-version regexes, and providing that data in a structured header field.
They make it worse by increasing the attack surface the client needs to defend against. I hope the above examples help explain. Also, having the same data in multiple places (exposed to a different set of parties), making it much easier for middle parties to log and preserve the FP values, and relying on browsers to deploy additional countermeasures / standards deviations to avoid further privacy harm are all examples of additional privacy risk / loss in the standard. But happy to have reached a point where we're at least not arguing that CH somehow reduces FP surface (e.g. #767 (comment)) 😀
Sorry to spam the thread, but this has grown to three related, but distinct, concerns:
I'm happy to keep them all in this thread, but if folks would like me to spread out to different issues, I'm happy to do so too.
yoavweiss:
snyderp:
I think what Yoav should have said is that CDNs are considered part of the origin. So there is no separate model: all privacy and security aspects for "origin" also apply to CDNs. You keep stating that CH expands the exposure, but look at the exposure scopes in the threat matrix I list below. snyderp:
That was a statement by me. AFAIK Yoav has never held that position.

Scenario 1: a request arrives for some random URL. This URL is stored, logged, and passed around. The response lacks cacheability headers; for performance, such vague responses are cached anyway. Due to the permutability of query-string values, there may be N copies of this URL+object stored in M caches around the world - for potentially 68-year-long timespans.

Scenario 2: a request arrives with C-H header details. The URL is clean, so there is no danger from logging and passing it around within the intermediary system. The request C-H headers (being request headers) are not cached with the response (if they are used by Vary/Key etc., it is in the form of a crypto hash). (See the sketch at the end of this comment.)

So tell me again how scenario #1 is better for privacy? snyderp:
That is actually near the worst case. The exposure matrix is 2x3: [ [JS, CH], [send, fallback, omit] ]. So these: JS-only (the status quo):
CH-only
CH with JS fallback
JS with CH fallback
Both CH and JS data
Neither
One may argue that agents without JS support are being added to the exposure set. However, I counter that the FP data is already visible to such agents in the form of URL values. The presence of FP data in URLs is already where the worst types of leaks are occurring with the JS-only approach. Simply closing off those major avenues of exposure is the reduction in surface I referred to earlier and still believe is offered by CH.
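To illustrate the two scenarios described earlier in this comment (paths and values are made up):

```http
# Scenario 1: FP data rides in the query string. The full URL is what gets
# logged, cached and passed around, one copy per permutation of values.
GET /page?dpr=2&vw=412&mem=4 HTTP/1.1
Host: example.com

# Scenario 2: FP data rides in request headers. The logged/cached URL stays
# clean, and request headers are not stored with the cached response.
GET /page HTTP/1.1
Host: example.com
DPR: 2
Viewport-Width: 412
Device-Memory: 4
```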
Similar to how the user can block scripts from running in their browser, they could block the CH headers from being sent as well (and it is not significantly harder to do so). Privacy-protecting extensions like NoScript, Privacy Badger, etc. could be updated to remove Client Hints from non-origin requests. If the user is willing to trade off performance for privacy, they have the choice to do so that way. Privacy-focused extensions (or browsers) would have a safer way to block this than trying to strip query parameters off a URL, for example. I also agree that CDNs in front of the origin shouldn't be considered part of the threat model here. A malicious CDN could do a lot more than log privacy-sensitive values.
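As a sketch of what such an extension update could look like (Manifest V2 webRequest API; the hint list, and the choice to strip on all URLs rather than only non-origin requests, are illustrative):

```js
// Strip Client Hints from outgoing requests, the same way extensions already
// strip or spoof other fingerprintable signals.
// (Requires the webRequest and webRequestBlocking permissions in the manifest.)
const CLIENT_HINTS = new Set(['dpr', 'viewport-width', 'device-memory', 'ect', 'rtt', 'downlink']);

chrome.webRequest.onBeforeSendHeaders.addListener(
  (details) => ({
    requestHeaders: details.requestHeaders.filter(
      (header) => !CLIENT_HINTS.has(header.name.toLowerCase())
    ),
  }),
  { urls: ['<all_urls>'] },
  ['blocking', 'requestHeaders']
);
```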
The argument that CH is a no-op seems pretty tenuous. CH takes a collection of things which can be requested by JS and can be put in a URI and puts all of them together in webserver logs. That's a worse privacy outcome. Suggesting that privacy-protecting browsers/extensions might block sending CH in the same way that they block other fingerprintable attributes just seems to admit that CH adds another vector for all the same badness we already have. The proposal seems to add yet another fingerprinting vector which is maybe only roughly as bad as all the others and can be blocked in the same way. Some people seem to be saying that's not a big deal. I think that's a bad idea and it shouldn't happen. The web platform ossifies privacy-harming functionality. In practice, it's very difficult for user-agents to defang established fingerprinting techniques. Adding more because they're only roughly as bad as existing ones (NB: I actually think CH is worse, because it's passive and likely to end up in logs) is the wrong direction. We should eliminate existing ways to track people, not add more new ones.
Can you expand upon this? They aren't logged by default in any implementation I'm aware of; one would have to go out of their way to do so in all Web servers, proxies and CDNs that I've ever used. It's true that servers can go out of their way to log (and then misuse) this information. They can also do so if it's encoded into the URI, added to proprietary request headers, pinged to a server in a request body by active content, and so on. So I'm having a hard time believing that CH makes things more "likely to end up in logs". If a server, proxy or CDN is accidentally logging this information, they've got much bigger issues with handling sensitive information than just those introduced by CH. If a server is intentionally logging this information, the slight convenience afforded by CH (as opposed to current methods) doesn't seem like it's going to move the needle for them; if they want this information, they're going to get it anyway. To me, the more interesting difference is that in CH, the wire form is standardised. That cuts both ways: having a standard form makes it slightly easier for a generic server to provide a facility to log / otherwise mangle the information; however, it also makes it easier to identify sensitive information for purposes of research / analysis / blocking / etc.
Folks, the argument that "middle parties / CDNs can do worse things, so no harm in giving them FP values too" is way off base. Again, note the honest-but-curious vs malicious distinction. Saying "X party could do worse" doesn't justify making other kinds of misbehavior easier! Among many other reasons, it's possible to detect when CDNs misbehave now (at least in the content-injecting ways discussed above). The proposal imagines enabling CDNs to trivially conduct a new type of misbehavior, in a way that cannot be detected.
I don't think this is correct. There are a million bespoke ways this information could be encoded in a URL, some easy to programmatically extract (e.g. query params), some not (e.g. packed into custom-formatted blobs). In the status quo, an observer would need to come up with patterns to cover every conceivable way of packing these values into a URL string, and keep them updated every time an application changes patterns, etc. In a CH world, the values are always nicely formatted, in a consistent place, trivial to extract. This is what I mean by a quantity-to-the-point-of-quality problem. CH turns what's currently a difficult, constantly changing, not-generally-solvable problem into something trivial, exactly the kind of thing that could be trivially automated, aggregated and sold / shared / leaked. To put it a different way: given a set of 1M requests in a log, would you rather be the person in charge of coding up a system for extracting viewport-width, DPR, ECT, DeviceMemory, etc. values from those logs in a pre-CH or a post-CH world? :)
Given the amount of literature documenting how often these endpoints are already abused, I don't think this counts as a "win". We already know people abuse this stuff! There is no win in making it easier to count the abuse; the win is in making it harder to conduct the abuse.
@snyderp the argument you seem to be making can be generalised to "Let's not standardise any semantics for potentially sensitive data, because if there's a bad actor involved in handling that data, it makes their life easier." Does that capture it? |
My argument is stronger than "let's give them a difficult time" :) The argument is that CH turns a problem that could be "solved" for some cases with a great deal of manual effort, fragile rule generation, and ongoing maintenance (since URL-to-FP-value extraction rules would constantly be changing and need to be updated), into something that can be "solved" in all cases with a trivial, standardised header lookup (see the sketch below).
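A sketch of that contrast (the legacy rule is a made-up example of one site's URL convention; nothing here is from a real deployment):

```js
// Pre-CH: extracting FP values from URLs means curating fragile, per-site,
// per-version rules that break whenever a site changes its conventions.
const LEGACY_RULES = [
  { host: 'shop.example', pattern: /[?&]d=(\d+)x(\d+)@([\d.]+)/ }, // hypothetical
  // ...one rule per site, per application version, maintained forever
];

// Post-CH: the same extraction is a site-independent header lookup.
function readHints(headers) {
  return {
    dpr: headers['dpr'],
    viewportWidth: headers['viewport-width'],
    deviceMemory: headers['device-memory'],
  };
}
```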
I think the disconnect here may be in the relationship between CDNs and other "third parties" (e.g., contractors, hosting providers, data centres) and the origin (i.e., the responsible party). IME all of these relationships are highly coordinated and governed by a contract. Since they're coordinated, working together to effect this sort of extraction is currently fairly trivial; CH doesn't significantly lower the bar. If the "third party" does this sort of thing without the consent of the origin, it's breaking the contract, which it has a strong incentive not to do. You seem to consider the contractual constraint there as inadequate. Is that closer?
More importantly though, if the state of web privacy demonstrates anything, it is that parties will quickly emerge to take advantage of (and monetize) any new FP attack surface. The list of FP techniques that initially seemed like "no one would actually do that", but then became widely deployed, is long long long… (really long!). Adding FP values in CH headers will have the same outcome.
Currently, there is no simple way for CDNs / middle parties to gain access to these FP values in a predictable, easy, consistent manner. CH would create a way for the middle parties to have predictable, easy, consistent access to FP values. Saying "they're both 1p, so there is no privacy loss" seems like papering over the plain truth of the situation.
That could be true, but Client Hints does not increase the fingerprinting attack surface.
They can easily inject scripts that would send them those values in a predictable, easy, consistent manner. There are examples of CDNs injecting scripts as a premium service for analytics or content optimization purposes. Do you have any evidence of well-known CDNs injecting scripts today in order to fingerprint their customers' users without the customer's consent and active participation?
You keep making that unsubstantiated claim without any material evidence to back it up, after which you ask for evidence to the contrary. That's not how it typically works.
The claim is (still) that CH makes it easier for middle parties to fingerprint users. I claim this because, in the present, middle parties can only fingerprint users in one of two ways: 1) actively (e.g. by injecting scripts, which is detectable and breaks their commitments to the origin), or 2) by trying to extract FP values from URLs, using fragile, per-site guesswork.
The CH proposal would provide these values to middle parties in a way that makes fingerprinting easier and more common. With CH, middle parties could fingerprint users trivially (no need to guess at FP parameters from URLs, etc.), passively (e.g. the honest-but-curious attack scenario), and with a common solution (i.e. reading from the header). It seems we keep talking past each other. Maybe this could be more productive if we narrowed the conversation: do you disagree with 1 or 2 above, or with the conclusion I draw from them?
Surely it's the person proposing a change that could potentially harm the privacy of billions of people on the web who has the burden of proof! Not the person saying "this seems risky, let's make sure…"
I think we need to establish consensus about the threat model for what you're calling "middle parties" before we can mitigate concerns raised about them. AFAICT this is a very new argument.
I worry that formalism will obscure the facts (which are that a class of party distinct from the origin will gain access to a new category of privacy-sensitive information), but I'm happy to take a stab at it here if you think it will help the conversation (even if it may need several revisions to get right and tight). Here is a first stab:

Middle parties are HTTPS terminators, like CDNs, outsourced reverse proxies, etc. They may have commitments to the origin; they do not have commitments to the client (including regarding privacy). While they may resort to malicious attacks (traffic tampering, etc.), the primary concern is the honest-but-curious scenario, where they maximize the utility of the data they can gain without breaking protocol (i.e. they squeeze every $ out of every data point they can see, but don't modify traffic, inject JS, etc.).

I hope this helps @mnot. CH is privacy-harmful because it increases the amount of fingerprinting middle parties can do, by taking something they can only occasionally do now (extracting FP parameters from URLs, by trying a variety of faulty, imperfect, non-generalizable pattern-matching strategies) and changing it into something they can do trivially (reading HTTP header values).
Labeling TLS-terminating CDNs as a distinct security party from the server-which-is-behind-the-CDN seems problematic to me.
Re: the claim that headers are more likely to be logged, I too would like to see some data for this. It seems that if this is a concern, we should be equally or more concerned that CDNs are logging cookies (which are also sent as headers). The most convincing argument I see for not implementing CH at this time is that it adds complexity and another vector which must be blocked when the client blocks data that CH would otherwise send. (For instance, if a user blocks scripts or installs an extension to block certain fingerprinting vectors, the browser must make sure the data is correspondingly blocked in CH.)
Recapping discussion from IETF 105…
Sparse but relevant minutes from the meeting.
Given the discussion at IETF 105, can we close this?
I believe so; I've not heard any followup or rebuttals since our discussions at 105. I'll close this out; if anyone disagrees with the outcome, please feel free to reopen.
I opened this issue, but was not at IETF. Could you kindly summarize the conclusion, and why the issue is being closed? Thanks!
#767 (comment) provides a summary |
From PING:
We're concerned about the privacy implications of moving these attributes to header values, specifically since header values are more likely to wind up in passive / middle man / etc logs. Existing approaches require active techniques, and so (partially) reduce the fingerprinting risk.
The closest issue I can find addressing this is #215, but it isn't quite on point (it does not address the increased risk from moving to passive collection).
I see the text added / modified in 2ba1998 that mentions that "implementors can do otherwise for privacy", but PING is uncomfortable with such text (such text dissolves the point of the standard; if a standard says "it's within this standard to vary arbitrarily", then all that is introduced is web-compatibility problems for privacy-oriented parties).