Client-Hints exposes fingerprint values to additional parties and logging sensitive locations #767

pes10k · 2019-02-12T22:35:51Z

From PING:

We're concerned about the privacy implications of moving these attributes to header values, specifically since header values are more likely to wind up in passive / middle man / etc logs. Existing approaches require active techniques, and so (partially) reduce the fingerprinting risk.

The most on point issue I can find addressing this issue is #215, but this isn't quite on point (does not address increased risk from moving to passive collection).

I see the text added / modified in 2ba1998 that mentions that "implementors can do otherwise for privacy", but PING is uncomfortable with such text ( such text dissolve the point of the standard; a standard that says "its w/in this standard to vary arbitrarily", then all that is introduced is web compatibility problems for privacy oriented parties).

What discussion has been had regarding increased information leak into logs?
What measurements / data exists to suggest this is not a problem?

The text was updated successfully, but these errors were encountered:

mnot · 2019-02-12T23:46:03Z

HI @snyderp. Be aware that people participate in the IETF as individuals, not organisations; if people in PING (aside for other readers: this is the W3C Privacy Interest Group) want to engage here, they'll need to do so individually.

My understanding is that the current thinking around CH considers the requirement for a site to request a specific CH (in the form of the Accept-CH) as effectively "converting" passive collection to active collection; a researcher, for example, can measure how many sites are collecting such metrics, and a browser can alert the user when such collection is taking place.

There was a substantial amount of discussion around this, including in #215, #372 and elsewhere.

Are you questioning that, or are you only focusing on information leakage into logs?

Regarding the latter, can you illustrate how that would happen, and what the impact would be?

pes10k · 2019-02-13T01:11:37Z

@mnot thank you for your reply.

My understanding is that the current thinking around CH considers the requirement for a site to request a specific CH (in the form of the Accept-CH) as effectively "converting" passive collection to active collection; a researcher, for example, can measure how many sites are collecting such metrics, and a browser can alert the user when such collection is taking place.

Researchers and clients already do exactly the things you mentioned. CH doesn't help make preventing / measuring / notifying of FP & tracking any easier. I'm similarly not familiar (and don't see any related comment in CH tagged issues) to deprecate / remove the existing methods for retrieving the same information, so the responsibility / burden on privacy focused parities is strictly increased.

There was a substantial amount of discussion around this, including in #215, #372 and elsewhere.

I am aware of this conversation, i referenced a commit in mentioned in #215 above for example :) . But in general, im not sure I understand the connection to the above issues to the concern here, which is "values in headers get treated categorically differently, and persisted longer, than variables in JS" :)

Are you questioning that, or are you only focusing on information leakage into logs?
Regarding the latter, can you illustrate how that would happen, and what the impact would be?

Neither issue addresses the concerns regarding logging (and more broadly, that putting FP sensitive values in headers increases the risk of long term privacy leaks / tracking).

middle parties (CDN, proxies, other HTTP terminators) get access to data they would previously would not have had access too
that data is more likely to be persisted in long term logs than if its collected "actively" in JS
HTTP logs, that may now include these values, are often shared with third parties, furthering privacy risk.

I can go on :) But i hope this helps explain / motivate the concern further

yoavweiss · 2019-02-13T10:57:12Z

We're concerned about the privacy implications of moving these attributes to header values, specifically since header values are more likely to wind up in passive / middle man / etc logs.

I wish you would have raised any of those concerns in the previous meetings we had with the PING (e.g. in the F2F breakout session).

Existing approaches require active techniques, and so (partially) reduce the fingerprinting risk.

The common alternative to CH is origins inspecting those values in JS, and injecting them into URL query parameters. That practice seems significantly more "loggable", making Client Hints a clear win on that front.

The most on point issue I can find addressing this issue is #215, but this isn't quite on point (does not address increased risk from moving to passive collection).

Client Hints are not "passive collection". Getting the hints requires an opt-in, which as @mnot said, is making leak and abuse of that hint data something that can be tracked, monitored and being acted against. (by researchers, extensions or user agents)

I see the text added / modified in 2ba1998 that mentions that "implementors can do otherwise for privacy", but PING is uncomfortable with such text ( such text dissolve the point of the standard; a standard that says "its w/in this standard to vary arbitrarily", then all that is introduced is web compatibility problems for privacy oriented parties).

We are in the process of defining third-party delegation in the Fetch and HTML specifications. Perhaps the PING should focus on reviewing that, rather than the IETF draft language, which at least originally was destined for a broader audience, and therefore tends to be more vague.

What discussion has been had regarding increased information leak into logs?

Given that those concerns were not raised before in previous encounters with the PING or other privacy-minded folks, not a whole lot.

What measurements / data exists to suggest this is not a problem?

Since you're the one claiming that this is a significant issue, do you have data to suggest it is a problem?

Researchers and clients already do exactly the things you mentioned.

I don't think they can see what different services do in their backends with information provided to them by default. For example, with User-Agent strings, currently most UAs leak a lot of information by default, which makes its collection passive. With Client Hints, we intend to change that information leak model to make it so origins would have to opt-into that data, making that opt-in observable, measurable and actionable.

CH doesn't help make preventing / measuring / notifying of FP & tracking any easier.

It does, by turning information (e.g. the UA string or the Accept-Language headers) from "sent by default" to "sent only after the server expressed interest".

I'm similarly not familiar (and don't see any related comment in CH tagged issues) to deprecate / remove the existing methods for retrieving the same information, so the responsibility / burden on privacy focused parities is strictly increased.

If you're not familiar with it, feel free to ask :) @mikewest and I have presented such plans at the F2F TPAC breakout session, which I believe you attended. This is not mentioned in the IETF draft, as it's a feature that will use the CH infrastructure, and as such, is defined elsewhere.

I am aware of this conversation, i referenced a commit in mentioned in #215 above for example :) . But in general, im not sure I understand the connection to the above issues to the concern here, which is "values in headers get treated categorically differently, and persisted longer, than variables in JS" :)

Trackers rarely inspect fingerprinting data using JS APIs and then keep it on the device's memory. They typically send it to their servers, often as URL query parameters, which are arguably an order of magnitude more likely to be persisted in logs than HTTP headers.

middle parties (CDN, proxies, other HTTP terminators) get access to data they would previously would not have had access too

Since Client-Hints are exposed only over secure connections, I suppose you meant "TLS terminators" above?

CDNs do get access to that data, which enables them to perform their duties and use it for content negotiation purposes. They also have full access to the content as TLS terminators, so arguably could also have access to that data by running arbitrary JS on their customers sites. As CDNs are fully trusted by their customers not to do that unless the customer asks them to, I don't think they are the threat model here.

Also, anecdotally, I've never heard of a CDN that keeps around the logs of all their customers request and response headers. If they would, cookies are likely to present a bigger concern than fingerprinting bit leakage. But if you have examples to the contrary, I'd love to hear them.

As for other TLS terminators (e.g. MITM proxies), they can similarly inject arbitrary JS and leak arbitrary data. I doubt CH increases that attack surface.

that data is more likely to be persisted in long term logs than if its collected "actively" in JS

I'm not convinced that this is in fact the case. And again, acquiring the data in JS does not equate keeping that data in JS. It is leaked over the network, typically as URL params.

3. HTTP logs, that may now include these values, are often shared with third parties, furthering privacy risk.

Any specific examples of HTTP logs that contain all header values? I don't see Apache logs doing that, at least not by default. Similarly for Nginx, I don't see any headers there. I do see the URL though.

yadij · 2019-02-13T20:16:08Z

Squid proxy does have some logging options to record the HTTP headers in full or in part. At least some installations use those as their regular log format, or did so not long ago.

That said, having the CH values as opt-in instead of always present is a clear privacy improvement even for these installations. I do not agree with the argument that CH makes logging privacy/security issues worse.

jumde · 2019-02-14T19:00:50Z

@yoavweiss , @yadij: If I understand correctly, the client/user do not have any control over CH opt-in. If the server sends a Accept-CH header requesting DPR, ECT and DeviceMemory. The client has no way to emit all/specific values like DeviceMemory.

yadij · 2019-02-15T04:47:14Z

AIUI the opt-in has to be mutual. The server opts-in by indicating the CH and values it wants to see. The client opt-in is its choice whether to actually send that detail as requested.
The server should treat any absence of details the same as it would that detail being missing without CH.

For clients who literally cannot send a datum, there is no way they would have been able to send it regardless of whether CH is supported or not. So the server relying on its presence with a non-nil value is a server design bug.

pes10k · 2019-02-16T00:51:19Z

To respond to some of the issues mentioned above @yoavweiss @yadij :

Re: CH (does/does not) makes logging privacy issues worse

I'm having trouble understanding the claim here. I am going to see if I can repeat the thinking behind the argument that "CH aids FP prevention / detection" argument. If I am off base here, please kindly correct, I'm making an honest effort to get on the same page.

The arguments are:

1) Clients can say no easier

Right now sites use a bunch of JS values to determine things like what images to send
In a post-CH world, servers will expect clients to respond with headers, and so wont rely on JS values
Clients can easy detect when these CH values are being requested, and can choose not to respond
Therefor, CH allows clients to protect their privacy, since its easier to detect and say no, than the status quo

2) logging concerns are a push

When sites use JS values to determine these kinds of things, they are often put in URLs, and so wind up in logs already
Middle men can already inject JS

3) Depreciation of current JS end points

CH is a net win because some JS end points will be deprecated

Assuming this is a correct understanding of the argument, my concerns are the following

I am aware of the suggestion to deprecate UA access in JS (as you mentioned @yoavweiss we discussed it at the F2F), but I am not aware of any suggestion to depreciate the other related JS endpoints (and some of them, like viewport dimensions, seem impossible to hide in JS space). If there are suggestions to remove navigator.deviceMemory, DPR values, etc in some proposal somewhere, I would be very grateful for the link :)
I understand the claim that clients can say no to a CH request easily, but the likely scenario is that servers ask for the info in CH, and then if the client says no, they extract it in JS. This is the pattern we see for all sorts of tracking libs; the server / party / JS tries to get as many signals as possible, not just the highest level one. Adding the info to the CH layer strictly makes the situation no better, and in some cases worse (since, unless the client says no 100% of the time, values are exposed to parties that otherwise wouldn't have them).
Point taken that these values can already escape into some URLs. No question there. But there is just a world of difference between every website using its own, bespoke URL patterns, and a universal, well structured header field. There is no need for the client to be complicit and make things easier for the tracking party.
I appreciate the point that CDN / TLS terminating parties (not HTTP terminating, thank you for the correction :) ) can inject JS. The concern here is how much tracking information a curious but honest party can see, not a malicious party (since in that case, all bets are off already…). Its not a matter of the server trusting the CDN, the focus is client information winding up in more, easily extractable places.

Also, fwiw, for examples of logs that collect all header data, the mod_log_forensic module, on the website you linked too, is one example. Systems like snort and bro, want full HTTP header logs, etc. More to the point though, the expectation is that as more track-able information winds up in these headers, more parties will become interested in them. Such is the way of the web ¯_(ツ)_/¯

mnot · 2019-02-19T04:58:32Z

@snyderp it seems like you're arguing that CH should be a net improvement of privacy, as opposed to current practice. While that's a goal that most everyone here shares, it isn't an explicit criteria that we've been applying to the spec; rather, the bar that I think we've held (perhaps implicitly) is that it's no worse than current practice, in any meaningful way.

Is that the case? If so, it'd probably be best to have a discussion explicitly about the goals, so we can determine consensus and then consequences.

pes10k · 2019-02-19T20:39:46Z

Howdy @mnot: I am not arguing that CH should be a net improvement of privacy. My position is that CH is would result in something worse than current practice, for multiple reasons:

Third parties gain access to finger-printable values that they currently don't have access too (finger-printable values would now be in well structured CH headers and seen by CDNs / TLS terminators, where before they were either absent or ay worse sometimes indirectly available in unstructured URLs)
Finger-printable values are show up in locations where they are likely to persist for longer (in HTTP logs for above)
Finger-printable values are shared to first parties (and delegated parties) in more ways. Previously browsers needed to defend against JS based determinations, now they need to defend against JS based determinations AND header-extractions.

mnot · 2019-02-22T02:52:58Z

Thanks for the clarification.

I think you need to qualify (1). As discussed, CH is currently only sent over TLS, and so from the perspective of the web, is communication between two parties. It's true that some sites contract out some of their server capabilities to others (CDNs, cloud hosts, data centres, etc.) but those are generally still considered first-party interactions, because the entity in question is acting on behalf of the owner of the server, and that relationship is typically overseen by a legal agreement, as well as various rights and responsibilities in different jurisdictions.

Also, "third party" generally is used to refer to off-site services with a different origin; e.g., ads., so it's a bit confusing to use it here.

Terminology aside, there is some precedent for the argument you're making in (2), but generally when the community has been concerned about sensitive data being logged, it's been because it appears in URLs, which were (originally) designed to be written down, logged, etc. Extending that argument to include headers in an encrypted connection is new. That's not a roadblock, but it would help immensely to establish and get agreement upon the underlying principle.

For (3), I think the basis of the decision is likely to be whether the added functionality -- including privacy improvements, as we're seeing discussed in Mike West's User-Agent replacement discussion -- balances out the increased complexity. More information on both sides would probably help, although I do note that browsers haven't been terribly shy of complexity to date...

Are you likely to be coming to the IETF meeting in Prague? That's probably where we're going to try to move this (and similar issues) forward.

pes10k · 2019-02-22T23:06:49Z

@mnot Thank you for the reply.

I'm not sure I follow what you're asking for by "qualifying" the concern in (1) though. I'm happy to change terminology if it eases the conversation, but the concern seems pretty straight forward. User X wants to visit website Y who sits behind CDN (or similar) Z. Currently, if Z wants to extract these fingerprinting-sensitive values from X's conversation with Y, it requires some out of band communication with Y (access to Y's DB, or something like that, assuming Y is storing them). In a CH-world, Z gets well-structured access to those values, in a way they don't currently have. Thats a loss of privacy to X that does not currently exist.

Re (2) again, happy to contribute anything I can to the conversation here, but not sure what else can be added beyond the original concern. Is it helpful frame it as:

"privacy on the web, and in the growth of web standards, has been a demonstrable catastrophe. The first principal when proposing new standards should be "do no harm", or failing that, "here is an extremely compelling goal (e.g. not moderate perf improvements) that justifies further harm to privacy". Adding finger-printable values to HTTP headers, in such a way that they can be easily collected, aggregated and shared en mass, is a significant privacy harm.

(3) The concern here is not about implementation complexity. Just to try and reach some point of agreement, so that we can discuss further from there:

Setting aside the UA aspect, and dealing only with viewport height/width, DPR, ECT and DeviceMemory, etc: can we agree that CH is strictly a harm for privacy? In the best case (user agent declines the server's request for the headers, site falls back to JS based value extraction) its a push, and in all other cases its a loss (fingerprinting values are in more locations than they were before, and accessible by more parties).

If we don't agree to the above, i would greatly appreciate it if you could clarify by way of the following: what is a scenario where a) the server / website wants to get access to finger-printable values, and b) the client doesn't want to yield them, that CH helps improve privacy? (keeping in mind that a site that currently gets them through JS, will ask for the in CH, and if they don't get them, continue to ask for the in JS).

If I could better understand a case that fit the above scenario, it might help me much better understand where ya'll are coming from, and where I may be misunderstanding things :)

Unfortunately, I will not be able to join ya'll in Prague this year

yoavweiss · 2019-02-24T07:31:08Z

I'm not sure I follow what you're asking for by "qualifying" the concern in (1) though. I'm happy to change terminology if it eases the conversation, but the concern seems pretty straight forward. User X wants to visit website Y who sits behind CDN (or similar) Z. Currently, if Z wants to extract these fingerprinting-sensitive values from X's conversation with Y, it requires some out of band communication with Y (access to Y's DB, or something like that, assuming Y is storing them). In a CH-world, Z gets well-structured access to those values, in a way they don't currently have. Thats a loss of privacy to X that does not currently exist.

CDNs are delegates of the origin and are not considered part of the threat model. A rogue CDN (similar to a compromised server) can do way more damage than read out fingerprintable information.

Setting aside the UA aspect, and dealing only with viewport height/width, DPR, ECT and DeviceMemory, etc: can we agree that CH is strictly a harm for privacy?

We certainly can not agree on that! (and repeating that multiple times will not suddenly make it true)

We went through great lengths to make sure the current mechanism does not allow passive fingerprinting, and its use can be treated by UAs similarly to the equivalent JS APIs. Even if we just discuss the viewport-width, DPR, ECT and DeviceMemory, the fact that the information is communicated in a convenient and standard way does not mean that it's easier to exploit it compared to the JS APIs. OTOH, it makes it easier to never keep it in or scrub it from server/CDN logs (e.g. compared to the same information hidden as URL parameter conventions).

If we don't agree to the above, i would greatly appreciate it if you could clarify by way of the following: what is a scenario where a) the server / website wants to get access to finger-printable values, and b) the client doesn't want to yield them, that CH helps improve privacy? (keeping in mind that a site that currently gets them through JS, will ask for the in CH, and if they don't get them, continue to ask for the in JS).

If I could better understand a case that fit the above scenario, it might help me much better understand where ya'll are coming from, and where I may be misunderstanding things :)

In your scenario above, how do Client Hints make things worse? If the information is exposed through JS, it is exposed. Making it available through another channel doesn't add any new fingerprintable data that attackers can abuse.

mnot · 2019-02-25T06:45:48Z

Unfortunately, I will not be able to join ya'll in Prague this year

Keep in mind, there is remote participation.

pes10k · 2019-02-25T23:39:59Z

@mnot I would be very happy to participate in this conversation remotely when its happening in Prague. Any information you could link to / share about how to participate would be greatly appreciated. I am new to the mysterious world of IETF :)

mnot · 2019-02-25T23:47:45Z

@snyderp welcome :)

General resources listed here:
https://www.ietf.org/about/participate/

Meeting details here:
https://www.ietf.org/how/meetings/104/

Best thing to do is to register for remote participation (free), and watch the agenda (linked from the meeting details above) for links to the audio and video feeds (should be added when the agenda is final). Jabber is also a primary communication channel during the meeting; [email protected].

pes10k · 2019-02-26T00:20:27Z

CDNs are delegates of the origin and are not considered part of the threat model. A rogue CDN (similar to a compromised server) can do way more damage than read out fingerprintable information.

This is addressing the concern by defining it away. If they're not part of the threat model, then they certainly aught to be. They're distinct parties, with potentially distinct (if any) commitments to the visitor, etc.

To say a CDN could do something worse is unrelated; sure they could. But that is unrelated to whether it is a good idea to provide them with the data they need to easily finger print users! The relevant framework here is an honest-and-curious vs malicious distinction.

We went through great lengths to make sure the current mechanism does not allow passive fingerprinting, and its use can be treated by UAs similarly to the equivalent JS APIs.

Just to lower the temperature in the room: I appreciate and don't mean to denigrate how much effort you've put into this. I'm sure its a lot of work, and I'm sincerely grateful folks like you are working hard to find ways to improve the web. :) But it doesn't change the privacy harm in CH as it stands…

Even if we just discuss the viewport-width, DPR, ECT and DeviceMemory, the fact that the information is communicated in a convenient and standard way does not mean that it's easier to exploit it compared to the JS APIs. OTOH, it makes it easier to never keep it in or scrub it from server/CDN logs (e.g. compared to the same information hidden as URL parameter conventions).

More examples:

User disables JS as a tracking counter measure (e.g. no script). A browser implementing CH will end up still leaking FP values
User has no-op'ed the JS api's in question to reduce tracking (e.g. privacy badger). A browser implementing CH will end up still leaking FP values
User blocks execution of just JS resources to avoid leaking the above vales. A browser implementing CH will end up still leaking those FP values on non-JS resource fetches
etc.
User finds it okay to share the FP fields to the origin (and conceivably, even the 3p), but doesn't want to give CDN's and similar convenient access to all of the above. They might doubly want to reduce the chance that the data sticks around for a long time in a log). A browser implementing CH ends up advertising these FP values to anyone between the client and the server.

I understand and take your point that there are cases where a CDN or similar could guess at some of these values from some domains buy guessing at URLs, but again, there is a quantity-to-the-point-of-quality difference between CDNs using some manually curated set of per-domain-per-application-version regexs, and providing that data in a structured header field.

In your scenario above, how do Client Hints make things worse? If the information is exposed through JS, it is exposed. Making it available through another channel doesn't add any new fingerprintable data that attackers can abuse.

They make it worse by increasing the attack surface the client needs to defend against. I hope the above examples help explain. Also, making the same data in multiple places (and a different set of parties), making it much easier for middle parties to log and preserve the FP values, and relying on browsers to deploy additional countermeasure / standards-deviations to avoid further privacy harm are all examples of additional privacy risk / loss in the standard.

But happy to have reached a point where we're at least not arguing that CH somehow reduces FP surface: (e.g. #767 (comment)) 😀

pes10k · 2019-02-26T00:45:01Z

sorry to spam the thread but this has grown to three related, but distinctl concerns:

FP values winding up in places where they're likely to be long term persisted (e.g. CDN logs and similar)
A general growth in the number of parties who will have access to FP values, who didn't before
That (whether?) its more of a privacy leak to provide middle-parties access in FP values in HTTP Headers vs. possibly extracting from URLs

I'm happy to keep them all in this thread, but if folks would like me to spread out to different issues, i'm happy to do so too

yadij · 2019-02-26T05:12:31Z

yoavweiss :

CDNs are delegates of the origin and are not considered part of the threat model. A rogue CDN (similar to a compromised server) can do way more damage than read out fingerprintable information.

snyderp:

This is addressing the concern by defining it away. If they're not part of the threat model, then they certainly aught to be. They're distinct parties, with potentially distinct (if any) commitments to the visitor, etc.

I think what Yoav should have said is that CDN are considered part of the origin. So there is no separate model. All privacy and security aspects for "origin" also apply to CDN.

You keep stating that CH expands the exposure. But look at the exposure scopes in the threat matrix I list below.

snyderp :

But happy to have reached a point where we're at least not arguing that CH somehow reduces FP surface: (e.g. #767 (comment)) grinning

That was a statement by me. AFAIK yoav has never had that position.
FTR my position remains that CH (alone) has better privacy than the status quo (JS hacks). Consider these two scenarios:

scenario 1: a request arrives for some random URL. This URL is stored, logged, and passed around. The response to this request has a response lacking cacheability headers. For performance vague responses are cached. Due to the permutability of query-string values there may be N copies of this URL+object stored in M caches around the world - for potentially 68 year long timespans.
=> even with HTTPS protection the FP data can be retrieved from any of these intermediary data sources long after the transaction is over.

scenario 2: a request arrives with C-H header details. The URL is clean - so no danger from logging and passing that around within the intermediary system. The request C-H headers (being request headers) are not cached with the response (if they are used by Vary/Key etc it is in the form of a crypto hash).
=> the clients FP data is only ever in memory during the transaction active period, is never seen by most of the intermediary system components, and cannot be recovered from the long-term storage.

So tell me again how scenario #1 is better for privacy?

synderp :

In the best case (user agent declines the server's request for the headers, site falls back to JS based value extraction)

That is actually a case near worst-case.

The exposure matrix is a 2x3 [ [JS, CH], [send, fallback, omit] ]. So these:

JS-only (the status quo):

FP exposure to all HTTP agents along the path
FP exposure to all log processors
FP exposure to all filesystem agents
FP exposure to other networking services on intermediary host
FP exposure to any services the above leak URL and/or cache data to
expected persistent FP exposure to all the above for up to 68 years

CH-only

FP data only by request.
conclusion: less FP data, therefore less exposure than status quo.
FP exposure to all HTTP agents along the path
conclusion: less agent types, therefore less exposure than status quo.

CH with JS fallback

client is forwarned of FP actions so can proactively close off JS APIs
bias toward CH-only exposure, with risks of status quo amounts of FP exposure.
conclusion: less or equal FP exposure to status quo.

JS with CH fallback

bias towards status quo FP exposure.
conclusion: equal FP exposure to status quo.

Both CH and JS data

CH exposure is a sub-set of JS exposure.
conclusion: equal FP exposure to status quo.

Neither

no FP exposure within scope of the model.

One may argue that agents without JS support are being added to the exposure set. However I counter that the FP data is already visible to such agents in the form of URL values. The presence of FP data in URLs is already where the worst types of leaks are occuring with the JS-only approach. Simply closing off those major avenues of exposure is the reduction in surface I referred to earlier and still believe is offered by CH.

inian · 2019-02-26T20:51:48Z

User disables JS as a tracking counter measure (e.g. no script). A browser implementing CH will end up still leaking FP values

User has no-op'ed the JS api's in question to reduce tracking (e.g. privacy badger). A browser implementing CH will end up still leaking FP values

User blocks execution of just JS resources to avoid leaking the above vales. A browser implementing CH will end up still leaking those FP values on non-JS resource fetches
etc.

Similar to how the user can block scripts from running in his browser, he could block the CH headers from being sent as well (and it is not significantly harder to do so). These privacy protecting extensions like noscript, privacy badger, etc. could be updated to remove Client Hints from non-origin requests. If the user is willing to trade off performance for privacy, he has the choice to do so that way. Privacy focused extensions (or browsers) would have a safe way to block this than trying to strip off query parameters off a URL for example.

I also agree that CDNs in front of the origin shouldn't be considered as part of the threat model here. A malicious CDN could do a lot more than logging privacy sensitive values.

tildelowengrimm · 2019-02-26T23:09:45Z

The argument that CH is a no-op seems pretty tenuous. CH takes a collection of things which can be requested by JS and can be put in a URI and puts all of them together in webserver logs. That's a worse privacy outcome. Suggesting that privacy-protecting browsers/extensions might block sending CH in the same way that they block other fingerprintable attributes just seems to admit that CH adds another vector for all the same badness we already have.

The proposal seems to add yet another fingerprinting vector which is maybe only roughly as bad as all the others and can be blocked in the same way. Some people seem to be saying that's not a big deal. I think that's a bad idea and it shouldn't happen.

The web platform ossifies privacy-harming functionality. In practice, it's very difficult for user-agents to defang established fingerprinting techniques. Adding more because they're only roughly as bad as existing ones (NB: I actually think CH is worse, because it's passive and likely to end up in logs) is the wrong direction. We should eliminate existing ways to track people, not add more new ones.

mnot · 2019-02-26T23:54:32Z

CH takes a collection of things which can be requested by JS and can be put in a URI and puts all of them together in webserver logs.

Can you expand upon this? They aren't logged by default in any implementation I'm aware of; one would have to go out of their way to do so in all Web servers, proxies and CDNs that I've ever used.

It's true that servers can go out of their way to log (and then misuse) this information. They can also do so if it's encoded into the URI, added to proprietary request headers, pinged to a server in a request body by active content, and so on.

So I'm having a hard time believing that CH makes things more "likely to end up in logs". If a server, proxy or CDN is accidentally logging this information, they've got much bigger issues with handling sensitive information than just those introduced by CH. If a server is intentionally logging this information, the slight convenience afforded by CH (as opposed to current methods) doesn't seem like it's going to move the needle for them; if they want this information, they're going to get it anyway.

To me, the more interesting difference is that in CH, the wire form is standardised. That cuts both ways; having a standard form makes it slightly more easy for a generic server to provide a facility to log / otherwise mangle the information; however, it also makes it easier to identify sensitive information for purposes of research / analysis / blocking / etc.

pes10k · 2019-02-27T00:07:27Z

Folks the argument that "middle parties / CDNs can do worse things, so no harm in giving them FP values too" is way off base. Again, an honest-but-curious vs malicious distinction. Saying "X party could do worse" doesn't justify making it easier to do other things easily!

Among many other reasons, its possible to detect when CDNs misbehave now (at least in the content injecting ways discussed above). The proposal imagines enabling CDNs to trivially conduct a new type of misbehavior, in a way that cannot be detected

It's true that servers can go out of their way to log (and then misuse) this information. They can also do so if it's encoded into the URI, added to proprietary request headers, pinged to a server in a request body by active content, and so on.

I don't think this is correct. There are a million bespoke ways this information could be encoded in a URL, some easy to pragmatically extract (e.g. query params), some not (e.g. packed into custom formatted blobs). In the status quo, an observer would need to come up with patterns to cover every conceivable way of packing these values into a URL string, and keep it updated every time an application changes patterns, etc. In a CH world, the values are always nicely formatted, in a consistent place, trivial to extract. This is what I mean by a quantity-to-the-point-of-quality problem. CH makes whats currently a difficult, constantly changing, not-generally-solveable problem into something trivial, exactly the kind of thing that could be trivially automated, aggregated and sold / shared / leaked.

To put it a different way, given a set of 1m requests in a log, would you rather be the person in charge of coding up a system for extracting viewport-width, DPR, ECT, DeviceMemory, etc values from those logs in a pre, or a post, CH world. :)

however, it also makes it easier to identify sensitive information for purposes of research / analysis / blocking / etc.

Given the amount of literature documenting how often these end points are already abused, I don't think this counts as a "win". We already know people abuse this stuff! There is no win in making it easier to count the abuse; the win is in making it harder to conduct the abuse.

mnot · 2019-02-27T00:13:20Z

@snyderp the argument you seem to be making can be generalised to "Let's not standardise any semantics for potentially sensitive data, because if there's a bad actor involved in handling that data, it makes their life easier."

Does that capture it?

pes10k · 2019-02-27T00:27:58Z

My argument is stronger than "lets give them a difficult time" :)

The argument is that CH turns a problem that could be "solved" for some cases with a great deal of manual effort, fragile rule generation, and ongoing maintenance (since URL -> FP value extraction rules would constantly be changing and need to be updated), into something that can be "solved" in all cases with echo $_SERVER["<whatever>"];

mnot · 2019-02-27T00:36:29Z

I think the disconnect here may be in the relationship between CDNs and other "third parties" (e.g., contractors, hosting providers, data centres) and the origin (i.e., the responsible party). IME all of these relationships are highly coordinated and governed by a contract.

Since they're coordinated, working together effect this sort of extraction currently is fairly trivial; CH doesn't significantly lower the bar. If the "third party" does this sort of thing without the consent of the origin, it's breaking the contract, which they have a strong incentive not to do.

You seem to consider the contractual constraint there as inadequate. Is that closer?

pes10k · 2019-03-04T22:21:55Z

To the degree there is a contractual constraint[1], its not going to be known to the client; it would be irresponsible for the browser to assume "the origin has made the middle party(s) promise to protect my privacy".

More importantly though, if the state of web privacy demonstrates anything, it is that there will quickly become parties that take advantage of (and monetize) any new FP attack surface. The list of FP techniques that initially seemed like "no one would actually do that", but then became widely deployed is long long long… (really long!). Adding FP values in CH headers will have the same outcome

I think you're overstating the degree to which the origin ~= the hosting provider ~= CDN. In some situations they may make sense to collapse into "one party". But thats not the case here; they have access to distinct amounts of information, and CH unambiguously expands the number of parties that have access to privacy-harmful information.

Currently, there is no simple way for CDNs / middle parties to gain access to these FP values in a predictable, easy, consistent manner. CH would create a way for the middle parties to have predictable, easy, consistent access to FP values.

Saying "they're both 1p so there is privacy loss" seems like papering over the plain truth of the situation.

if there is a survey of the promises CDNs / middle parties / the like make about client privacy, I would be extremely interested in it. It doesn't change the fact that sending FP values in CH headers is harmful to web privacy, but it would be extremely interesting to read either way. If you know of such a document, I'd be grateful for a link.

yoavweiss · 2019-03-11T11:00:17Z

if the state of web privacy demonstrates anything, it is that there will quickly become parties that take advantage of (and monetize) any new FP attack surface

That could be true, but Client Hints does not increase the fingerprinting attack surface

Currently, there is no simple way for CDNs / middle parties to gain access to these FP values in a predictable, easy, consistent manner

They can easily inject scripts that would send them those values in a predictable, easy, consistent manner. There are examples of CDNs injecting scripts as a premium service for analytics or content optimization purposes. Do you have any evidence of well-known CDNs injecting scripts today in order to fingerprint their customers' users without the customer's consent and active participation?

It doesn't change the fact that sending FP values in CH headers is harmful to web privacy

You keep making that unsubstantiated claim without any material evidence to back it up, after which you ask for evidence to the contrary. That's not how it typically works.

pes10k · 2019-03-11T19:28:48Z

That could be true, but Client Hints does not increase the fingerprinting attack surface

The claim is (still) that CH makes it easier for middle parties to fingerprint users

I claim this because, in the present, middle-parties can only fingerprint users in one of two ways:

trying to extract parameters from URLs (error prone, difficult and not possible with a general solution), or
an active attack / traffic tampering (injecting JS, which clients can try to defend against)

The CH proposal would provide these values to middle parties in a way that makes finger printing easier and more common. With CH, middle parties could fingerprint users trivially (no need to guess at FP parameters from URLs, etc), passively (e.g. honest but curious attack scenario), and with a common solution (e.g. reading from the header).

It seems we keep talking past each other. Maybe this could be more productive if we could narrow the conversation. Do you disagree with 1 or 2 above, or the conclusion I draw from them?

You keep making that unsubstantiated claim without any material evidence to back it up, after which you ask for evidence to the contrary. That's not how it typically works.

Surely it's the person proposing the change that could potentially harm the privacy of billions of people on the web who has the burden of proof! Not the person saying "this seems risky, lets make sure…"

mnot · 2019-03-12T04:48:45Z

I claim this because, in the present, middle-parties can only fingerprint users

I think we need to establish consensus about the threat model for what you're calling "middle-parties" before we can mitigate concerns raised about them. AFAICT this is a very new argument.

pes10k · 2019-03-12T18:39:22Z

I worry that formalism will obscure the facts (which is that a class of party distinct from the origin will gain access to a new category of privacy sensitive information), but I'm happy to take a stab at it here if you think it will be helpful for the conversation (even if it may need several revisions to get right and tight).

Here is a first stab:

Middle parties are HTTPS terminators, like CDNs, outsourced reverse proxies, etc. They may have commitments to the origin, they do not have commitments to the client (including regarding privacy). While they may resort to malicious attacks (traffic tampering, etc) the primary concern is the honest-but-curious scenario, where they maximize the utility of the data they can gain w/o breaking protocol (i.e. they squeeze every $ out of every data point they can see, but don't modify traffic, inject JS, etc).

I hope this helps @mnot . CH is privacy harmful because it increases the amount of fingerprinting middle parties can do, by taking something they can only occasionally do now (extracting FP parameters from URLs, by trying a variety of faulty, imperfect, non-generalizeable pattern matching strategies) and changing it into something they can do trivially (reading HTTP header values).

diracdeltas · 2019-04-13T02:38:21Z

Middle parties are HTTPS terminators, like CDNs, outsourced reverse proxies, etc

Labeling TLS-terminating CDNs as a distinct security party from the server-which-is-behind-the-CDN seems problematic to me.

Clients have no way of distinguishing whether they are talking to a TLS-terminating CDN or a non-CDN'ed site, so this essentially proposes a security boundary which cannot be enforced by clients. This means that any information which should not be revealed to these "middle parties" must be denied for all parties.
There is nothing in any browser's UX that I know of which indicates to the user whether they are talking to a TLS-terminating CDN versus the site's server directly. Even if (1) were possible, this seems like a tricky concept to communicate to users.
Cookies are sent as a header and are usually more of a privacy/security risk than the client-hints header. Does this mean we should deprecate cookies in favor of JS-accessible state storage mechanisms like localStorage?

Re: the claim that headers are more likely to be logged, I too would like to see some data for this. It seems that if this is a concern, we should be equally or more concerned that CDNs are logging cookies (which are also sent as headers).

The most convincing argument I see for not implementing CH at this time is that it adds complexity and another vector which must be blocked when the client blocks data that CH would otherwise send. (For instance, if a user blocks scripts or installs an extension to block certain fingerprinting vectors, the browser must make sure the data is correspondingly blocked in CH.)

igrigorik · 2019-07-25T18:52:59Z

Recapping discussion from IETF 105..

The group does not consider the CDN as adversarial threat model
The CDN can, sometimes, be part of accidental threat model (misconfiguration, etc)
We should indicate in security considerations that client hints might carry sensitive information and that they should be treated with care — WIP Add Sec- prefix to security considerations #776 PR is aiming to capture this.

Sparse but relevant minutes from the meeting.

yoavweiss · 2019-11-19T10:16:53Z

Given the discussion at IETF 105, can we close this?

igrigorik · 2019-11-19T15:51:57Z

I believe so, I've not heard any followup or rebuttals since our discussions at 105. I'll close this out, if anyone disagrees with the outcome please feel free to reopen.

pes10k · 2019-11-19T19:18:11Z

I opened this issue, but was not at IETF. Could you kindly summarize the conclusion, and why the issue is being closed? Thanks!

yoavweiss · 2019-11-19T21:32:40Z

#767 (comment) provides a summary

pes10k mentioned this issue Feb 12, 2019

Data motivating CH? #768

Closed

jumde mentioned this issue Mar 1, 2019

Disable Client-Hints in brave brave/brave-browser#3539

Closed

mnot added the client-hints label Mar 12, 2019

pes10k changed the title ~~CH, Logging and passive tracking / fingerprinting~~ Client-Hints exposes fingerprint values to additional parties and logging sensitive locations Apr 8, 2019

pes10k mentioned this issue Apr 8, 2019

Active vs. passive fingerprinting #786

Closed

jumde mentioned this issue Apr 12, 2019

Issue 3539: Disabling client hints brave/brave-core#2205

Merged

19 tasks

diracdeltas mentioned this issue Apr 17, 2019

Client Hints brave-experiments/standards-positions#1

Open

yoavweiss mentioned this issue Jul 25, 2019

Add Sec- prefix to security considerations #776

Merged

igrigorik closed this as completed Nov 19, 2019

ronancremin mentioned this issue Feb 14, 2020

Partial freezing of the User-Agent string w3ctag/design-reviews#467

Closed

1 task

ronancremin mentioned this issue Mar 18, 2021

Missing evidence for core problem addressed by specification WICG/ua-client-hints#215

Open

Client-Hints exposes fingerprint values to additional parties and logging sensitive locations #767

Client-Hints exposes fingerprint values to additional parties and logging sensitive locations #767

Comments

pes10k commented Feb 12, 2019

mnot commented Feb 12, 2019

pes10k commented Feb 13, 2019

yoavweiss commented Feb 13, 2019

yadij commented Feb 13, 2019

jumde commented Feb 14, 2019

yadij commented Feb 15, 2019

pes10k commented Feb 16, 2019

1) Clients can say no easier

2) logging concerns are a push

3) Depreciation of current JS end points

mnot commented Feb 19, 2019

pes10k commented Feb 19, 2019

mnot commented Feb 22, 2019

pes10k commented Feb 22, 2019

yoavweiss commented Feb 24, 2019

mnot commented Feb 25, 2019

pes10k commented Feb 25, 2019

mnot commented Feb 25, 2019

pes10k commented Feb 26, 2019 • edited Loading

pes10k commented Feb 26, 2019

yadij commented Feb 26, 2019

inian commented Feb 26, 2019

tildelowengrimm commented Feb 26, 2019

mnot commented Feb 26, 2019

pes10k commented Feb 27, 2019

mnot commented Feb 27, 2019

pes10k commented Feb 27, 2019

mnot commented Feb 27, 2019

pes10k commented Mar 4, 2019

yoavweiss commented Mar 11, 2019 • edited Loading

pes10k commented Mar 11, 2019

mnot commented Mar 12, 2019

pes10k commented Mar 12, 2019

diracdeltas commented Apr 13, 2019

igrigorik commented Jul 25, 2019

yoavweiss commented Nov 19, 2019

igrigorik commented Nov 19, 2019

pes10k commented Nov 19, 2019

yoavweiss commented Nov 19, 2019

pes10k commented Feb 26, 2019 •

edited

Loading

yoavweiss commented Mar 11, 2019 •

edited

Loading