
Client-Hints exposes fingerprint values to additional parties and logging sensitive locations #767

Closed
pes10k opened this issue Feb 12, 2019 · 36 comments


@pes10k

pes10k commented Feb 12, 2019

From PING:

We're concerned about the privacy implications of moving these attributes to header values, specifically since header values are more likely to wind up in passive / middleman / etc. logs. Existing approaches require active techniques, and so (partially) reduce the fingerprinting risk.

The most on-point issue I can find addressing this is #215, but it isn't quite on point (it does not address the increased risk from moving to passive collection).

I see the text added / modified in 2ba1998 that mentions that "implementors can do otherwise for privacy", but PING is uncomfortable with such text (such text dissolves the point of the standard; if a standard says "it's within this standard to vary arbitrarily", all that is introduced is web-compatibility problems for privacy-oriented parties).

  • What discussion has been had regarding increased information leak into logs?
  • What measurements / data exists to suggest this is not a problem?
@mnot
Member

mnot commented Feb 12, 2019

Hi @snyderp. Be aware that people participate in the IETF as individuals, not organisations; if people in PING (an aside for other readers: this is the W3C Privacy Interest Group) want to engage here, they'll need to do so individually.

My understanding is that the current thinking around CH considers the requirement for a site to request a specific CH (in the form of the Accept-CH header) as effectively "converting" passive collection to active collection; a researcher, for example, can measure how many sites are collecting such metrics, and a browser can alert the user when such collection is taking place.
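
For readers following along, the opt-in flow the draft describes looks roughly like this (a minimal sketch; the host and the specific hints are illustrative):

```http
GET / HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Accept-CH: DPR, Viewport-Width

GET /photo.jpg HTTP/1.1
Host: example.com
DPR: 2.0
Viewport-Width: 320
```

The hints only start flowing after the server has advertised interest via Accept-CH, which is the "conversion" to active collection referred to above.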

There was a substantial amount of discussion around this, including in #215, #372 and elsewhere.

Are you questioning that, or are you only focusing on information leakage into logs?

Regarding the latter, can you illustrate how that would happen, and what the impact would be?

@pes10k
Author

pes10k commented Feb 13, 2019

@mnot thank you for your reply.

My understanding is that the current thinking around CH considers the requirement for a site to request a specific CH (in the form of the Accept-CH header) as effectively "converting" passive collection to active collection; a researcher, for example, can measure how many sites are collecting such metrics, and a browser can alert the user when such collection is taking place.

Researchers and clients already do exactly the things you mentioned. CH doesn't make preventing / measuring / notifying of FP & tracking any easier. I'm similarly not familiar with any plan (and don't see any related comment in CH-tagged issues) to deprecate / remove the existing methods for retrieving the same information, so the responsibility / burden on privacy-focused parties is strictly increased.

There was a substantial amount of discussion around this, including in #215, #372 and elsewhere.

I am aware of this conversation; I referenced a commit mentioned in #215 above, for example :). But in general, I'm not sure I understand the connection of the above issues to the concern here, which is "values in headers get treated categorically differently, and persisted longer, than variables in JS" :)

Are you questioning that, or are you only focusing on information leakage into logs?
Regarding the latter, can you illustrate how that would happen, and what the impact would be?

Neither issue addresses the concerns regarding logging (and more broadly, that putting FP-sensitive values in headers increases the risk of long-term privacy leaks / tracking).

  1. middle parties (CDNs, proxies, other HTTP terminators) get access to data they previously would not have had access to
  2. that data is more likely to be persisted in long-term logs than if it's collected "actively" in JS
  3. HTTP logs, which may now include these values, are often shared with third parties, furthering privacy risk.

I can go on :) but I hope this helps explain / motivate the concern further.

@yoavweiss
Contributor

We're concerned about the privacy implications of moving these attributes to header values, specifically since header values are more likely to wind up in passive / middleman / etc. logs.

I wish you had raised any of those concerns in the previous meetings we had with the PING (e.g. in the F2F breakout session).

Existing approaches require active techniques, and so (partially) reduce the fingerprinting risk.

The common alternative to CH is origins inspecting those values in JS, and injecting them into URL query parameters. That practice seems significantly more "loggable", making Client Hints a clear win on that front.
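
For concreteness, that practice looks roughly like this (a sketch; the endpoint and parameter names are invented):

```js
// Status quo: read fingerprintable values in JS, then ship them in a URL,
// where they land in ordinary access logs.
const beacon = new URL("https://analytics.example/collect");
beacon.searchParams.set("dpr", window.devicePixelRatio);
beacon.searchParams.set("mem", navigator.deviceMemory); // where supported
beacon.searchParams.set("vw", window.innerWidth);
new Image().src = beacon.href; // the full URL, values included, is logged by default
```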

The most on-point issue I can find addressing this is #215, but it isn't quite on point (it does not address the increased risk from moving to passive collection).

Client Hints are not "passive collection". Getting the hints requires an opt-in, which, as @mnot said, makes leaks and abuse of that hint data something that can be tracked, monitored, and acted against (by researchers, extensions, or user agents).

I see the text added / modified in 2ba1998 that mentions that "implementors can do otherwise for privacy", but PING is uncomfortable with such text (such text dissolves the point of the standard; if a standard says "it's within this standard to vary arbitrarily", all that is introduced is web-compatibility problems for privacy-oriented parties).

We are in the process of defining third-party delegation in the Fetch and HTML specifications. Perhaps the PING should focus on reviewing that, rather than the IETF draft language, which at least originally was destined for a broader audience, and therefore tends to be more vague.

  • What discussion has been had regarding increased information leak into logs?

Given that those concerns were not raised before in previous encounters with the PING or other privacy-minded folks, not a whole lot.

  • What measurements / data exists to suggest this is not a problem?

Since you're the one claiming that this is a significant issue, do you have data to suggest it is a problem?

Researchers and clients already do exactly the things you mentioned.

I don't think they can see what different services do in their backends with information provided to them by default. For example, with User-Agent strings, currently most UAs leak a lot of information by default, which makes its collection passive. With Client Hints, we intend to change that information-leak model to make it so origins would have to opt into that data, making that opt-in observable, measurable and actionable.

CH doesn't make preventing / measuring / notifying of FP & tracking any easier.

It does, by turning information (e.g. the UA string or the Accept-Language headers) from "sent by default" to "sent only after the server expressed interest".

I'm similarly not familiar with any plan (and don't see any related comment in CH-tagged issues) to deprecate / remove the existing methods for retrieving the same information, so the responsibility / burden on privacy-focused parties is strictly increased.

If you're not familiar with it, feel free to ask :) @mikewest and I have presented such plans at the F2F TPAC breakout session, which I believe you attended. This is not mentioned in the IETF draft, as it's a feature that will use the CH infrastructure, and as such, is defined elsewhere.

I am aware of this conversation; I referenced a commit mentioned in #215 above, for example :). But in general, I'm not sure I understand the connection of the above issues to the concern here, which is "values in headers get treated categorically differently, and persisted longer, than variables in JS" :)

Trackers rarely inspect fingerprinting data using JS APIs and then keep it in the device's memory. They typically send it to their servers, often as URL query parameters, which are arguably an order of magnitude more likely to be persisted in logs than HTTP headers.

  1. middle parties (CDNs, proxies, other HTTP terminators) get access to data they previously would not have had access to

Since Client-Hints are exposed only over secure connections, I suppose you meant "TLS terminators" above?

CDNs do get access to that data, which enables them to perform their duties and use it for content-negotiation purposes. They also have full access to the content as TLS terminators, so arguably could also get that data by running arbitrary JS on their customers' sites. As CDNs are fully trusted by their customers not to do that unless the customer asks them to, I don't think they are the threat model here.

Also, anecdotally, I've never heard of a CDN that keeps around logs of all their customers' request and response headers. If they did, cookies would likely present a bigger concern than fingerprinting-bit leakage. But if you have examples to the contrary, I'd love to hear them.

As for other TLS terminators (e.g. MITM proxies), they can similarly inject arbitrary JS and leak arbitrary data. I doubt CH increases that attack surface.

  • that data is more likely to be persisted in long-term logs than if it's collected "actively" in JS

I'm not convinced that this is in fact the case. And again, acquiring the data in JS does not equate to keeping that data in JS. It is leaked over the network, typically as URL params.

3. HTTP logs, which may now include these values, are often shared with third parties, furthering privacy risk.

Any specific examples of HTTP logs that contain all header values? I don't see Apache logs doing that, at least not by default. Similarly for Nginx, I don't see any headers there. I do see the URL though.
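
For reference, Apache's stock "combined" format logs the request line (query string included) plus only Referer and User-Agent; capturing any other request header requires deliberate configuration (the DPR line below is an illustrative opt-in, not a default):

```apache
# Default "combined" format: the URL and query string are logged; arbitrary
# request headers are not.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

# Logging a Client Hint would require explicitly adding it:
LogFormat "%h %t \"%r\" %>s \"%{DPR}i\"" with_dpr
```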

@yadij

yadij commented Feb 13, 2019

Squid proxy does have some logging options to record the HTTP headers in full or in part. At least some installations use those as their regular log format, or did so not long ago.

That said, having the CH values as opt-in instead of always present is a clear privacy improvement even for these installations. I do not agree with the argument that CH makes logging privacy/security issues worse.

@jumde

jumde commented Feb 14, 2019

@yoavweiss, @yadij: If I understand correctly, the client / user does not have any control over the CH opt-in. If the server sends an Accept-CH header requesting DPR, ECT and DeviceMemory, the client has no way to omit all / specific values like DeviceMemory.

@yadij

yadij commented Feb 15, 2019

AIUI the opt-in has to be mutual. The server opts in by indicating the CH values it wants to see. The client opts in by choosing whether to actually send that detail as requested.
The server should treat any absence of details the same as it would treat that detail being missing without CH.

For clients which literally cannot send a datum, there is no way they could have sent it regardless of whether CH is supported. So a server relying on its presence with a non-nil value is a server design bug.
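
A sketch of what that looks like server-side (Express-style and illustrative; pickVariant is a hypothetical image-selection helper):

```js
const express = require("express");
const app = express();

// Treat an absent hint exactly like any other missing datum: fall back, don't fail.
app.get("/photo.jpg", (req, res) => {
  const dpr = parseFloat(req.headers["dpr"]) || 1; // declined or absent => default
  const width = parseInt(req.headers["viewport-width"], 10) || 640;
  res.sendFile(pickVariant(dpr, width));
});
```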

@pes10k
Author

pes10k commented Feb 16, 2019

To respond to some of the issues mentioned above, @yoavweiss @yadij:

Re: CH (does / does not) make logging privacy issues worse

I'm having trouble understanding the claim here. I am going to see if I can restate the thinking behind the "CH aids FP prevention / detection" argument. If I am off base here, please kindly correct me; I'm making an honest effort to get on the same page.

The arguments are:

1) Clients can say no more easily

  • Right now sites use a bunch of JS values to determine things like which images to send
  • In a post-CH world, servers will expect clients to respond with headers, and so won't rely on JS values
  • Clients can easily detect when these CH values are being requested, and can choose not to respond
  • Therefore, CH allows clients to protect their privacy, since it's easier to detect and say no than under the status quo

2) logging concerns are a push

  • When sites use JS values to determine these kinds of things, they are often put in URLs, and so wind up in logs already
  • Middle men can already inject JS

3) Deprecation of current JS endpoints

  • CH is a net win because some JS endpoints will be deprecated

Assuming this is a correct understanding of the argument, my concerns are the following:

  • I am aware of the suggestion to deprecate UA access in JS (as you mentioned @yoavweiss, we discussed it at the F2F), but I am not aware of any suggestion to deprecate the other related JS endpoints (and some of them, like viewport dimensions, seem impossible to hide in JS space). If there are suggestions to remove navigator.deviceMemory, DPR values, etc. in some proposal somewhere, I would be very grateful for the link :)
  • I understand the claim that clients can say no to a CH request easily, but the likely scenario is that servers ask for the info in CH, and then, if the client says no, extract it in JS. This is the pattern we see for all sorts of tracking libs; the server / party / JS tries to get as many signals as possible, not just the highest-level one. Adding the info to the CH layer strictly makes the situation no better, and in some cases worse (since, unless the client says no 100% of the time, values are exposed to parties that otherwise wouldn't have them).
  • Point taken that these values can already escape into some URLs. No question there. But there is just a world of difference between every website using its own bespoke URL patterns and a universal, well-structured header field. There is no need for the client to be complicit and make things easier for the tracking party.
  • I appreciate the point that CDN / TLS-terminating parties (not HTTP-terminating, thank you for the correction :) ) can inject JS. The concern here is how much tracking information a curious-but-honest party can see, not a malicious party (since in that case, all bets are off already…). It's not a matter of the server trusting the CDN; the focus is client information winding up in more, easily extractable places.

Also, fwiw, for examples of logs that collect all header data, the mod_log_forensic module, on the website you linked to, is one example. Systems like Snort and Bro want full HTTP header logs, etc. More to the point though, the expectation is that as more trackable information winds up in these headers, more parties will become interested in them. Such is the way of the web ¯\_(ツ)_/¯

@mnot
Member

mnot commented Feb 19, 2019

@snyderp it seems like you're arguing that CH should be a net improvement to privacy, as opposed to current practice. While that's a goal that most everyone here shares, it isn't an explicit criterion that we've been applying to the spec; rather, the bar that I think we've held (perhaps implicitly) is that it's no worse than current practice in any meaningful way.

Is that the case? If so, it'd probably be best to have a discussion explicitly about the goals, so we can determine consensus and then consequences.

@pes10k
Author

pes10k commented Feb 19, 2019

Howdy @mnot: I am not arguing that CH should be a net improvement to privacy. My position is that CH would result in something worse than current practice, for multiple reasons:

  1. Third parties gain access to finger-printable values that they currently don't have access to (finger-printable values would now be in well-structured CH headers and seen by CDNs / TLS terminators, where before they were either absent or at worst sometimes indirectly available in unstructured URLs)
  2. Finger-printable values show up in locations where they are likely to persist for longer (in HTTP logs, per the above)
  3. Finger-printable values are shared with first parties (and delegated parties) in more ways. Previously browsers needed to defend against JS-based determinations; now they need to defend against JS-based determinations AND header extractions.

@mnot
Member

mnot commented Feb 22, 2019

Thanks for the clarification.

I think you need to qualify (1). As discussed, CH is currently only sent over TLS, and so from the perspective of the web, is communication between two parties. It's true that some sites contract out some of their server capabilities to others (CDNs, cloud hosts, data centres, etc.) but those are generally still considered first-party interactions, because the entity in question is acting on behalf of the owner of the server, and that relationship is typically overseen by a legal agreement, as well as various rights and responsibilities in different jurisdictions.

Also, "third party" generally is used to refer to off-site services with a different origin; e.g., ads., so it's a bit confusing to use it here.

Terminology aside, there is some precedent for the argument you're making in (2), but generally when the community has been concerned about sensitive data being logged, it's been because it appears in URLs, which were (originally) designed to be written down, logged, etc. Extending that argument to include headers in an encrypted connection is new. That's not a roadblock, but it would help immensely to establish and get agreement upon the underlying principle.

For (3), I think the basis of the decision is likely to be whether the added functionality -- including privacy improvements, as we're seeing discussed in Mike West's User-Agent replacement discussion -- balances out the increased complexity. More information on both sides would probably help, although I do note that browsers haven't been terribly shy of complexity to date...

Are you likely to be coming to the IETF meeting in Prague? That's probably where we're going to try to move this (and similar issues) forward.

@pes10k
Author

pes10k commented Feb 22, 2019

@mnot Thank you for the reply.

I'm not sure I follow what you're asking for by "qualifying" the concern in (1), though. I'm happy to change terminology if it eases the conversation, but the concern seems pretty straightforward. User X wants to visit website Y, who sits behind CDN (or similar) Z. Currently, if Z wants to extract these fingerprinting-sensitive values from X's conversation with Y, it requires some out-of-band communication with Y (access to Y's DB, or something like that, assuming Y is storing them). In a CH world, Z gets well-structured access to those values, in a way it doesn't currently have. That's a loss of privacy to X that does not currently exist.

Re (2), again, happy to contribute anything I can to the conversation here, but not sure what else can be added beyond the original concern. Is it helpful to frame it as:

"privacy on the web, and in the growth of web standards, has been a demonstrable catastrophe. The first principal when proposing new standards should be "do no harm", or failing that, "here is an extremely compelling goal (e.g. not moderate perf improvements) that justifies further harm to privacy". Adding finger-printable values to HTTP headers, in such a way that they can be easily collected, aggregated and shared en mass, is a significant privacy harm.

(3) The concern here is not about implementation complexity. Just to try and reach some point of agreement, so that we can discuss further from there:

Setting aside the UA aspect, and dealing only with viewport height/width, DPR, ECT, DeviceMemory, etc.: can we agree that CH is strictly a harm for privacy? In the best case (the user agent declines the server's request for the headers, and the site falls back to JS-based value extraction) it's a push, and in all other cases it's a loss (fingerprinting values are in more locations than they were before, and accessible by more parties).

If we don't agree to the above, I would greatly appreciate it if you could clarify by way of the following: what is a scenario where a) the server / website wants access to finger-printable values, and b) the client doesn't want to yield them, in which CH helps improve privacy? (Keeping in mind that a site that currently gets them through JS will ask for them in CH, and if it doesn't get them, continue to ask for them in JS.)

If I could better understand a case that fit the above scenario, it might help me much better understand where y'all are coming from, and where I may be misunderstanding things :)

Unfortunately, I will not be able to join y'all in Prague this year.

@yoavweiss
Contributor

I'm not sure I follow what you're asking for by "qualifying" the concern in (1), though. I'm happy to change terminology if it eases the conversation, but the concern seems pretty straightforward. User X wants to visit website Y, who sits behind CDN (or similar) Z. Currently, if Z wants to extract these fingerprinting-sensitive values from X's conversation with Y, it requires some out-of-band communication with Y (access to Y's DB, or something like that, assuming Y is storing them). In a CH world, Z gets well-structured access to those values, in a way it doesn't currently have. That's a loss of privacy to X that does not currently exist.

CDNs are delegates of the origin and are not considered part of the threat model. A rogue CDN (similar to a compromised server) can do way more damage than read out fingerprintable information.

Setting aside the UA aspect, and dealing only with viewport height/width, DPR, ECT, DeviceMemory, etc.: can we agree that CH is strictly a harm for privacy?

We certainly cannot agree on that! (And repeating it multiple times will not suddenly make it true.)

We went to great lengths to make sure the current mechanism does not allow passive fingerprinting, and its use can be treated by UAs similarly to the equivalent JS APIs. Even if we just discuss viewport-width, DPR, ECT and DeviceMemory, the fact that the information is communicated in a convenient and standard way does not mean that it's easier to exploit compared to the JS APIs. OTOH, it makes it easier to never keep it in, or to scrub it from, server/CDN logs (e.g. compared to the same information hidden behind URL-parameter conventions).

If we don't agree to the above, I would greatly appreciate it if you could clarify by way of the following: what is a scenario where a) the server / website wants access to finger-printable values, and b) the client doesn't want to yield them, in which CH helps improve privacy? (Keeping in mind that a site that currently gets them through JS will ask for them in CH, and if it doesn't get them, continue to ask for them in JS.)

If I could better understand a case that fit the above scenario, it might help me much better understand where y'all are coming from, and where I may be misunderstanding things :)

In your scenario above, how do Client Hints make things worse? If the information is exposed through JS, it is exposed. Making it available through another channel doesn't add any new fingerprintable data that attackers can abuse.

@mnot
Member

mnot commented Feb 25, 2019

Unfortunately, I will not be able to join y'all in Prague this year.

Keep in mind, there is remote participation.

@pes10k
Author

pes10k commented Feb 25, 2019

@mnot I would be very happy to participate in this conversation remotely when it's happening in Prague. Any information you could link to / share about how to participate would be greatly appreciated. I am new to the mysterious world of IETF :)

@mnot
Member

mnot commented Feb 25, 2019

@snyderp welcome :)

General resources listed here:
https://www.ietf.org/about/participate/

Meeting details here:
https://www.ietf.org/how/meetings/104/

Best thing to do is to register for remote participation (free), and watch the agenda (linked from the meeting details above) for links to the audio and video feeds (should be added when the agenda is final). Jabber is also a primary communication channel during the meeting; [email protected].

@pes10k
Author

pes10k commented Feb 26, 2019

CDNs are delegates of the origin and are not considered part of the threat model. A rogue CDN (similar to a compromised server) can do way more damage than read out fingerprintable information.

This is addressing the concern by defining it away. If they're not part of the threat model, then they certainly ought to be. They're distinct parties, with potentially distinct (if any) commitments to the visitor, etc.

To say a CDN could do something worse is unrelated; sure they could. But that is unrelated to whether it is a good idea to provide them with the data they need to easily fingerprint users! The relevant framework here is the honest-but-curious vs malicious distinction.

We went to great lengths to make sure the current mechanism does not allow passive fingerprinting, and its use can be treated by UAs similarly to the equivalent JS APIs.

Just to lower the temperature in the room: I appreciate, and don't mean to denigrate, how much effort you've put into this. I'm sure it's a lot of work, and I'm sincerely grateful folks like you are working hard to find ways to improve the web. :) But it doesn't change the privacy harm in CH as it stands…

Even if we just discuss viewport-width, DPR, ECT and DeviceMemory, the fact that the information is communicated in a convenient and standard way does not mean that it's easier to exploit compared to the JS APIs. OTOH, it makes it easier to never keep it in, or to scrub it from, server/CDN logs (e.g. compared to the same information hidden behind URL-parameter conventions).

More examples:

  1. User disables JS as a tracking countermeasure (e.g. NoScript). A browser implementing CH will end up still leaking FP values.
  2. User has no-op'ed the JS APIs in question to reduce tracking (e.g. Privacy Badger). A browser implementing CH will end up still leaking FP values.
  3. User blocks execution of just JS resources to avoid leaking the above values. A browser implementing CH will end up still leaking those FP values on non-JS resource fetches.
  4. User finds it okay to share the FP fields with the origin (and conceivably even the 3p), but doesn't want to give CDNs and similar convenient access to all of the above (they might doubly want to reduce the chance that the data sticks around for a long time in a log). A browser implementing CH ends up advertising these FP values to anyone between the client and the server.
    Etc.

I understand and take your point that there are cases where a CDN or similar could guess at some of these values from some domains by guessing at URLs, but again, there is a quantity-to-the-point-of-quality difference between CDNs using some manually curated set of per-domain, per-application-version regexes and getting that data in a structured header field.

In your scenario above, how do Client Hints make things worse? If the information is exposed through JS, it is exposed. Making it available through another channel doesn't add any new fingerprintable data that attackers can abuse.

They make it worse by increasing the attack surface the client needs to defend against; I hope the above examples help explain. Also, making the same data available in multiple places (and to a different set of parties), making it much easier for middle parties to log and preserve the FP values, and relying on browsers to deploy additional countermeasures / standards deviations to avoid further privacy harm are all examples of additional privacy risk / loss in the standard.

But happy to have reached a point where we're at least not arguing that CH somehow reduces FP surface (e.g. #767 (comment)) 😀

@pes10k
Author

pes10k commented Feb 26, 2019

Sorry to spam the thread, but this has grown into three related, but distinct, concerns:

  1. FP values winding up in places where they're likely to be long-term persisted (e.g. CDN logs and similar)
  2. A general growth in the number of parties who will have access to FP values who didn't before
  3. That (whether?) it's more of a privacy leak to provide middle parties access to FP values in HTTP headers vs. possibly extracting them from URLs

I'm happy to keep them all in this thread, but if folks would like me to spread them out into different issues, I'm happy to do that too.

@yadij

yadij commented Feb 26, 2019

yoavweiss:

CDNs are delegates of the origin and are not considered part of the threat model. A rogue CDN (similar to a compromised server) can do way more damage than read out fingerprintable information.

snyderp:

This is addressing the concern by defining it away. If they're not part of the threat model, then they certainly ought to be. They're distinct parties, with potentially distinct (if any) commitments to the visitor, etc.

I think what Yoav should have said is that CDNs are considered part of the origin. So there is no separate model. All privacy and security aspects for "origin" also apply to CDNs.

You keep stating that CH expands the exposure. But look at the exposure scopes in the threat matrix I list below.

snyderp:

But happy to have reached a point where we're at least not arguing that CH somehow reduces FP surface (e.g. #767 (comment)) 😀

That was a statement by me. AFAIK Yoav has never held that position.
FTR my position remains that CH (alone) has better privacy than the status quo (JS hacks). Consider these two scenarios:

Scenario 1: a request arrives for some random URL. This URL is stored, logged, and passed around. The response to this request lacks cacheability headers. For performance, vague responses are cached. Due to the permutability of query-string values there may be N copies of this URL+object stored in M caches around the world - for potentially 68-year-long timespans.
=> even with HTTPS protection, the FP data can be retrieved from any of these intermediary data sources long after the transaction is over.

Scenario 2: a request arrives with C-H header details. The URL is clean - so no danger from logging and passing that around within the intermediary system. The request C-H headers (being request headers) are not cached with the response (if they are used by Vary/Key etc. it is in the form of a crypto hash).
=> the client's FP data is only ever in memory during the transaction's active period, is never seen by most of the intermediary system components, and cannot be recovered from long-term storage.

So tell me again how scenario #1 is better for privacy?

snyderp:

In the best case (the user agent declines the server's request for the headers, and the site falls back to JS-based value extraction)

That is actually near the worst case.

The exposure matrix is a 2x3 [ [JS, CH], [send, fallback, omit] ]. So these:

JS-only (the status quo):

  • FP exposure to all HTTP agents along the path
  • FP exposure to all log processors
  • FP exposure to all filesystem agents
  • FP exposure to other networking services on intermediary host
  • FP exposure to any services the above leak URL and/or cache data to
  • expected persistent FP exposure to all the above for up to 68 years

CH-only

  • FP data only by request.
  • conclusion: less FP data, therefore less exposure than status quo.
  • FP exposure to all HTTP agents along the path
  • conclusion: less agent types, therefore less exposure than status quo.

CH with JS fallback

  • client is forewarned of FP actions so can proactively close off JS APIs
  • bias toward CH-only exposure, with risks of status quo amounts of FP exposure.
  • conclusion: less or equal FP exposure to status quo.

JS with CH fallback

  • bias towards status quo FP exposure.
  • conclusion: equal FP exposure to status quo.

Both CH and JS data

  • CH exposure is a sub-set of JS exposure.
  • conclusion: equal FP exposure to status quo.

Neither

  • no FP exposure within scope of the model.

One may argue that agents without JS support are being added to the exposure set. However, I counter that the FP data is already visible to such agents in the form of URL values. The presence of FP data in URLs is already where the worst types of leaks are occurring with the JS-only approach. Simply closing off those major avenues of exposure is the reduction in surface I referred to earlier, and still believe is offered by CH.

@inian
Contributor

inian commented Feb 26, 2019

  1. User disables JS as a tracking countermeasure (e.g. NoScript). A browser implementing CH will end up still leaking FP values.
  2. User has no-op'ed the JS APIs in question to reduce tracking (e.g. Privacy Badger). A browser implementing CH will end up still leaking FP values.
  3. User blocks execution of just JS resources to avoid leaking the above values. A browser implementing CH will end up still leaking those FP values on non-JS resource fetches.
    Etc.

Similar to how users can block scripts from running in their browser, they could block the CH headers from being sent as well (and it is not significantly harder to do so). Privacy-protecting extensions like NoScript, Privacy Badger, etc. could be updated to remove Client Hints from non-origin requests. If a user is willing to trade off performance for privacy, they have the choice to do so that way. Privacy-focused extensions (or browsers) would have a safer way to block this than trying to strip query parameters off a URL, for example.
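
As a rough sketch of how an extension could do that with the (Manifest V2) webRequest API (the header list follows the draft-era hint names and is illustrative):

```js
const CH_HEADERS = ["dpr", "width", "viewport-width", "device-memory", "ect", "rtt", "downlink"];

chrome.webRequest.onBeforeSendHeaders.addListener(
  (details) => ({
    // Drop any Client Hints headers before the request leaves the browser.
    requestHeaders: details.requestHeaders.filter(
      (h) => !CH_HEADERS.includes(h.name.toLowerCase())
    ),
  }),
  { urls: ["<all_urls>"] },
  ["blocking", "requestHeaders"]
);
```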

I also agree that CDNs in front of the origin shouldn't be considered as part of the threat model here. A malicious CDN could do a lot more than logging privacy sensitive values.

@tildelowengrimm

The argument that CH is a no-op seems pretty tenuous. CH takes a collection of things which can be requested by JS and can be put in a URI and puts all of them together in webserver logs. That's a worse privacy outcome. Suggesting that privacy-protecting browsers/extensions might block sending CH in the same way that they block other fingerprintable attributes just seems to admit that CH adds another vector for all the same badness we already have.

The proposal seems to add yet another fingerprinting vector which is maybe only roughly as bad as all the others and can be blocked in the same way. Some people seem to be saying that's not a big deal. I think that's a bad idea and it shouldn't happen.

The web platform ossifies privacy-harming functionality. In practice, it's very difficult for user-agents to defang established fingerprinting techniques. Adding more because they're only roughly as bad as existing ones (NB: I actually think CH is worse, because it's passive and likely to end up in logs) is the wrong direction. We should eliminate existing ways to track people, not add more new ones.

@mnot
Member

mnot commented Feb 26, 2019

CH takes a collection of things which can be requested by JS and can be put in a URI and puts all of them together in webserver logs.

Can you expand upon this? They aren't logged by default in any implementation I'm aware of; one would have to go out of their way to do so in all Web servers, proxies and CDNs that I've ever used.

It's true that servers can go out of their way to log (and then misuse) this information. They can also do so if it's encoded into the URI, added to proprietary request headers, pinged to a server in a request body by active content, and so on.

So I'm having a hard time believing that CH makes things more "likely to end up in logs". If a server, proxy or CDN is accidentally logging this information, they've got much bigger issues with handling sensitive information than just those introduced by CH. If a server is intentionally logging this information, the slight convenience afforded by CH (as opposed to current methods) doesn't seem like it's going to move the needle for them; if they want this information, they're going to get it anyway.

To me, the more interesting difference is that in CH, the wire form is standardised. That cuts both ways; having a standard form makes it slightly easier for a generic server to provide a facility to log / otherwise mangle the information; however, it also makes it easier to identify sensitive information for purposes of research / analysis / blocking / etc.

@pes10k
Author

pes10k commented Feb 27, 2019

Folks, the argument that "middle parties / CDNs can do worse things, so no harm in giving them FP values too" is way off base. Again, an honest-but-curious vs malicious distinction. Saying "X party could do worse" doesn't justify making other abuses easier!

Among many other reasons, it's possible to detect when CDNs misbehave now (at least in the content-injecting ways discussed above). The proposal imagines enabling CDNs to trivially conduct a new type of misbehavior, in a way that cannot be detected.

It's true that servers can go out of their way to log (and then misuse) this information. They can also do so if it's encoded into the URI, added to proprietary request headers, pinged to a server in a request body by active content, and so on.

I don't think this is correct. There are a million bespoke ways this information could be encoded in a URL, some easy to programmatically extract (e.g. query params), some not (e.g. packed into custom-formatted blobs). In the status quo, an observer would need to come up with patterns to cover every conceivable way of packing these values into a URL string, and keep them updated every time an application changes patterns, etc. In a CH world, the values are always nicely formatted, in a consistent place, trivial to extract. This is what I mean by a quantity-to-the-point-of-quality problem. CH turns what is currently a difficult, constantly changing, not-generally-solvable problem into something trivial, exactly the kind of thing that could be trivially automated, aggregated and sold / shared / leaked.

To put it a different way: given a set of 1M requests in a log, would you rather be the person in charge of coding up a system for extracting viewport-width, DPR, ECT, DeviceMemory, etc. values from those logs in a pre- or a post-CH world? :)

however, it also makes it easier to identify sensitive information for purposes of research / analysis / blocking / etc.

Given the amount of literature documenting how often these endpoints are already abused, I don't think this counts as a "win". We already know people abuse this stuff! There is no win in making it easier to count the abuse; the win is in making it harder to conduct the abuse.

@mnot
Member

mnot commented Feb 27, 2019

@snyderp the argument you seem to be making can be generalised to "Let's not standardise any semantics for potentially sensitive data, because if there's a bad actor involved in handling that data, it makes their life easier."

Does that capture it?

@pes10k
Author

pes10k commented Feb 27, 2019

My argument is stronger than "let's give them a difficult time" :)

The argument is that CH turns a problem that could be "solved" for some cases with a great deal of manual effort, fragile rule generation, and ongoing maintenance (since URL-to-FP-value extraction rules would constantly be changing and need to be updated) into something that can be "solved" in all cases with `echo $_SERVER["<whatever>"];`
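
To make the asymmetry concrete, inside a request handler (a Node-style sketch; the URL patterns are invented stand-ins for the bespoke encodings in question):

```js
// Post-CH: one uniform lookup that works for every site.
const dpr = req.headers["dpr"];

// Status quo: per-site, fragile patterns that need constant maintenance.
const patterns = [
  /[?&]dpr=([\d.]+)/,         // site A: query param
  /[?&]pixel_ratio=([\d.]+)/, // site B: different query param
  /\/img\/([\d.]+)x\//,       // site C: packed into the path
];
const match = patterns.map((p) => req.url.match(p)).find(Boolean);
const dprGuess = match ? match[1] : undefined;
```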

@mnot
Member

mnot commented Feb 27, 2019

I think the disconnect here may be in the relationship between CDNs and other "third parties" (e.g., contractors, hosting providers, data centres) and the origin (i.e., the responsible party). IME all of these relationships are highly coordinated and governed by a contract.

Since they're coordinated, working together to effect this sort of extraction is currently fairly trivial; CH doesn't significantly lower the bar. If the "third party" does this sort of thing without the consent of the origin, it's breaking the contract, which it has a strong incentive not to do.

You seem to consider the contractual constraint there as inadequate. Is that closer?

@pes10k
Author

pes10k commented Mar 4, 2019

  1. To the degree there is a contractual constraint[1], it's not going to be known to the client; it would be irresponsible for the browser to assume "the origin has made the middle party(s) promise to protect my privacy".

More importantly though, if the state of web privacy demonstrates anything, it is that parties will quickly emerge that take advantage of (and monetize) any new FP attack surface. The list of FP techniques that initially seemed like "no one would actually do that" but then became widely deployed is long, long, long… (really long!). Adding FP values in CH headers will have the same outcome.

  2. I think you're overstating the degree to which the origin ~= the hosting provider ~= the CDN. In some situations it may make sense to collapse them into "one party". But that's not the case here; they have access to distinct amounts of information, and CH unambiguously expands the number of parties that have access to privacy-harmful information.

Currently, there is no simple way for CDNs / middle parties to gain access to these FP values in a predictable, easy, consistent manner. CH would create a way for the middle parties to have predictable, easy, consistent access to FP values.

Saying "they're both 1p so there is privacy loss" seems like papering over the plain truth of the situation.

  [1] If there is a survey of the promises CDNs / middle parties / the like make about client privacy, I would be extremely interested in it. It doesn't change the fact that sending FP values in CH headers is harmful to web privacy, but it would be extremely interesting to read either way. If you know of such a document, I'd be grateful for a link.

@yoavweiss
Contributor

yoavweiss commented Mar 11, 2019

if the state of web privacy demonstrates anything, it is that parties will quickly emerge that take advantage of (and monetize) any new FP attack surface

That could be true, but Client Hints does not increase the fingerprinting attack surface.

Currently, there is no simple way for CDNs / middle parties to gain access to these FP values in a predictable, easy, consistent manner

They can easily inject scripts that would send them those values in a predictable, easy, consistent manner. There are examples of CDNs injecting scripts as a premium service for analytics or content optimization purposes. Do you have any evidence of well-known CDNs injecting scripts today in order to fingerprint their customers' users without the customer's consent and active participation?

It doesn't change the fact that sending FP values in CH headers is harmful to web privacy

You keep making that unsubstantiated claim without any material evidence to back it up, after which you ask for evidence to the contrary. That's not how it typically works.

@pes10k
Author

pes10k commented Mar 11, 2019

That could be true, but Client Hints does not increase the fingerprinting attack surface

The claim is (still) that CH makes it easier for middle parties to fingerprint users

I claim this because, in the present, middle-parties can only fingerprint users in one of two ways:

  1. trying to extract parameters from URLs (error-prone, difficult, and not possible with a general solution), or
  2. an active attack / traffic tampering (injecting JS, which clients can try to defend against)

The CH proposal would provide these values to middle parties in a way that makes fingerprinting easier and more common. With CH, middle parties could fingerprint users trivially (no need to guess at FP parameters from URLs, etc.), passively (e.g. the honest-but-curious attack scenario), and with a common solution (e.g. reading from the header).

It seems we keep talking past each other. Maybe this could be more productive if we could narrow the conversation. Do you disagree with 1 or 2 above, or the conclusion I draw from them?

You keep making that unsubstantiated claim without any material evidence to back it up, after which you ask for evidence to the contrary. That's not how it typically works.

Surely it's the person proposing the change that could potentially harm the privacy of billions of people on the web who has the burden of proof! Not the person saying "this seems risky, let's make sure…"

@mnot
Member

mnot commented Mar 12, 2019

I claim this because, in the present, middle-parties can only fingerprint users

I think we need to establish consensus about the threat model for what you're calling "middle-parties" before we can mitigate concerns raised about them. AFAICT this is a very new argument.

@pes10k
Author

pes10k commented Mar 12, 2019

I worry that formalism will obscure the facts (namely, that a class of party distinct from the origin will gain access to a new category of privacy-sensitive information), but I'm happy to take a stab at it here if you think it will help the conversation (even if it may need several revisions to get right and tight).

Here is a first stab:

Middle parties are HTTPS terminators, like CDNs, outsourced reverse proxies, etc. They may have commitments to the origin; they do not have commitments to the client (including regarding privacy). While they may resort to malicious attacks (traffic tampering, etc.), the primary concern is the honest-but-curious scenario, where they maximize the utility of the data they can gain w/o breaking protocol (i.e. they squeeze every $ out of every data point they can see, but don't modify traffic, inject JS, etc.).

I hope this helps @mnot. CH is privacy-harmful because it increases the amount of fingerprinting middle parties can do, by taking something they can only occasionally do now (extracting FP parameters from URLs, by trying a variety of faulty, imperfect, non-generalizable pattern-matching strategies) and changing it into something they can do trivially (reading HTTP header values).

@pes10k pes10k changed the title CH, Logging and passive tracking / fingerprinting Client-Hints exposes fingerprint values to additional parties and logging sensitive locations Apr 8, 2019
@diracdeltas

Middle parties are HTTPS terminators, like CDNs, outsourced reverse proxies, etc

Labeling TLS-terminating CDNs as a distinct security party from the server-which-is-behind-the-CDN seems problematic to me.

  1. Clients have no way of distinguishing whether they are talking to a TLS-terminating CDN or a non-CDN'ed site, so this essentially proposes a security boundary which cannot be enforced by clients. This means that any information which should not be revealed to these "middle parties" must be denied for all parties.
  2. There is nothing in any browser's UX that I know of which indicates to the user whether they are talking to a TLS-terminating CDN versus the site's server directly. Even if (1) were possible, this seems like a tricky concept to communicate to users.
  3. Cookies are sent as a header and are usually more of a privacy/security risk than the client-hints header. Does this mean we should deprecate cookies in favor of JS-accessible state storage mechanisms like localStorage?

Re: the claim that headers are more likely to be logged, I too would like to see some data for this. It seems that if this is a concern, we should be equally or more concerned that CDNs are logging cookies (which are also sent as headers).

The most convincing argument I see for not implementing CH at this time is that it adds complexity and another vector which must be blocked when the client blocks data that CH would otherwise send. (For instance, if a user blocks scripts or installs an extension to block certain fingerprinting vectors, the browser must make sure the data is correspondingly blocked in CH.)

@igrigorik
Member

Recapping discussion from IETF 105:

  • The group does not consider the CDN part of the adversarial threat model
  • The CDN can, sometimes, be part of the accidental threat model (misconfiguration, etc.)
  • We should indicate in security considerations that client hints might carry sensitive information and that they should be treated with care — the WIP PR #776 ("Add Sec- prefix to security considerations") is aiming to capture this.

Sparse but relevant minutes from the meeting.

@yoavweiss
Contributor

Given the discussion at IETF 105, can we close this?

@igrigorik
Member

I believe so; I've not heard any followup or rebuttals since our discussions at 105. I'll close this out; if anyone disagrees with the outcome, please feel free to reopen.

@pes10k
Author

pes10k commented Nov 19, 2019

I opened this issue, but was not at IETF. Could you kindly summarize the conclusion, and why the issue is being closed? Thanks!

@yoavweiss
Contributor

#767 (comment) provides a summary
