
Add privacy considerations #99

Closed

yamdan wants to merge 13 commits from yamdan-privacy-considerations

Conversation

@yamdan (Contributor) commented May 12, 2023

This PR is a draft of the Privacy Considerations section, which consists of (1) data leakage in selective disclosure schemes and (2) unlinkability, based on the discussion in #84. While we could additionally mention a kind of tradeoff between these two notions, I have not dealt with it because I am not yet confident that such a tradeoff does indeed exist...

Fixes #84.



@gkellogg (Member) left a comment

This is a great analysis, and I made a couple of suggestions, but we may evolve the text further. I note that some of the text discusses alternatives (HMAC instead of a hash, re-ordering quads) that don't represent the decisions actually taken by the spec. I'm wondering if we need to provide a rationale, or suggest other ways to mitigate these vulnerabilities that are compatible with the spec.

@yamdan yamdan marked this pull request as ready for review May 15, 2023 05:48
@yamdan yamdan requested a review from dlongley as a code owner May 15, 2023 05:48
<p>By making the canonicalization process private, we can prevent a brute-force attacker from
observing labeling changes by trying multiple possible attribute values.
For example, we can use an HMAC instead of a hash function in the canonicalization algorithm. Alternatively, we can add
a secret random nonce (always undisclosed) to the dataset.
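
A minimal sketch of the two mitigations above, assuming a hypothetical canonicalizer that accepts a pluggable hash function (the `make_keyed_hash` helper and the nonce quad below are illustrative, not part of the spec):

```python
import hashlib
import hmac
import secrets

def sha256_hash(data: bytes) -> bytes:
    # Default public hash: anyone can recompute labels and test guesses.
    return hashlib.sha256(data).digest()

def make_keyed_hash(key: bytes):
    # Keyed variant: labeling is only reproducible with the secret key,
    # so an attacker cannot brute-force attribute values offline.
    def keyed_hash(data: bytes) -> bytes:
        return hmac.new(key, data, hashlib.sha256).digest()
    return keyed_hash

key = secrets.token_bytes(32)       # held by the prover, never disclosed
private_hash = make_keyed_hash(key)

# The alternative: keep the standard hash but mix a secret, never-disclosed
# nonce quad into the dataset before canonicalizing (illustrative triple):
nonce_quad = ("_:b0", "https://example.org/#nonce", secrets.token_hex(16))
```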
Member

This is a similar remark, with the difference that we have not, in my recollection, discussed adding a nonce to the algorithm (presumably to be used with the hash function). I think it would be possible and probably worthwhile to allow that, but that would require a separate WG resolution (and further test cases, by the way).

Contributor

It may be better to just make it clear that other specs may parameterize the hash function using whatever they like -- noting that the result will no longer be URDNA2015 (or whatever we end up calling it), but instead some variant.

Member

Naming aside, I would expect all parameters to have a default: probably SHA-256 for the hash, no nonce, and just the good old Unicode sorting as now. I.e., if we make these additions with defaults, we would not harm deployed code (I presume that is your worry, @dlongley). Defining another spec that does exactly the same thing but parameterizes the hash function sounds like a bit of overkill to me...

@iherman (Member) commented May 15, 2023

> I note that some of the text discusses alternatives (HMAC instead of Hash, re-ordering quads) that don't represent the decisions actually taken by the spec.

Yep, I referred to those problems in my review as well. For the HMAC (or the use of a nonce) we may be able to extend the algorithm relatively easily to incorporate those; actually, I believe we have already discussed the alternative at some point. In fact, this privacy analysis might be the very reason why we should make those changes.

As for the re-ordering, that is less clear. But... what about the possibility of providing, as a parameter to the algorithm, an alternative sorting function? In most (all?) environments I have worked with, the sorting function can be parameterized with such a functional parameter, so this would not put a huge load on the algorithm overall, and it might also be a way to de-correlate the data presented to the outside world...

@dlongley what do you think?

@dlongley (Contributor) commented:

> I note that some of the text discusses alternatives (HMAC instead of Hash, re-ordering quads) that don't represent the decisions actually taken by the spec.

> Yep, I referred to those problems in my review as well. For the HMAC (or the use of a nonce) we may be able to extend the algorithm relatively easily to incorporate those; actually, I believe we have already discussed the alternative at some point. In fact, this privacy analysis might be the very reason why we should make those changes.

I commented above that we should probably just make it clearer that the hash function can be replaced with something else by another spec if desired (but then the result should not be considered URDNA2015, and should be detailed as some other variant in that other spec).

Notably, whether one should use a different hash function within the canonicalization algorithm, or outside of it on only the resulting bnode labels, is an important trade-off. In the former case, you expose the number of blank nodes in the total dataset, because the bnode labels are sequentially numbered. This is less of a problem for unlinkability but more of a problem for minimizing data leakage. In the latter case, leaking this information can be avoided by using the hash values themselves as labels; however, that is harmful to unlinkability, as the hash values are uniquely identifying.

Which of these choices makes sense depends on the use case.
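
To make the trade-off concrete, here is a toy sketch (the labels and helper code are hypothetical, not actual spec output):

```python
import hashlib
import hmac
import secrets

key = secrets.token_bytes(32)

# Suppose canonicalization of a dataset yields three sequential labels.
canonical_labels = ["c14n0", "c14n1", "c14n2"]

# Former case: even with an HMAC used *inside* the algorithm, labels stay
# sequential, so disclosing "c14n2" alone reveals the dataset holds at
# least three blank nodes (data leakage).

# Latter case: use HMAC digests of the labels *as* the labels. The count
# no longer leaks, but each digest is stable for a given key, so it can
# link two presentations of the same credential (harms unlinkability).
hashed_labels = [
    hmac.new(key, label.encode(), hashlib.sha256).hexdigest()[:16]
    for label in canonical_labels
]
print(hashed_labels)
```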

> As for the re-ordering, that is less clear. But... what about the possibility of providing, as a parameter to the algorithm, an alternative sorting function? In most (all?) environments I have worked with, the sorting function can be parameterized with such a functional parameter, so this would not put a huge load on the algorithm overall, and it might also be a way to de-correlate the data presented to the outside world...

> @dlongley what do you think?

I don't think using an alternative sorting function is going to work well. With what we have today, other specs can make use of the abstract dataset output and bnode label mapping to produce whatever sorted (or not) ordering they want in the output; it just won't match the "canonical serialization", which I think is fine.

Ultimately, for the unlinkability case, you want a template that causes "the same order" to always be used anyway -- which is quite different from using a "custom sort function", at least conceptually. For this, I'd expect you would define your own input bnode labels via a template (the same for everyone) and then get the mapping from your input labels to the canonical bnode labels from the canonicalization algorithm. Next, you'd generate a reverse mapping for the specific signed document (e.g., a VC) to travel along with its holder -- or to be recomputed later if the "template" were made publicly available. A similar mapping could be used for the quad order itself.
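
A rough sketch of that template-and-mapping flow, with a stand-in `canonicalize` (the real algorithm's issued-identifiers map would supply the forward mapping; everything here is illustrative):

```python
def canonicalize(input_labels):
    # Stand-in for the real algorithm: returns {input_label: canonical_label}.
    return {label: f"c14n{i}" for i, label in enumerate(sorted(input_labels))}

# Everyone derives the same input bnode labels from the shared template...
template_labels = ["_:subject", "_:degree", "_:issuer"]

# ...and canonicalization maps them to canonical labels.
forward = canonicalize(template_labels)

# The reverse mapping travels with the signed document (e.g., a VC), or is
# recomputed later if the template is public, letting a holder restore the
# template labeling. A similar mapping could cover quad order.
reverse = {canonical: original for original, canonical in forward.items()}
```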

In fact, it may make more sense, in the unlinkability case, to invent a new URI scheme that is used to identify "herd subjects" instead of "unique subjects" -- and avoid using bnodes altogether. The whole point is to not be able to link this data back to anything from a prior or future presentation. This may be tricky or impossible for lists, but perhaps using those opposes the unlinkability use cases anyway.

I think this area needs more research and a better understanding of real-world use cases, but I don't expect our timelines to work out for that. I find that in this area there's often a lot of focus on the power of math and cryptography to achieve unlinkability (which is quite cool!), but less focus on the practical usage, the required trust or threat models, and the potential impact on people's freedom to use the devices and service providers they want... and so on.

The selective disclosure (without regard to unlinkability) use cases are much clearer at today's date, IMO. However, I do think that we're close to striking the right balance between having a simple enough primitive and having enough knobs -- in order to support both endeavors.

@dlongley (Contributor) commented May 15, 2023

To elaborate more on the general issue here:

In order to perform canonicalization, i.e., "make the dataset look the same", we use the data within the dataset itself. We have no other input data. This necessarily means that the output for those things that we canonicalize (i.e., blank node labels and total ordering) will depend on the input.

Notably, this is desirable in all of the VC use cases as a step in some transformation process. We want people to be able to write software that, when run on machine A or B, or when taking in a dataset in format X or Y, will be able to modify it to get to the same starting point. However, it may be undesirable for this to be the final step prior to applying cryptography. I think this is the main tripping point, and we should highlight this apparent inversion of intuition.

The additional transformation steps often involve reestablishing independence from the input data in how the dataset is expressed -- because this coupling can have undesirable privacy characteristics. This means: blank node labels and the total ordering should be made to not depend on the input data (or at least indistinguishably appear not to). Note that this is not just "undoing the work of canonicalization" -- as a baseline has to be established before the work can proceed or else different software may produce different results.

So, what is needed from our spec to accomplish this, I think, are mappings that connect what is done in the canonicalization algorithm to the original inputs, so algorithm authors can make whatever adjustments they need. We have also parameterized the internally used hash function to allow it to be replaced -- but, as I noted above, changing it to something like an HMAC will only make bnode labels partially independent of the data in the dataset. I think only applying an HMAC (or similar) function after canonicalization has finished can guarantee full independence.
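
For instance, a post-processing pass of this kind might look as follows (a hedged sketch; the `relabel` helper and the `u`-prefixed label shape are made up for illustration):

```python
import hashlib
import hmac
import re

def relabel(canonical_nquad: str, key: bytes) -> str:
    # After canonicalization, replace each canonical label (c14n0, c14n1,
    # ...) with an HMAC of that label, so the final labels no longer carry
    # any learnable dependence on the dataset contents.
    return re.sub(
        r"_:c14n\d+",
        lambda m: "_:u" + hmac.new(key, m.group(0).encode(), hashlib.sha256).hexdigest()[:16],
        canonical_nquad,
    )
```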

Similarly, applying a sort order that is based on the input data via a custom comparator (as an example) will also "be operating on the data". Of course, such a comparator could do something like HMAC each N-Quad as it is sorted to produce a pseudorandom but stable ordering -- but this could also be applied to the normalized dataset instead of using the default canonical N-Quads serialization and ordering, without us needing to provide the capability from "within" our canonical serialization description.
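
That post-hoc ordering idea fits in a few lines (again just a sketch, applied outside the algorithm; the function name is hypothetical):

```python
import hashlib
import hmac

def pseudorandom_order(nquads: list[str], key: bytes) -> list[str]:
    # Sort by the HMAC of each canonical N-Quad: deterministic for a given
    # key (stable), but uninformative about the default lexicographic order
    # to anyone without the key.
    return sorted(nquads, key=lambda q: hmac.new(key, q.encode(), hashlib.sha256).digest())
```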

Given that reestablishing independence from the data is, as I see it, the driving factor here, I don't think adding even more knobs to be run within the canonicalization algorithm -- which makes its decisions based on the data -- is really helpful. It's better to ensure we can output mappings and don't do too much in a single step, by separating bnode label computation from canonical serialization so they can be called independently. But... I feel this is where we've ended up already (once issue #89 is addressed).

@iherman (Member) commented May 16, 2023

> Given that reestablishing independence from the data is, as I see it, the driving factor here, I don't think adding even more knobs to be run within the canonicalization algorithm -- which makes its decisions based on the data -- is really helpful. It's better to ensure we can output mappings and don't do too much in a single step, by separating bnode label computation from canonical serialization so they can be called independently. But... I feel this is where we've ended up already (once issue #89 is addressed).

@dlongley I do not have any issues with what you say (not really having experience with these problems anyway 😀). I have come at this from a different angle. The privacy section, as proposed in this PR, seems to suggest that the problems can be solved within the framework of the current specification. If we decide not to add more knobs (which is fine with me), then the relevant texts in the proposal (different hashes, use of a nonce or not, sorting) should be formulated to make it clear that a privacy-preserving application must use a slightly different algorithm than the one defined by the WG. The consistency of the specification requires that. On the other hand, with my W3C staff hat on, my worry is that the privacy review will pick up on this and will require us to make such modifications (i.e., adding "knobs") to make the algorithm defined by this WG properly prepared for privacy issues.

At this point I would leave it to you and @yamdan to figure out a proper formulation to possibly clarify this; again, my knowledge and experience on these issues are not fool-proof enough...

@peacekeeper (Contributor) commented:

We discussed this issue during the 16th May 2023 WG call, and there seemed to be consensus that:

  1. the WG felt that the content created in this PR by @yamdan was very valuable. (Thanks!)
  2. the Privacy Considerations section in rdf-canon should be relatively minimal, for example mentioning some general privacy impacts that could arise from using this work.
  3. but much of the detailed analysis and examples in the PR should be moved "higher up in the spec stack", i.e., to Data Integrity, since many Privacy Considerations have less to do with rdf-canon per se and more to do with how rdf-canon is used in specific ways.

So the TODOs are:

  • (to address 3. above): Instead of merging this PR into rdf-canon, create an equivalent PR for Data Integrity.
  • (to address 2. above): Write a new PR for rdf-canon that mentions some general considerations without going into details of particular applications. (Volunteers?)

@yamdan yamdan marked this pull request as draft May 17, 2023 07:43
@yamdan (Contributor, Author) commented May 19, 2023

> • (to address 2. above): Write a new PR for rdf-canon that mentions some general considerations without going into details of particular applications. (Volunteers?)

I have tried to write a new PR that mentions more general privacy considerations: #101

> • (to address 3. above): Instead of merging this PR into rdf-canon, create an equivalent PR for Data Integrity.

Will do it later...

@yamdan (Contributor, Author) commented May 24, 2023

Discussed 2023-05-24. We had consensus to close this PR, which has been superseded by PR #101, as mentioned in #99 (comment).
I will make a new PR for the Data Integrity based on the above discussion.

@yamdan yamdan deleted the yamdan-privacy-considerations branch May 24, 2023 15:05