Add privacy considerations #99

Closed
wants to merge 13 commits into from
246 changes: 244 additions & 2 deletions spec/index.html
Original file line number Diff line number Diff line change
@@ -2674,8 +2674,12 @@ <h2>Serialization</h2>
<section id="privacy-considerations" class="informative">
<h2>Privacy Considerations</h2>

<section>
<h3>Selective Disclosure Schemes</h3>
<p>The privacy considerations described here matter mainly when the canonicalization scheme is used for
privacy-respecting signed RDF datasets; for other use cases, the risks are likely acceptable. One example of the
former is a verifiable credential with selective disclosure.</p>

<section id="privacy-considerations-leakage">
<h3>Data Leakage in Selective Disclosure Schemes</h3>

<p class="issue" data-number="70" title="Dataset structure might reveal information">
Add text that warns implementers using this specification in selective
@@ -2686,8 +2690,246 @@ <h3>Selective Disclosure Schemes</h3>
which might be enough to disclose information beyond what the discloser intended to
disclose.
</p>

<p>Selective disclosure is the ability for someone to share only some of the statements from a signed dataset, without
harming the ability of the recipient to verify the authenticity of those selected statements.</p>

<p>The output of the canonicalization algorithm described in this specification may leak partial
information about undisclosed statements and help the adversary correlate the original and disclosed datasets.</p>

<section id="privacy-considerations-leakage-labeling">
<h4>Possible Leakage via Canonical Labeling</h4>

<p>If a dataset contains at least two blank nodes, the canonical labeling can be exploited to guess an undisclosed
quad in the dataset.</p>

<p>For example, let us assume we have the following dataset to be signed. (Note: this person is fictitious, prepared
only to make this example work.)</p>

<pre id="ex-pc-leakage-labeling-original-dataset" class="example" data-transform="updateExample" title="Original Dataset">
<!--
_:b0 <http://schema.org/address> _:b1 .
_:b0 <http://schema.org/familyName> "Jarrett" .
_:b0 <http://schema.org/gender> "Female" . # gender === Female
_:b0 <http://schema.org/givenName> "Ali" .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:b1 <http://schema.org/addressCountry> "United States" .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
-->
</pre>

<p>Using <a href="#canon-algorithm" class="sectionRef"></a>, we can obtain the <a>serialized canonical form</a> of the
<a>normalized dataset</a>, where all the blank nodes are serialized using the canonical labels.</p>

<pre id="ex-pc-leakage-labeling-normalized-dataset" class="example" data-transform="updateExample" title="Normalized Dataset">
<!--
_:c14n0 <http://schema.org/addressCountry> "United States" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/familyName> "Jarrett" .
_:c14n1 <http://schema.org/gender> "Female" . # gender === Female
_:c14n1 <http://schema.org/givenName> "Ali" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
-->
</pre>

<p>The signer can generate a signature for the dataset by first hashing each statement and then signing them
using a multi-message digital signature scheme like BBS+. The resulting dataset with signature is passed to the
holder, who can control whether or not to disclose each statement while maintaining their verifiability.</p>

<p>Let us say that the holder wants to show her attributes except for `gender` to a verifier. Then the holder should
disclose the following partial dataset. (Note: proofs are omitted here for brevity.)</p>

<pre id="ex-pc-leakage-labeling-disclosed-dataset" class="example" data-transform="updateExample" title="Disclosed Dataset">
<!--
_:c14n0 <http://schema.org/addressCountry> "United States" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/familyName> "Jarrett" .
########### 5th statement is unrevealed ##########
_:c14n1 <http://schema.org/givenName> "Ali" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
-->
</pre>

<p>However, in this example, anyone can guess the unrevealed statement by exploiting the canonical labels and order.</p>

<p>Since the dataset was sorted in the canonical order, we can tell that the hidden statement must start with
`_:c14n1 &lt;http://schema.org/[f-g]`, which helps us guess that the hidden predicate is
`&lt;http://schema.org/gender&gt;` with high probability. Alternatively, we can assume that the guesser already has
such knowledge via the public credential schema.</p>

<p>Then, if the canonical labeling produces different results depending on the gender value, we can use it to deduce the
gender value. In fact, this example produces different results depending on whether the gender is `Female` or `Male`.
(Note: other gender values are ignored here for brevity.)</p>

<p>The following example shows that `gender` = `Male` yields a different canonical labeling.</p>

<pre id="ex-pc-leakage-labeling-hypothetical-normalized-dataset" class="example" data-transform="updateExample" title="Hypothetical Normalized Dataset">
<!--
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/familyName> "Jarrett" .
_:c14n0 <http://schema.org/gender> "Male" . # gender === Male
_:c14n0 <http://schema.org/givenName> "Ali" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
-->
</pre>

<p>If `gender` had the value `Male`, the verifier would have obtained the following dataset, which differs from the
revealed dataset. Therefore, the verifier can conclude that the `gender` is `Female`.</p>

<pre id="ex-pc-leakage-labeling-hypothetical-disclosed-dataset" class="example" data-transform="updateExample" title="Hypothetical Disclosed Dataset">
<!--
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/familyName> "Jarrett" .
########### 3rd statement is unrevealed ##########
_:c14n0 <http://schema.org/givenName> "Ali" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
-->
</pre>

<p>Note that the same approach can be used to guess values with more than two possibilities, as long as the range of
possible values is small enough for a guesser to try them all.</p>
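<p>The guessing attack above can be sketched as follows. This is only an illustration: `toy_canonicalize` is a
hypothetical, simplified stand-in that labels blank nodes by a first-degree hash alone and assumes every blank node
gets a unique hash; it is not the canonicalization algorithm of this specification. `build_dataset`, `disclose` and
`guess_hidden_value` are likewise names invented for this sketch.</p>

```python
import hashlib
import re

BNODE = re.compile(r"_:[A-Za-z0-9]+")
S = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def first_degree_hash(quads, node):
    """Hash the quads that mention `node`, masking all blank node labels."""
    lines = []
    for q in quads:
        if node in BNODE.findall(q):
            # The reference node becomes _:a, every other blank node _:z.
            masked = BNODE.sub(lambda m: "_:a" if m.group() == node else "_:z", q)
            lines.append(masked + "\n")
    return hashlib.sha256("".join(sorted(lines)).encode("utf-8")).hexdigest()


def toy_canonicalize(quads):
    """Relabel blank nodes in hash order, then sort in code point order.

    Simplification (assumption): every blank node has a unique hash.
    """
    nodes = sorted({n for q in quads for n in BNODE.findall(q)})
    ordered = sorted(nodes, key=lambda n: first_degree_hash(quads, n))
    labels = {n: f"_:c14n{i}" for i, n in enumerate(ordered)}
    return sorted(BNODE.sub(lambda m: labels[m.group()], q) for q in quads)


def build_dataset(gender):
    """The original dataset from the example, with a candidate gender value."""
    return [
        f'_:b0 <{S}address> _:b1 .',
        f'_:b0 <{S}familyName> "Jarrett" .',
        f'_:b0 <{S}gender> "{gender}" .',
        f'_:b0 <{S}givenName> "Ali" .',
        f'_:b0 <{RDF_TYPE}> <{S}Person> .',
        f'_:b1 <{S}addressCountry> "United States" .',
        f'_:b1 <{RDF_TYPE}> <{S}PostalAddress> .',
    ]


def disclose(canonical_quads, hidden_predicate):
    """Keep statement positions, but replace hidden statements with None."""
    return [None if f"<{hidden_predicate}>" in q else q for q in canonical_quads]


def guess_hidden_value(observed, candidates, hidden_predicate):
    """Return the candidates whose disclosed form equals the observed one."""
    return [
        c for c in candidates
        if disclose(toy_canonicalize(build_dataset(c)), hidden_predicate) == observed
    ]


# What the verifier sees when the real value is "Female".
observed = disclose(toy_canonicalize(build_dataset("Female")), S + "gender")
consistent = guess_hidden_value(observed, ["Female", "Male"], S + "gender")
print(consistent)
```

<p>Whenever the labeling changes with the candidate value, the list of consistent candidates shrinks, and a
single remaining candidate reveals the hidden value.</p>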

<p>By making the canonicalization process private, we can prevent a brute-forcing attacker from trying to see the
labeling change by trying multiple possible attribute values.
For example, we can use an HMAC instead of a hash function in the canonicalization algorithm. Alternatively, we can add
Member

I am not sure I understand the remark

we can use a HMAC instead of a hash function

At this moment, the usage of SHA256 is cast in concrete for the canonicalization algorithm. I am not familiar with how HMAC works, but would that mean using a different function instead of SHA256?

Note that we discussed, at some point, allowing the user to set the hashing algorithm instead of fixing it (which is entirely possible because the algorithm does not depend on the specificities of SHA256) but I am not sure we have had a WG resolution on that. I guess if we go along with this proposal, then that issue should be raised explicitly.

a secret random nonce (always undisclosed) into the dataset.
Member

This is a similar remark, with the difference that we have not, in my recollection, discussed adding a nonce to the algorithm (I presume to be used with the hash function). I think it would be possible and probably worthwhile to allow that, but that would require a separate WG resolution (and further test cases, b.t.w.)

Contributor

It may be better to just make it clear that other specs may parameterize the hash function using whatever they like -- noting that it will no longer be URDNA2015 (or whatever we end up calling it), but instead some variant.

Member

Naming aside, I would expect all parameters would have a default, probably using SHA256 for hash, no nonce, and sorting just the good old unicode sorting as now. Ie, if we do these additions with defaults we would not harm deployed code (I presume that is your worry, @dlongley). Defining another spec that does exactly the same but parametrizing the hash function sounds a bit of an overkill to me...

Note that these workarounds force dataset issuers and holders to manage shared secrets securely.
We also note that these workarounds adversely affect the unlinkability described below because canonical labeling now
varies depending on the secret shared by the issuer and the holder, which will help correlate them.</p>
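<p>As a rough illustration of the HMAC workaround, the sketch below contrasts a public hash with a keyed one.
Swapping the hash function is a variant, not the algorithm defined in this specification, and the names
`public_hash`, `keyed_hash` and `shared_key` are assumptions made for this sketch.</p>

```python
import hashlib
import hmac
import secrets


def public_hash(data: str) -> str:
    """Plain SHA-256: anyone can recompute it for any guessed statement."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def keyed_hash(key: bytes, data: str) -> str:
    """HMAC-SHA-256: recomputing it requires the secret key."""
    return hmac.new(key, data.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secret shared only by the issuer and the holder.
shared_key = secrets.token_bytes(32)
statement = '_:a <http://schema.org/gender> "Female" .\n'

# A guesser without the key can still compute the public hash ...
print(public_hash(statement))
# ... but cannot reproduce the keyed hash used to derive the labels.
print(keyed_hash(shared_key, statement))
```

<p>The issuer and holder must keep `shared_key` secret, and the resulting labels depend on that key, which is the
unlinkability cost noted above.</p>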
</section>

<section id="privacy-considerations-leakage-sorting">
<h4>Possible Leakage via Canonical Sorting</h4>

<p>The canonical order can leak unrevealed information even without canonical labelings.</p>

<p>Let us assume that the holder has the following signed dataset, sorted in the canonical (code-point) order.</p>

<pre id="ex-pc-leakage-sorting-signed-dataset" class="example" data-transform="updateExample" title="Signed Dataset">
<!--
:a <http://schema.org/children> "Albert" .
:a <http://schema.org/children> "Alice" .
:a <http://schema.org/children> "Allie" .
:a <http://schema.org/name> "John Smith" .
-->
</pre>

<p>If the holder wants to hide the statement for their second child for any reason, the disclosed dataset now looks like
this:</p>

<pre id="ex-pc-leakage-sorting-disclosed-dataset" class="example" data-transform="updateExample" title="Disclosed Dataset">
<!--
:a <http://schema.org/children> "Albert" .
########### 2nd statement is unrevealed ##########
:a <http://schema.org/children> "Allie" .
:a <http://schema.org/name> "John Smith" .
-->
</pre>

<p>Knowing that these statements are sorted in the canonical order, we can guess that the hidden statement must start
with `:a &lt;http://schema.org/children&gt; "Al`, which leaks the subject (`:a`), predicate
(`&lt;http://schema.org/children&gt;`) and the first two letters of the object (`"Al"`) in the hidden statement.</p>
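<p>The inference above amounts to taking the common prefix of the two statements surrounding the gap: any string
that sorts between them in code point order must share that prefix. A minimal sketch, where the function name
`hidden_statement_prefix` is an assumption made for illustration:</p>

```python
import os


def hidden_statement_prefix(before: str, after: str) -> str:
    """Prefix shared by every statement that sorts between the two neighbors."""
    # os.path.commonprefix compares strings character by character.
    return os.path.commonprefix([before, after])


before = ':a <http://schema.org/children> "Albert" .'
after = ':a <http://schema.org/children> "Allie" .'

print(hidden_statement_prefix(before, after))
# → :a <http://schema.org/children> "Al
```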

<p>To avoid this leakage, the dataset issuer can randomly shuffle the normalized statements before signing and issuing
them to the holder, preventing others from guessing undisclosed information from the canonical order.
However, similar to the workarounds mentioned above, this workaround also adversely affects unlinkability. This is
because there are $n!$ different permutations for shuffling $n$ statements, and whichever one is used will help
correlate the dataset.</p>
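<p>A minimal sketch of the shuffling workaround, assuming the issuer shuffles with a cryptographically secure
random source before signing. The function name `shuffle_statements` is hypothetical and not part of this
specification.</p>

```python
import math
import secrets


def shuffle_statements(statements):
    """Return a randomly permuted copy of the normalized statements."""
    shuffled = list(statements)
    # SystemRandom draws from the operating system's secure random source.
    secrets.SystemRandom().shuffle(shuffled)
    return shuffled


normalized = [
    ':a <http://schema.org/children> "Albert" .',
    ':a <http://schema.org/children> "Alice" .',
    ':a <http://schema.org/children> "Allie" .',
    ':a <http://schema.org/name> "John Smith" .',
]

to_sign = shuffle_statements(normalized)
# Number of possible orders, each of which can act as a correlating fingerprint.
print(math.factorial(len(normalized)))
# → 24
```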
</section>
</section>

<section id="privacy-considerations-unlinkability">
<h3>Unlinkability</h3>

<p>Unlinkability ensures that no correlatable data are used in a signed dataset while still providing some level of
trust, the sufficiency of which must be determined by each verifier. </p>

<p>While canonical sorting works better for unlinkability, canonical labeling can be exploited to break it.
The total number of canonical labelings for a dataset with $n$ blank nodes is $n!$, which is not controllable by the
issuer.
It means that the herd constructed as a result of selective disclosure will be split into $n!$ pieces due to the
canonical labeling, which reduces unlinkability.</p>

<p>For example, let us assume that an employee of the small company "example.com" shows their employee ID dataset
without their name like this:</p>

<pre id="ex-pc-unlinkability-disclosed-dataset" class="example" data-transform="updateExample" title="Disclosed Dataset">
<!--
########### 1st statement is unrevealed ##########
_:c14n0 <http://schema.org/worksFor> _:c14n1 .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/address> _:c14n2 .
_:c14n1 <http://schema.org/geo> _:c14n3 .
_:c14n1 <http://schema.org/name> "example.com" .
_:c14n2 <http://schema.org/addressCountry> "United States" .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .
-->
</pre>

<p>The verifier can always trace this person without knowing their name if this company has only three employees with
the following employee ID datasets.</p>

<aside id="ex-pc-unlinkability-normalized-dataset" class="example" title="Normalized Datasets">

Normalized dataset about the first employee:
<pre id="ex-pc-unlinkability-normalized-dataset-1" data-transform="updateExample">
<!--
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/geo> _:c14n3 .
_:c14n0 <http://schema.org/name> "example.com" .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n2 <http://schema.org/name> "Jayden Doe" .
_:c14n2 <http://schema.org/worksFor> _:c14n0 .
_:c14n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .
-->
</pre>

Normalized dataset about the second employee:
<pre id="ex-pc-unlinkability-normalized-dataset-2" data-transform="updateExample">
<!--
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/geo> _:c14n2 .
_:c14n0 <http://schema.org/name> "example.com" .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n2 <http://schema.org/latitude> "0.0" .
_:c14n2 <http://schema.org/longitude> "0.0" .
_:c14n3 <http://schema.org/name> "Morgan Doe" .
_:c14n3 <http://schema.org/worksFor> _:c14n0 .
_:c14n3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
-->
</pre>

Normalized dataset about the third employee:
<pre id="ex-pc-unlinkability-normalized-dataset-3" data-transform="updateExample">
<!--
_:c14n0 <http://schema.org/name> "Johnny Smith" .
_:c14n0 <http://schema.org/worksFor> _:c14n1 .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/address> _:c14n2 .
_:c14n1 <http://schema.org/geo> _:c14n3 .
_:c14n1 <http://schema.org/name> "example.com" .
_:c14n2 <http://schema.org/addressCountry> "United States" .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .
-->
</pre>
</aside>

<p>The canonicalization in this example produces different labelings for these three employees, which helps anyone to
correlate their activities even if they do not reveal their names in the dataset.</p>
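<p>The correlation can be sketched as follows, assuming the party doing the tracing already knows the three
normalized datasets and that exactly one statement is hidden. The helpers `company`, `person` and `link` are
hypothetical names used only to rebuild the example data compactly.</p>

```python
S = "http://schema.org/"
T = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def company(c, a, g):
    """Statements about company node `c` with address node `a` and geo node `g`."""
    return [
        f'_:{c} <{S}address> _:{a} .',
        f'_:{c} <{S}geo> _:{g} .',
        f'_:{c} <{S}name> "example.com" .',
        f'_:{a} <{S}addressCountry> "United States" .',
        f'_:{g} <{S}latitude> "0.0" .',
        f'_:{g} <{S}longitude> "0.0" .',
    ]


def person(p, c, name):
    """Statements about person node `p` working for company node `c`."""
    return [
        f'_:{p} <{S}name> "{name}" .',
        f'_:{p} <{S}worksFor> _:{c} .',
        f'_:{p} <{T}> <{S}Person> .',
    ]


# The three normalized datasets from the example, in canonical order.
employees = {
    "Jayden Doe": sorted(company("c14n0", "c14n1", "c14n3") + person("c14n2", "c14n0", "Jayden Doe")),
    "Morgan Doe": sorted(company("c14n0", "c14n1", "c14n2") + person("c14n3", "c14n0", "Morgan Doe")),
    "Johnny Smith": sorted(company("c14n1", "c14n2", "c14n3") + person("c14n0", "c14n1", "Johnny Smith")),
}


def link(disclosed, known):
    """Names whose dataset equals the disclosure after hiding one statement."""
    return [
        name for name, quads in known.items()
        if any(quads[:k] + quads[k + 1:] == disclosed for k in range(len(quads)))
    ]


# The disclosed dataset: the third employee's statements without the name.
disclosed = [q for q in employees["Johnny Smith"] if "Johnny Smith" not in q]
print(link(disclosed, employees))
# → ['Johnny Smith']
```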

<p>By determining some "template" for each anonymous set (or herd) and fixing the canonical labeling and canonical order
used in the anonymous set, we can achieve a certain degree of unlinkability.</p>
</section>
</section>

<section id="security-considerations" class="informative">