Some privacy considerations #84

yamdan · 2023-03-13T09:35:20Z

As a starting point for preparing Privacy Considerations (#70), I am trying to organize some existing discussions about privacy aspects of the URDNA2015, including:

Dave's comment in #4
Blank node labels may leak information · Issue #60 · w3c-ccg/ldp-bbs2020
Academic articles by Vasilis Kalos (1), (2)

While the discussions here mainly focus on URDNA2015, I believe we can apply similar arguments to the other RDF canonicalization algorithms.

Privacy considerations here are primarily worth discussing when the canonicalization scheme is used for privacy-respecting signed RDF dataset.

The following issues are worth discussing when the canonicalization scheme is used for privacy-respecting signed RDF datasets and are likely acceptable for other use cases. One of the former examples is a verifiable credential with selective disclosure.

Selective disclosure is the ability for someone to share only some of the statements from a signed dataset, without harming the ability of the recipient to verify the authenticity of those selected statements. (Note: copied from Dave's comment in #4)

The normalized dataset, the output of the canonicalization algorithm described in this specification, may leak partial information about undisclosed statements and help the adversary correlate the original and disclosed datasets.

a) possible leakage via canonical labeling

If a dataset contains at least two blank nodes, the canonical labeling can be exploited to guess the undisclosed quad in the dataset.

For example, let us assume we have the following dataset to be signed. (Note: this person is fictitious, prepared only to make this example work.)

# original dataset
_:b0 <http://schema.org/address> _:b1 .
_:b0 <http://schema.org/familyName> "Jarrett" .
_:b0 <http://schema.org/gender> "Female" .  # gender === Female
_:b0 <http://schema.org/givenName> "Ali" .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:b1 <http://schema.org/addressCountry> "United States" .
_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .

Using URDNA2015, we can obtain the normalized dataset with canonical labels sorted in the canonical (code-point) order.

# normalized dataset
_:c14n0 <http://schema.org/addressCountry> "United States" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/familyName> "Jarrett" .
_:c14n1 <http://schema.org/gender> "Female" .  # gender === Female
_:c14n1 <http://schema.org/givenName> "Ali" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

The signer (or issuer) can generate a signature for the dataset by first hashing each statement and then signing them using a multi-message digital signature scheme like BBS+. The resulting dataset with signature is passed to the holder, who can control whether or not to disclose each statement while maintaining their verifiability.

Let us say that the holder wants to show her attributes except for gender to a verifier. Then the holder should disclose the following partial dataset. (Note: proofs omitted here for brevity)

# disclosed dataset
_:c14n0 <http://schema.org/addressCountry> "United States" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .
_:c14n1 <http://schema.org/address> _:c14n0 .
_:c14n1 <http://schema.org/familyName> "Jarrett" .
########### 5th statement is unrevealed ##########
_:c14n1 <http://schema.org/givenName> "Ali" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

However, in this example, anyone can guess the unrevealed statement by exploiting the canonical labels and order.

Since the dataset was sorted in the canonical order, we can get to know that the hidden statement must start with _:c14n1 <http://schema.org/[f-g], which helps us guess that the hidden predicate is <http://schema.org/gender> with high probability.
Alternatively, we can assume that the guesser already has such knowledge via the public credential schema.

Then, if the canonical labeling produces different results depending on the gender value, we can use it to deduce the gender value.
In fact, this example produces different results depending on whether the gender is Female or Male.
(Note: I ignored the other types of gender just for brevity)

The following example shows that "gender = Male" yields different canonical labeling.

# hypothetical normalized dataset
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/familyName> "Jarrett" .
_:c14n0 <http://schema.org/gender> "Male" .  # gender === Male
_:c14n0 <http://schema.org/givenName> "Ali" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .

So the verifier should have obtained the following dataset if gender had the value Male, which differs from the revealed dataset, so the verifier can conclude that the gender is Female.

# hypothetical disclosed dataset
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/familyName> "Jarrett" .
########### 3rd statement is unrevealed ##########
_:c14n0 <http://schema.org/givenName> "Ali" .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PostalAddress> .

Note that we can use the same approach to guess non-boolean values if the range of possible values is still a reasonable (small) size for a guesser to try all the possibilities.

By making the canonicalization process private, we can prevent a brute-forcing attacker from trying to see the labeling change by trying multiple possible attribute values.
For example, we can use a HMAC instead of a hash function in the canonicalization algorithm. Alternatively, we can add a secret random nonce (always undisclosed) into the dataset.
Note that these workarounds force dataset issuers and holders to manage shared secrets securely.
We also note that these workarounds adversely affect the unlinkability described below because canonical labeling now varies depending on the secret shared by the issuer and the holder, which will help correlate them.

b) possible leakage via canonical sorting

The canonical order can leak unrevealed information even without canonical labelings.

Let us assume that the holder has the following signed dataset, sorted in the canonical (code-point) order.

:a <http://schema.org/children> "Albert" .
:a <http://schema.org/children> "Alice" .
:a <http://schema.org/children> "Allie" .
:a <http://schema.org/name> "John Smith" .

If the holder wants to hide the statement for their second child for any reason, the disclosed dataset now looks like this:

:a <http://schema.org/children> "Albert" .
########### 2nd statement is unrevealed ##########
:a <http://schema.org/children> "Allie" .
:a <http://schema.org/name> "John Smith" .

Knowing that these statements are sorted in the canonical order, we can guess that the hidden statement must start with :a <http://schema.org/children> "Al, which leaks the subject (:a), predicate (<http://schema.org/children>) and the first two letters of the object ("Al") in the hidden statement.

To avoid this leakage, the dataset issuer can randomly shuffle the normalized statements before signing and issuing them to the holder, preventing others from guessing undisclosed information from the canonical order.
However, similar to the workarounds mentioned above, this workaround also adversely affects unlinkability. This is because there are $n!$ different permutations for shuffling $n$ statements, and whichever one is used will help correlate the dataset.

c) possible linking via canonical labeling

Unlinkability ensures that no correlatable data are used in a signed dataset while still providing some level of trust, the sufficiency of which must be determined by each verifier. (Note: based on the description in the VC Data Integrity spec)

While canonical sorting works better for unlinkability, canonical labeling can be exploited to break it.
The total number of canonical labelings for a dataset with $n$ blank nodes is $n!$, which is not controllable by the issuer.
It means that the herd constructed as a result of selective disclosure will be split into $n!$ pieces due to the canonical labeling, which reduces unlinkability.

For example, let us assume that an employee of the small company "example.com" shows its employee ID dataset without their name like this:

# disclosed dataset
########### 1st statement is unrevealed ##########
_:c14n0 <http://schema.org/worksFor> _:c14n1 .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/address> _:c14n2 .
_:c14n1 <http://schema.org/geo> _:c14n3 .
_:c14n1 <http://schema.org/name> "example.com" .
_:c14n2 <http://schema.org/addressCountry> "United States" .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .

The verifier can always trace this person without knowing their name if this company has only three employees with the following employee ID datasets.

# normalized dataset 1
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/geo> _:c14n3 .
_:c14n0 <http://schema.org/name> "example.com" .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n2 <http://schema.org/name> "Jayden Doe" .
_:c14n2 <http://schema.org/worksFor> _:c14n0 .
_:c14n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .

# normalized dataset 2
_:c14n0 <http://schema.org/address> _:c14n1 .
_:c14n0 <http://schema.org/geo> _:c14n2 .
_:c14n0 <http://schema.org/name> "example.com" .
_:c14n1 <http://schema.org/addressCountry> "United States" .
_:c14n2 <http://schema.org/latitude> "0.0" .
_:c14n2 <http://schema.org/longitude> "0.0" .
_:c14n3 <http://schema.org/name> "Morgan Doe" .
_:c14n3 <http://schema.org/worksFor> _:c14n0 .
_:c14n3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

# normalized dataset 3
_:c14n0 <http://schema.org/name> "Johnny Smith" .
_:c14n0 <http://schema.org/worksFor> _:c14n1 .
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n1 <http://schema.org/address> _:c14n2 .
_:c14n1 <http://schema.org/geo> _:c14n3 .
_:c14n1 <http://schema.org/name> "example.com" .
_:c14n2 <http://schema.org/addressCountry> "United States" .
_:c14n3 <http://schema.org/latitude> "0.0" .
_:c14n3 <http://schema.org/longitude> "0.0" .

The canonicalization in this example produces different labelings for these three employees, which helps anyone to correlate their activities even if they do not reveal their names in the dataset.

By determining some "template" for each anonymous set (or herd) and fixing the canonical labeling and canonical order used in the anonymous set, we can achieve a certain unlinkability.
Alternatively, we might be able to generate a kind of "on-the-fly" template proposed in Dave's comment in #4, which seems ineffective in some cases (see #8).

The text was updated successfully, but these errors were encountered:

dlongley · 2023-04-03T16:59:04Z

I think there is a spectrum here that moves from maximizing unlinkability on one side to minimizing data leakage on the other.

There are "pure" two choices that can be made on the respective extreme sides of the spectrum:

Maximize herd privacy for unlinkable datasets by having a single template and blank node ordering. However, this also maximizes data leakage because the blank node positioning is well known. Recommendation: Do not use sets or lists when using unlinkable credentials because structure will leak information. This recommendation would also apply to any other syntax that uses structural information (e.g., JSON arrays).
Minimize data leakage for selective disclosure datasets by having a totally unique and random blank node ordering per dataset. However, this also minimizes herd privacy (zero), uniquely identifying each dataset. Use this when the dataset would have been uniquely identified anyway (or if linkability is desirable or required).

Algorithms can be devised that would do something in between these two spectrums, such as limiting the possible blank node orderings to something greater than 1 but less than some other number that represents the total number of herds. This functionally appears as N-many different templates of which the issuer chooses and shares in confidence with the holder / prover, who does not share it. This could be implemented algorithmically as well ("templates" could be effectively procedurally generated).

However, it seems as though this kind of complexity for a "middle ground solution" is probably not needed:

In the case that unlinkable datasets are desired, the amount of data being disclosed must remain extremely small (e.g., one to two quads) or else the revealed data itself will break the desired unlinkability. No fancy crypto nor data formatting gymnastics will prevent this.

In the case that linkability is desirable or automatically occurs through the data being revealed, there is no need to build in the complexity of herds. Here the recommendation should be to focus on minimizing data leakage: use a graph-based format that uses randomly generated blank nodes. This should minimize (effectively zero) information leaked through data structure.

peacekeeper · 2023-05-10T15:03:08Z

Thanks @yamdan for volunteering to write a PR based on the content in this issue. This will help the WG to proceed with horizontal review processes.

dlongley mentioned this issue Apr 28, 2023

Blank node labels may leak information w3c/vc-di-bbs#60

Closed

peacekeeper assigned yamdan May 10, 2023

peacekeeper added the ready for pr This issue is ready for a PR to be created. label May 10, 2023

yamdan mentioned this issue May 10, 2023

"Herd-privacy" canonicalization #8

Closed

gkellogg added the ms:CR label May 10, 2023

yamdan mentioned this issue May 12, 2023

Add privacy considerations #99

Closed

yamdan mentioned this issue May 19, 2023

Add more general privacy considerations #101

Merged

philarcher added the HorizontalReview label May 24, 2023

yamdan closed this as completed in #101 May 24, 2023

yamdan mentioned this issue Jun 8, 2023

Change most ednotes to issue markers, and remove some closed issues. #115

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some privacy considerations #84

Some privacy considerations #84

yamdan commented Mar 13, 2023

dlongley commented Apr 3, 2023 •

edited

Loading

peacekeeper commented May 10, 2023

Some privacy considerations #84

Some privacy considerations #84

Comments

yamdan commented Mar 13, 2023

a) possible leakage via canonical labeling

b) possible leakage via canonical sorting

c) possible linking via canonical labeling

dlongley commented Apr 3, 2023 • edited Loading

peacekeeper commented May 10, 2023

dlongley commented Apr 3, 2023 •

edited

Loading