RDF Canonicalization #855

Closed
philarcher opened this issue Jun 9, 2023 · 4 comments
Assignees
Labels
Resolution: satisfied The TAG is satisfied with this design security-tracker Group bringing to attention of security, or tracked by the security Group but not needing response. Topic: Data Topic: security features Venue: RDF Canonicalization WG

Comments

@philarcher

Hello, TAG!

I'm requesting a TAG review of RDF Data Canonicalization.

There are a variety of use cases that depend on the ability to calculate a unique and deterministic hash value of RDF Datasets, such as Verifiable Credentials, the publication of biological and pharmaceutical data, or consumption of mission critical RDF vocabularies that depend on the ability to verify the authenticity and integrity of the data being consumed. See the use cases for more examples. These use cases require a standard way to process the underlying graphs contained in RDF Datasets that is independent of the serialization itself.

  • An explainer was created to support the WG's charter. The current draft of the specification (2023-06-09) indicates that we plan to link to the explainer document and also to augment that section of the spec with further detail covering aspects that have come to light as the spec has evolved.
  • Specification URL: https://www.w3.org/TR/2023/WD-rdf-canon-20230609/
  • Tests: are at https://w3c.github.io/rdf-canon/tests/
  • Current implementations are listed at https://github.com/w3c/rdf-canon/wiki/List-of-available-implementations
  • User research: [url to public summary/results of research] N/A
  • Security and Privacy self-review: Security and privacy considerations w3c/rdf-canon#70 (reviews have been requested simultaneously with this request to the TAG)
  • GitHub repo (if you prefer feedback filed there): TAG Review w3c/rdf-canon#118 please
  • Primary contacts (and their relationship to the specification):
    • Greg Kellogg (gkellogg), [Invited Expert] (editor)
    • Dave Longley (dlongley), [Digital Bazaar] (editor)
    • Dan Yamamoto (yamdan), [Invited Expert] (editor)
    • Phil Archer (philarcher), [GS1] (WG co-chair)
    • Markus Sabadello (peacekeeper), [Danube Tech] (WG co-chair)
  • Organization(s)/project(s) driving the specification: Although not exclusively about Verifiable Credentials, that technology is a major driver and there is a lot of overlap in personnel in that group.
  • Key pieces of existing multi-stakeholder review or discussion of this specification - please note the extensive list of existing implementations.
  • External status/issue trackers for this specification (publicly visible, e.g. Chrome Status): N/A

Further details:

  • [✓] I have reviewed the TAG's Web Platform Design Principles
  • Relevant time constraints or deadlines: We hope to go to CR in July or August at the latest, i.e. before TPAC. The VCWG's work on ECDSA has a dependency on RDF Dataset Canonicalization
  • The group where the work on this specification is currently being done: RDF Dataset Canonicalization and Hash
  • The group where standardization of this work is intended to be done (if current group is a community group or other incubation venue): N/A
  • Major unresolved issues with or opposition to this specification: There are open issues at this time but no disputes
  • This work is being funded by:

You should also know that...

The spec has a long history and has implementations using the original version in production software.

We'd prefer the TAG provide feedback as (please delete all but the desired option):

💬 leave review feedback as a comment in this issue and @-notify gkellogg, dlongley, yamdan, philarcher, peacekeeper.

@hadleybeeman hadleybeeman self-assigned this Jun 15, 2023
@rhiaro rhiaro self-assigned this Jun 15, 2023
@torgo torgo added Topic: Data security-tracker Group bringing to attention of security, or tracked by the security Group but not needing response. and removed Progress: untriaged labels Jun 15, 2023
@torgo torgo added this to the 2023-06-19-week milestone Jun 15, 2023
@torgo torgo modified the milestones: 2023-06-19-week, 2023-07-03 week Jul 3, 2023
@rhiaro
Contributor

rhiaro commented Aug 3, 2023

Hi @gkellogg @dlongley @yamdan @philarcher @peacekeeper

We (@hadleybeeman and I) reviewed this in our virtual face-to-face this week. We like the direction of the work, and the design is sensible.

We noticed you haven't yet filled out the privacy and security questionnaire. Understanding that not all of the questions may be relevant, please could you do this?

Also, we see that you are using quads instead of triples and adding in the graph name once? It sounds more complex — but we suspect you have considered this at length. We are just interested in your thought process here. (This is the sort of thing we normally expect to see in an explainer.)

Also, we'd love to see the explainer once you've updated it to bring it in line with the spec.

And finally, what happens if the hashing algorithm becomes insecure? It might be helpful to put a comment in the security considerations section to advise implementers in the future to consider that possibility.

@gkellogg

gkellogg commented Aug 3, 2023

Thanks @rhiaro, we'll need to take this up in the WG.

As for the use of quads vs. triples, note that this is a spec for datasets, not just graphs, so the graph name component is necessary for recording this information. Use cases including Verifiable Credentials depend on the use of datasets, and not just graphs, so canonicalizing the entire dataset is important. Algorithmically, including the graph name as a potential location for a blank node in addition to the subject and object positions has a fairly minor impact.

Although RDF Concepts suggests an interpretation of a set of graphs, all but one of which can have a graph name, it is fully consistent with the N-Quads representation, which is convenient for the algorithm. A hypothetical variation might have created a hash for each graph and then hashed the graph name/graph hash pairs, but it remains necessary to consider that blank nodes may appear across graphs, and indeed as the graph name, so it doesn't really change the need to consider blank nodes across the dataset and not just within each graph.
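
To make the point about blank nodes crossing graph boundaries concrete, here is a minimal sketch. It is not part of the specification: the tuple-based quad representation, the `blank_node_positions` helper, and the example IRIs are all assumptions made for illustration. It only records where each blank node occurs, including the graph name position.

```python
from collections import defaultdict

# Hypothetical representation: each quad is (subject, predicate, object, graph),
# with terms written as N-Quads strings and None standing for the default graph.
dataset = [
    ("<http://example.org/alice>", "<http://example.org/claims>", "_:g", None),
    ("_:x", "<http://example.org/name>", '"Alice"', "_:g"),
    ("_:x", "<http://example.org/knows>", "<http://example.org/bob>", None),
]

def blank_node_positions(quads):
    """Map each blank node label to the set of (role, graph) pairs where it occurs."""
    positions = defaultdict(set)
    for s, p, o, g in quads:
        for role, term in (("subject", s), ("object", o), ("graph", g)):
            if term is not None and term.startswith("_:"):
                positions[term].add((role, g))
    return dict(positions)

positions = blank_node_positions(dataset)
# _:g is an object in the default graph and also names a graph;
# _:x is a subject both in the default graph and in the graph named _:g.
for label, where in sorted(positions.items()):
    print(label, sorted(where, key=str))
```

In this toy dataset a per-graph pass would see `_:x` twice without knowing the two occurrences are the same node, which is the situation the paragraph above describes.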

Good point about noting the implications on the algorithm for some potential future vulnerability. Note that there is text to indicate that the algorithm can be used with different hashing algorithms with minimal change:

NOTE
Implementations can be written to parameterize the hash algorithm without any other changes. However, using a different hash algorithm is expected to generate different output from RDFC-1.0.

However, the security issues that might motivate this can be better highlighted.
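
As a rough illustration of what parameterizing the hash algorithm could look like, the sketch below hashes the quads that mention one blank node, with the hash name passed in as an argument. It is only loosely modeled on the first-degree hashing step and is not the normative algorithm: the tuple-based quads, the `serialize` and `first_degree_hash` helpers, and the `_:a`/`_:z` placeholder handling as written here are simplifying assumptions.

```python
import hashlib

def serialize(quad, reference):
    """Render a quad as one N-Quads-like line, masking blank node labels."""
    terms = []
    for term in quad:
        if term.startswith("_:"):
            # The blank node being hashed becomes _:a, any other blank node _:z.
            terms.append("_:a" if term == reference else "_:z")
        else:
            terms.append(term)
    return " ".join(terms) + " .\n"

def first_degree_hash(quads, reference, hash_name="sha256"):
    """Hash the sorted, masked quads that mention `reference` (hypothetical helper)."""
    lines = sorted(serialize(q, reference) for q in quads if reference in q)
    return hashlib.new(hash_name, "".join(lines).encode("utf-8")).hexdigest()

quads = [
    ("_:b0", "<http://example.org/p>", "_:b1"),
    ("_:b1", "<http://example.org/q>", '"v"'),
]

default_digest = first_degree_hash(quads, "_:b0")
other_digest = first_degree_hash(quads, "_:b0", hash_name="sha384")
print(default_digest)
print(other_digest)
```

Swapping `hash_name` changes nothing else in the code, but the digests differ, which is why output produced with another hash algorithm would not match RDFC-1.0 output.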

@philarcher
Author

@rhiaro, @hadleybeeman
We have added text related to the potential for hash algorithms to be shown to be insecure (Markus's addition above can be seen as a short section at https://www.w3.org/TR/rdf-canon/#insecure-hash-algorithms). A further addition concerning the use of alternative hash mechanisms is in preparation (w3c/rdf-canon#161).

Meanwhile, we have been through the P&S questionnaire and offer the following responses.

As an overall comment, RDF Dataset Canonicalization takes an RDF dataset as input and returns a different form of the same dataset as output (unless the input is already canonicalized - the process is idempotent). The questionnaire is well-suited to highlighting potential security and privacy issues with Web applications running in browsers. As our specification only specifies an algorithm for handling data, many of the questions don’t apply to our work.
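
The idempotence property mentioned above can be sketched with a toy stand-in. This is not RDFC-1.0: the hypothetical `toy_canonicalize` below only deduplicates and sorts N-Quads lines, which is a reasonable stand-in solely for datasets that contain no blank nodes. It shows that two serializations of the same data hash differently until they are put into one canonical form, and that canonicalizing twice changes nothing.

```python
import hashlib

def toy_canonicalize(nquads: str) -> str:
    """Toy stand-in: unique, sorted N-Quads lines (assumes no blank nodes)."""
    lines = {line.strip() for line in nquads.splitlines() if line.strip()}
    return "".join(line + "\n" for line in sorted(lines))

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# The same two statements, serialized in a different order.
doc_a = (
    '<http://example.org/s> <http://example.org/p> "1" .\n'
    '<http://example.org/s> <http://example.org/p> "2" .\n'
)
doc_b = (
    '<http://example.org/s> <http://example.org/p> "2" .\n'
    '<http://example.org/s> <http://example.org/p> "1" .\n'
)

canon_a = toy_canonicalize(doc_a)
canon_b = toy_canonicalize(doc_b)

print(sha256_hex(doc_a) == sha256_hex(doc_b))      # raw serializations
print(sha256_hex(canon_a) == sha256_hex(canon_b))  # canonical forms
print(toy_canonicalize(canon_a) == canon_a)        # idempotence
```

Handling blank nodes is what makes the real algorithm non-trivial; the toy version sidesteps that entirely.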

Implementations may interact with the Web, of course, but such interactions are not specified in the document and are therefore out of scope. That said, the privacy and security considerations sections of the document highlight issues of which any implementation should be aware.

2.1 What information might this feature expose to Web sites or other parties, and for what purposes is that exposure necessary?

The document defines an algorithm that canonicalizes an RDF dataset. It does not introduce or remove any information from the dataset, and does not expose any new information.

2.2 Do features in your specification expose the minimum amount of information necessary to enable their intended uses?

Yes. The specification defines an algorithm that canonicalizes whatever data is given. The output from the algorithm includes canonicalized identifiers for blank nodes that are produced from the input. New information that wasn’t in the dataset being processed isn’t introduced.

2.3 How do the features in your specification deal with personal information, personally-identifiable information (PII), or information derived from them?

The algorithm canonicalizes any data given to it. Decisions on handling personally identifiable information are up to the application. Therefore these issues, while obviously important, are out of scope for the draft standard.

2.4 How do the features in your specification deal with sensitive information?

See previous answer. Data is only used internally within the application. How any sensitive data is handled is up to the implementation.

2.5 Do the features in your specification introduce new state for an origin that persists across browsing sessions?

No.

2.6 Do the features in your specification expose information about the underlying platform to origins?

No.

2.7 Does this specification allow an origin to send data to the underlying platform?

No.

2.8 Do features in this specification enable access to device sensors?

No.

2.9 Do features in this specification enable new script execution/loading mechanisms?

No.

2.10 Do features in this specification allow an origin to access other devices?

No.

2.11 Do features in this specification allow an origin some measure of control over a user agent’s native UI?

No.

2.12 What temporary identifiers do the features in this specification create or expose to the web?

None. While the specification defines an algorithm that transforms identifiers, the algorithm itself does not expose these to the web. It is up to the application that uses the algorithm to decide whether or how to expose any output from the algorithm.

2.13 How does this specification distinguish between behavior in first-party and third-party contexts?

It does not. The specification defines a canonicalization algorithm that internally rearranges input data to output data. It is up to the application to feed data into the algorithm and use whatever its outputs are.

2.14 How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode?

This is out of scope. The specification defines an algorithm that can be run in whatever context the application decides to run it in and the algorithm only rearranges input data into a canonical form. Whether the application runs in a browser at all is not defined by this spec.

2.15 Does this specification have both "Security Considerations" and "Privacy Considerations" sections?

Yes. Privacy considerations. Security considerations.

2.16 Do features in your specification enable origins to downgrade default security protections?

No.

2.17 How does your feature handle non-"fully active" documents?

It does not; this is out of scope for a canonicalization algorithm. The canonicalization algorithm works on RDF datasets, which are unrelated to non-"fully active" documents.

2.18 What should this questionnaire have asked?

As noted in the preamble, the questionnaire focuses on browsers/Web apps. It does not target the needs of data representation formats, so it is not particularly useful for a whole category of specifications. This might be useful feedback for the privacy group in the long term to add questions to cover more specifications.

@rhiaro rhiaro added the Progress: propose closing we think it should be closed but are waiting on some feedback or consensus label Sep 11, 2023
@rhiaro
Contributor

rhiaro commented Oct 25, 2023

So sorry for the delay in closing this, we thought we already had! We're happy to see this go forward, and thanks for your detailed responses to our questions.

@rhiaro rhiaro closed this as completed Oct 25, 2023
@rhiaro rhiaro added Resolution: satisfied The TAG is satisfied with this design and removed Progress: propose closing we think it should be closed but are waiting on some feedback or consensus labels Oct 25, 2023