Add Validate Format functionality #580

Merged: 5 commits, May 21, 2024

alisonlomaka
Member

Implement a format validation service in support of redacting the Files section from an SPDX 2.x format SBOM.

This is a naive implementation that does not use the JSON object streaming mechanism used for drop validation, because amending that logic to customize the validation behavior is fairly complex. Since we are operating on a very tight timeline, the preliminary redaction implementation will deserialize the entire JSON file at once, on the assumption that this will cover a large proportion of our in-the-wild use cases. A follow-on investigation will determine the practical limits of this technique and whether it needs to be extended to support larger SBOMs.

The criteria for a "valid" SBOM include:

  • well-formed JSON document
  • includes required SPDX elements
  • SPDX v2.x
  • Packages and Relationships sections present

Validating SBOMs for SPDX elements covering the NTIA definition of an SBOM is out of scope for this change. Such validation may not be practical when processing 3P SBOMs since an SBOM creator could choose to use different attributes to satisfy the NTIA requirements.

In the context of redaction, we skip deserializing the Files section, since removing it is the purpose of redaction anyway (removing it is also expected to reduce the size of an SBOM by ~50%). We also require Packages and Relationships, since we need to operate on those in order to redact successfully. These ignore/require expectations are set at runtime but are currently hardcoded; making them configurable is out of scope for this feature but definitely feasible.
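A minimal sketch of the ignore/require idea, written in Python for brevity rather than in the project's own language. The names `IGNORED_SECTIONS`, `REQUIRED_SECTIONS`, and `load_for_redaction` are hypothetical and do not come from this PR. Note one deliberate simplification: this sketch parses the whole document and then discards the ignored section, whereas the change described above avoids deserializing it in the first place.

```python
import json

# Hypothetical defaults mirroring the hardcoded expectations described above.
IGNORED_SECTIONS = frozenset({"files"})
REQUIRED_SECTIONS = frozenset({"packages", "relationships"})


def load_for_redaction(text, ignored=IGNORED_SECTIONS, required=REQUIRED_SECTIONS):
    """Parse an SPDX 2.x JSON document for redaction.

    Returns (document_without_ignored_sections, missing_required_sections).
    Raises ValueError if the text is not a well-formed JSON object
    (json.JSONDecodeError is a subclass of ValueError).
    """
    document = json.loads(text)
    if not isinstance(document, dict):
        raise ValueError("top-level JSON value must be an object")
    # Drop the sections that redaction would remove anyway.
    kept = {key: value for key, value in document.items() if key not in ignored}
    # Report which of the sections needed for redaction are absent.
    missing = sorted(section for section in required if section not in kept)
    return kept, missing
```

Passing the two sets as parameters is one way the currently hardcoded expectations could later be made configurable.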

Well-formed JSON and the required SPDX elements are enforced, explicitly and implicitly, by the JSON model definition and deserialization. This also implicitly enforces the SPDX version, since v3 is so different in format that deserialization would fail before we ever reached a version check. Nonetheless, we make our expectations explicit by also parsing and verifying the spdxVersion value.
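The explicit version check could look roughly like the following Python sketch. It assumes the `spdxVersion` value has the form `SPDX-M.N` (for example `SPDX-2.3`); the helper names are hypothetical and this is not the logic of the PR's `SPDXVersionParser`.

```python
import re

# Assumed shape of the spdxVersion value: "SPDX-<major>.<minor>".
_VERSION_PATTERN = re.compile(r"SPDX-([0-9]+)\.([0-9]+)")


def parse_spdx_version(value):
    """Return (major, minor) for a value like "SPDX-2.3", or None if it does not match."""
    if not isinstance(value, str):
        return None
    match = _VERSION_PATTERN.fullmatch(value)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported_version(value):
    """Accept any SPDX 2.x version and reject everything else."""
    parsed = parse_spdx_version(value)
    return parsed is not None and parsed[0] == 2
```

Checking only the major version keeps the rule aligned with the stated goal of accepting all SPDX 2.x documents.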

Expanding support to all SPDX 2.x versions (previously only SPDX 2.2.1) was in scope for this feature, and was accomplished by eliminating the use of enums for deserialization (an earlier PR) and by adding to the JSON model all properties found in the SPDX 2.3 exemplar document (https://github.com/spdx/spdx-spec/blob/development/v2.3.1/examples/SPDXJSONExample-v2.3.spdx.json). SPDX 2.3 is backwards compatible with earlier 2.x versions. Note that this expanded SPDX version support applies only to the Validate Format and Redact functionality, not to Generate or Drop Validation. Adding all documented properties to the model also ensures that we do not lose data when deserializing and re-serializing after redaction.
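To illustrate why model completeness matters for round-tripping, here is a small Python sketch. It assumes a typed model that silently ignores properties it does not declare; the classes `NarrowPackage` and `WidePackage` and the helper `round_trip` are hypothetical and are not the project's model types. `primaryPackagePurpose` is used as an example of a package property introduced in SPDX 2.3.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class NarrowPackage:
    """Hypothetical model that predates SPDX 2.3 properties."""
    name: str
    SPDXID: str


@dataclass
class WidePackage:
    """Hypothetical model that also declares an SPDX 2.3 property."""
    name: str
    SPDXID: str
    primaryPackagePurpose: Optional[str] = None


def round_trip(model, raw):
    """Deserialize raw into model, ignoring undeclared keys, then serialize it back."""
    known = {field.name for field in fields(model)}
    instance = model(**{key: value for key, value in raw.items() if key in known})
    return asdict(instance)
```

With these definitions, a package carrying `primaryPackagePurpose` loses that property when round-tripped through `NarrowPackage` and keeps it when round-tripped through `WidePackage`.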

alisonlomaka requested a review from a team as a code owner on May 21, 2024 14:45
alisonlomaka requested review from jalkire and edgarrs on May 21, 2024 14:45

codecov-commenter commented May 21, 2024

Codecov Report

Attention: Patch coverage is 77.77778%, with 46 lines in your changes missing coverage. Please review.

Project coverage is 59.36%. Comparing base (1d832e0) to head (c1e6891).

| Files | Patch % | Lines |
|---|---|---|
| ...icrosoft.Sbom.Api/FormatValidator/ValidatedSBOM.cs | 61.38% | 36 Missing and 3 partials ⚠️ |
| ...arsers.Spdx22SbomParser/Utils/SPDXVersionParser.cs | 62.50% | 4 Missing and 2 partials ⚠️ |
| ...s.Spdx22SbomParser/Entities/FormatEnforcedSPDX2.cs | 88.88% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #580      +/-   ##
==========================================
+ Coverage   58.81%   59.36%   +0.54%     
==========================================
  Files         254      266      +12     
  Lines        7894     8101     +207     
  Branches      922      947      +25     
==========================================
+ Hits         4643     4809     +166     
- Misses       2834     2870      +36     
- Partials      417      422       +5     

Remove unreferenced objects from FormatValidationService
Modify MultilineSummary method to initialize the SBOM if not yet initialized. Since the underlying document is lazy-loaded, all public methods should call Initialize before doing their work.
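The lazy-loading rule in this commit message (every public method calls Initialize first) can be sketched as follows. This is a generic Python illustration; `LazySbomView`, its `loader` argument, and the summary format are hypothetical and do not reflect the actual `ValidatedSBOM` class.

```python
class LazySbomView:
    """Hypothetical wrapper whose underlying document is loaded on first use."""

    def __init__(self, loader):
        self._loader = loader  # callable that returns the parsed document (a dict)
        self._document = None
        self._initialized = False

    def initialize(self):
        # Idempotent: the loader runs at most once, however often this is called.
        if not self._initialized:
            self._document = self._loader()
            self._initialized = True

    def package_count(self):
        self.initialize()  # every public method initializes before doing its work
        return len(self._document.get("packages", []))

    def multiline_summary(self):
        self.initialize()  # without this call, _document could still be None here
        lines = [
            "version: " + str(self._document.get("spdxVersion")),
            "packages: " + str(len(self._document.get("packages", []))),
        ]
        return "\n".join(lines)
```

Because `initialize` is idempotent, callers can invoke the public methods in any order without triggering a second load.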
alisonlomaka requested a review from sfoslund on May 21, 2024 18:55
alisonlomaka merged commit b005f3f into main on May 21, 2024
6 checks passed
alisonlomaka deleted the alisonl/validateFormatPart2 branch on May 21, 2024 18:57
tarun06 pushed a commit to tarun06/sbom-tool that referenced this pull request Jul 21, 2024
* Wire up a validate-format verb that runs the format validation logic for a specified file. This is a hidden verb that I imagine being used during development and, in the future, by the DRI to explore an SBOM or troubleshoot a validation failure.

* Fix PR comment - removing duplicated parameter validation code.

* Implement a format validation service in support of redacting the Files section from an SPDX 2.x format SBOM.

* Sanitize test strings to remove PII
Remove unreferenced objects from FormatValidationService
Modify MultilineSummary method to initialize the SBOM if not yet initialized. Since the underlying document is lazy-loaded, all public methods should call Initialize before doing their work.

3 participants