Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deserialization silently loads broken bom #677

Open
wkoot opened this issue Sep 18, 2024 · 3 comments · May be fixed by #678 or #754
Open

Deserialization silently loads broken bom #677

wkoot opened this issue Sep 18, 2024 · 3 comments · May be fixed by #678 or #754
Labels
bug Something isn't working help wanted Extra attention is needed question Further information is requested

Comments

@wkoot
Copy link

wkoot commented Sep 18, 2024

It seems that deserialization silently loads broken bom data, omitting components due to deduplication.
For reproduction, refer to the sample testdata output as produced in CycloneDX/cyclonedx-cli#399:

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "serialNumber": "urn:uuid:66fa5692-2e9d-45c5-830a-ec8ccaf7dcc9",
  "version": 1,
  "metadata": {
    "component": {
      "type": "application",
      "name": "test"
    }
  },
  "components": [
    {
      "type": "operating-system",
      "bom-ref": "test12",
      "name": "alpine"
    },
    {
      "type": "container",
      "bom-ref": "test11",
      "name": "alpine"
    },
    {
      "type": "operating-system",
      "bom-ref": "test22",
      "name": "alpine"
    },
    {
      "type": "container",
      "bom-ref": "test21",
      "name": "alpine"
    }
  ],
  "dependencies": [
    {
      "ref": "test11",
      "dependsOn": [
        "test12"
      ]
    },
    {
      "ref": "test21",
      "dependsOn": [
        "test22"
      ]
    }
  ]
}

Deserialize and validate bom:

>>> with open('test/out.json', mode="r") as testfile:
...   json_bom = load(testfile)
>>> len(json_bom['components'])
4
>>> len(json_bom['dependencies']) 
2
>>> bom = Bom.from_json(json_bom)
>>> len(bom.components)
2
>>> len(bom.dependencies) 
2
>>> bom.validate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lib\site-packages\cyclonedx\model\bom.py", line 666, in validate
    raise UnknownComponentDependencyException(
cyclonedx.exception.model.UnknownComponentDependencyException: One or more Components have Dependency references to Components/Services that are not known in this BOM. They are: {<BomRef 'test22' id=2111773432208>, <BomRef 'test21' id=2111773432160>}

Because Bom.from_json() doesn't throw an error, there's no telling what data is actually lost during deserialization.

@jkowalleck jkowalleck added the bug Something isn't working label Sep 18, 2024
@jkowalleck
Copy link
Member

@wkoot, do you want to give it a try fixing this?

@wkoot
Copy link
Author

wkoot commented Sep 18, 2024

I believe that making the bom-ref a part of __hash__ and therefore the equality operator would solve this.

But this boils down to an architectural choice regarding model strictness; how much deduplication should be enforced?
It is apparent that some tools (including cyclonedx-cli) produce CycloneDX sbom json files which are lax in this regard.
By including the bom-ref in the object hash strictness could be traded for practicality, saving a lot of hassle.

I think there are three options:

  1. Throw an error when loading invalid data and hope to convince other tools to validate their output
  2. Make this library (and perhaps the spec? I don't know if it's in there) lax and make hash include bom-ref
  3. Implement a strictness setting functioning as a switch between hash calcs with or without bom-ref

The first option might be an uphill battle, while the other two will most likely mean that other tools will not be changed.
However, perhaps it isn't very likely that other tools will be changed - that might depend on what's in the spec exactly.
Option three is probably the worst, as it would both complicate implementation but still remove incentive for other tools.

It would be easier if there was a working merge or deduplicate implementation that could be referred to :-)
I also found that such an operation can be rather expensive on my dataset of ~30 sbom files (totalling about 18 MB raw json). Running cyclonedx-cli merge takes 4-6 minutes, a correctly deduplicated merge on the same dataset takes about 40 minutes.

@jkowalleck
Copy link
Member

jkowalleck commented Sep 19, 2024

I believe that making the bom-ref a part of __hash__ and therefore the equality operator would solve this.

You could add your example to this very project's test cases and alter the code to showcase this.
Feel free to pullrequest a working example.

I think there are three options:

  1. Throw an error when loading invalid data and hope to convince other tools to validate their output
  2. Make this library (and perhaps the spec? I don't know if it's in there) lax and make hash include bom-ref
  3. Implement a strictness setting functioning as a switch between hash calcs with or without bom-ref

re:

  1. throwing an error: nope, deserialization is best-effort and tries hard to generate data structures in a very forgiving way.(maybe throwing warnings can be a thing)
  2. [...] lax and make hash include bom-ref: feel free to open a pullrequest that includes tests showcasing this helps your issue.
  3. Implement a strictness setting: feel free to open a pullrequest that includes tests showcasing this helps your issue.

It would be easier if there was a working merge or deduplicate implementation that could be referred to :-)

There is just no agreement/consensus on the merging algorithms - see CycloneDX/specification#320.
Such an implementation would not be part of the core library for a long time, until it has proven general value and overall correctness. (Many implemented their own/custom merge approach fitting their needs, utilizing this very library - you may do so, too)

@jkowalleck jkowalleck added help wanted Extra attention is needed question Further information is requested labels Sep 19, 2024
wkoot added a commit to wkoot/cyclonedx-python-lib that referenced this issue Sep 19, 2024
@wkoot wkoot linked a pull request Sep 19, 2024 that will close this issue
wkoot added a commit to wkoot/cyclonedx-python-lib that referenced this issue Sep 19, 2024
wkoot added a commit to wkoot/cyclonedx-python-lib that referenced this issue Sep 19, 2024
wkoot added a commit to wkoot/cyclonedx-python-lib that referenced this issue Sep 19, 2024
wkoot added a commit to wkoot/cyclonedx-python-lib that referenced this issue Sep 19, 2024
@jkowalleck jkowalleck linked a pull request Jan 17, 2025 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed question Further information is requested
Projects
None yet
2 participants