Stabilize the JSON format #101

Closed
munntjlx opened this issue Dec 11, 2023 · 9 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request reporting Related to reporting of findings

Comments

@munntjlx
Contributor

Is your feature request related to a problem? Please describe.
We use Nosey Parker as part of a DefectDojo secret tracking and discovery process. We consume the JSONL format, but it seems to change arbitrarily, even between minor versions. Each time the JSON format changes, our unit tests fail and we have to modify our DefectDojo parser scripts.

Describe the solution you'd like
Can we stabilize the JSON format, or agree to change it only on x versions? This would give derivative works or programs a 'stable' base upon which to build.

Describe alternatives you've considered
Perhaps we change the JSON format only every .2 or .4 increments of a new version? Or an 'odd' vs. 'even' scheme that gives us some stability in the JSON format?

Additional context
Just to note this is a GREAT project and we really appreciate the work you are doing!

@munntjlx munntjlx added the enhancement New feature or request label Dec 11, 2023
@bradlarsen bradlarsen added documentation Improvements or additions to documentation reporting Related to reporting of findings labels Dec 11, 2023
@bradlarsen
Collaborator

Thanks for the request @munntjlx. Yes, this is a worthy task. See the related #72.

I have tried with mixed success up to this point to only add fields to the JSON format. With a lenient parser — one that ignores unknown additional fields — the core information should still be parseable in the presence of additions, without having to change the parser.
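To make the idea of a lenient parser concrete, here is a minimal sketch in Python. The field names `rule_name` and `num_matches` are assumptions chosen for illustration, not a statement of what the report actually contains; the point is only that the parser picks out the fields it needs and ignores everything else.

```python
import json

# Hypothetical field names, used only for illustration; the real report
# fields depend on the Nosey Parker version in use.
REQUIRED_FIELDS = ("rule_name", "num_matches")


def parse_findings(jsonl_text):
    """Parse JSONL leniently: keep the known fields, ignore any extras."""
    findings = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        # Copy only the fields this consumer relies on; unknown additional
        # fields are silently dropped instead of causing a failure.
        findings.append({name: record[name] for name in REQUIRED_FIELDS})
    return findings
```

A parser written this way keeps working when a release only adds fields, and fails loudly only when a field it depends on is renamed or removed.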

It may take another release or three, but this is in my plans. (Relatedly, I'd like to also stabilize the SQL schema for the datastore, but that will likely take longer.)

@bradlarsen
Collaborator

P.S. @munntjlx out of curiosity, are your defect dojo parser scripts proprietary, or are they part of the DefectDojo project?

@munntjlx
Contributor Author

munntjlx commented Dec 11, 2023 via email

@bradlarsen
Collaborator

Anyway, aside from trying hard to not change the JSON output between releases, the place to start with stabilization is to write a schema and validate against that (at least in testing): #72.
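As a rough illustration of validating report output against a schema in tests, the sketch below checks a record against a tiny subset of JSON Schema (`type`, `required`, `properties`, `items`) using only the standard library. It is a stand-in for a real JSON Schema validator, not a replacement for one, and the schema and field names shown are hypothetical.

```python
# Maps JSON Schema type names to Python types (a small subset only).
TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}


def validate(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value conforms."""
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(value, TYPES[expected])
        # bool is a subclass of int in Python, but JSON treats them as distinct.
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(value, dict):
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}: missing required field {name!r}")
        # Fields not listed under "properties" are allowed and ignored.
        for name, subschema in schema.get("properties", {}).items():
            if name in value:
                errors.extend(validate(value[name], subschema, f"{path}.{name}"))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]"))
    return errors


# Hypothetical schema for one report record, for illustration only.
FINDING_SCHEMA = {
    "type": "object",
    "required": ["rule_name", "num_matches"],
    "properties": {
        "rule_name": {"type": "string"},
        "num_matches": {"type": "integer"},
    },
}
```

In a test suite, asserting that `validate(record, FINDING_SCHEMA)` returns an empty list for every emitted record would flag an accidental rename or removal before a release.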

@bradlarsen
Collaborator

@munntjlx there are more changes to the JSON report coming in the next release (#122). Some fundamentals have changed in the Nosey Parker data model that require some visible changes to the JSON report format.

I'm hoping that these changes are the last major ones, and that future format changes will be infrequent and involve only adding new fields.

I'm also hoping to put together a JSON Schema definition of the report format for the next release. This should help both with documentation of the format and also help with identifying changes of that format.

bradlarsen added a commit that referenced this issue Feb 17, 2024
This is a big PR that makes a number of significant changes to the Nosey Parker data model.

- The minimum supported Rust version has been changed from 1.70 to 1.76.

- The data model and datastore have been significantly overhauled:

  - The rules used during scanning are now explicitly recorded in the datastore.
    Each rule is additionally accompanied by a content-based identifier that uniquely identifies the rule based on its pattern.

  - Each match is now associated with the rule that produced it, rather than just the rule's name (which can change as rules are modified).

  - Each match is now assigned a unique content-based identifier.

  - Findings (i.e., groups of matches with the same capture groups, produced by the same rule) are now represented explicitly in the datastore.
    Each finding is assigned a unique content-based identifier.

  - Now, each time a rule matches, a single match object is produced.
    Each match in the datastore is now associated with an array of capture groups.
    Previously, a rule whose pattern had multiple capture groups would produce one match object for each group, with each one being associated with a single capture group.

  - Provenance metadata for blobs is recorded in a much simpler way than before.
    The new representation explicitly records file and git-based provenance, but also adds explicit support for _extensible_ provenance.
    This change will make it possible in the future to have Nosey Parker scan and usefully report blobs produced by custom input data enumerators (e.g., a Python script that lists files from the Common Crawl WARC files).

  - Scores are now associated with matches instead of findings.

  - Comments can now be associated with both matches and findings, instead of just findings.

- The JSON and JSONL report formats have changed.
  These will stabilize in a future release ([#101](#101)).

  - The `matching_input` field for matches has been removed and replaced with a new `groups` field, which contains an array of base64-encoded bytestrings.

  - Each match now includes additional `rule_text_id`, `rule_structural_id`, and `structural_id` fields.

  - The `provenance` field of each match is now slightly different.

- Schema migration of older Nosey Parker datastores is no longer performed.
  Previously, this would automatically and silently be done when opening a datastore from an older version.
  Explicit support for datastore migration may be added back in a future release.
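
For consumers of the report, the new `groups` field described above could be read roughly as follows. This is a sketch that assumes each element of `groups` is a plain base64 string, as the description suggests; the helper names and the sample match object are made up for illustration and are not real Nosey Parker output.

```python
import base64


def decode_groups(match):
    """Return the capture groups of one match object as raw bytes."""
    # validate=True rejects characters outside the base64 alphabet
    # instead of silently discarding them.
    return [base64.b64decode(group, validate=True) for group in match["groups"]]


def groups_as_text(match):
    """Decode the groups for display, replacing bytes that are not valid UTF-8."""
    return [raw.decode("utf-8", errors="replace") for raw in decode_groups(match)]


# A made-up match object with two capture groups.
example_match = {
    "groups": [
        base64.b64encode(b"example-user").decode("ascii"),
        base64.b64encode(b"example-secret").decode("ascii"),
    ]
}
```

Capture groups are arbitrary bytes, so keeping them as `bytes` and converting to text only for display avoids decode errors on binary content.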
@munntjlx
Contributor Author

munntjlx commented Feb 20, 2024 via email

@bradlarsen
Collaborator

> Describe the solution you'd like
> Can we stabilize the JSON format, or agree to change it only on x versions? This would give derivative works or programs a 'stable' base upon which to build.

Possible changes to the JSON format will become less frequent as Nosey Parker becomes more mature.

Though I'm not going to promise not to change the JSON format at this point, I will provide an updated JSON schema and an announcement in the release notes when this does happen. The JSON schemas between versions could be diffed to understand what has changed. (You can get the JSON schema from the v0.17.0 releases, or using the new `noseyparker generate json-schema` command.)
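
One way to diff two schema versions is sketched below, assuming both schemas have already been loaded into Python dicts (for example with `json.load`). The function names are hypothetical, and the walk only follows inline `properties` and `items`; it does not resolve `$ref`, so a schema that relies on references would need extra handling.

```python
def property_paths(schema, prefix=""):
    """Collect dotted paths of all properties declared inline in a schema."""
    if not isinstance(schema, dict):
        return set()  # JSON Schema also allows boolean schemas
    paths = set()
    for name, subschema in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        paths.add(path)
        paths |= property_paths(subschema, prefix=path + ".")
    items = schema.get("items")
    if isinstance(items, dict):
        # Array elements are marked with "[]" in the reported path.
        paths |= property_paths(items, prefix=prefix.rstrip(".") + "[].")
    return paths


def diff_schemas(old_schema, new_schema):
    """Report property paths that were added or removed between two schemas."""
    old_paths = property_paths(old_schema)
    new_paths = property_paths(new_schema)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
    }
```

An empty `removed` list suggests the change is purely additive, which is the case a lenient parser can absorb without modification.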

I don't have any changes that I want to make to the JSON format at the moment. I suspect that future modifications to it would be in the form of additional data rather than renaming fields or changing its organization.

@munntjlx
Contributor Author

munntjlx commented Mar 8, 2024

Our Nosey Parker decoder has been added to the main DefectDojo project. It probably needs to be updated to work beyond v0.16.

@munntjlx
Contributor Author

munntjlx commented Mar 8, 2024 via email
