FEAT(capa2sarif) Add SARIF conversion script from json output #2093
…put to sarif schema, update dependencies, and update changelog
Clean up from this PR #2036 |
This is looking really good!
Thanks for taking the time to introduce us to SARIF and provide the script. The logic looks reasonable, and aside from some nits that I noticed, I don't see any reason not to merge this soon.
One idea: rather than interacting with the capa JSON directly, you might want to deserialize it into the ResultDocument format that capa provides. This has full type hints that mypy checks, whereas the JSON document doesn't have any codified schema. So if we ever change the JSON document, we'd only notice bugs when this script breaks. By using the type-checked ResultDocument, we can catch that with static analysis tools. That being said, I recognize this would take you a bit more work, so I understand if you can't make the changes now. We can do it at the first bug ;-)
recommend also adding a trivial test to |
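To illustrate the typed-deserialization idea from the review, here is a minimal sketch using only the standard library. `RuleMatch`, `TypedDocument`, and `load_typed` are hypothetical stand-ins, not capa's actual ResultDocument API, and the sample JSON layout is a simplified assumption about the capa output. The point is only that callers touch typed attributes, so a schema change surfaces in one loader instead of throughout the script.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RuleMatch:
    # Hypothetical stand-in for one typed rule entry; not capa's real model.
    name: str
    namespace: str


@dataclass(frozen=True)
class TypedDocument:
    # Hypothetical stand-in for a typed result document.
    rules: Dict[str, RuleMatch]


def load_typed(text: str) -> TypedDocument:
    """Parse the JSON once, then expose only typed attributes to callers."""
    raw = json.loads(text)
    rules = {
        name: RuleMatch(name=name, namespace=body["meta"].get("namespace", ""))
        for name, body in raw["rules"].items()
    }
    return TypedDocument(rules=rules)


# Assumed, simplified sample input for illustration only.
SAMPLE = '{"rules": {"create process": {"meta": {"namespace": "host-interaction/process/create"}}}}'
doc = load_typed(SAMPLE)
print(doc.rules["create process"].namespace)
```

With this shape, a type checker such as mypy can flag a misspelled attribute like `doc.rules[...].namespce`, whereas a misspelled dictionary key in raw JSON access would only fail at runtime.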
These are reasonable, I will work on adding them today, thanks! |
…plied auto formatter for styling
…ipt using existing result document
This should address the above suggestions, thanks again! I am not sure about the test, but using an existing result document seems like a good test case (granted, this would NOT catch breaking changes if the JSON changes over time). I am not sure of a good way to take an input file, run capa, and then run the script afterwards; I think the current approach is wrong, though. |
i think you could use any of the json files in |
awesome!
please resolve merge conflicts and then i'll merge! |
thank you @ReversingWithMe! |
…nt#2093) * feat(capa2sarif): add new sarif conversion script converting json output to sarif schema, update dependencies, and update changelog * fix(capa2sarif): removing copy and paste transcription errors * fix(capa2sarif): remove dependencies from pyproject toml to guarded import statements * chore(capa2sarif): adding node in readme specifying dependency and applied auto formatter for styling * style(capa2sarif): applied import sorting and fixed typo in invocations function * test(capa2sarif): adding simple test for capa to sarif conversion script using existing result document * style(capa2sarif): fixing typo in version string in usage * style(capa2sarif): isort failing due to reordering of typehint imports * style(capa2sarif): fixing import order as isort on local machine was not updating code --------- Co-authored-by: ReversingWithMe <[email protected]> Co-authored-by: Willi Ballenthin <[email protected]>
SARIF gets you navigation for binary beacons from capa in any tool that supports SARIF (e.g., Ghidra, radare, IDA). I expect this to be a core format for binary analysis in the future.
The Static Analysis Results Interchange Format (SARIF) is a standardized format for the output of static analysis tools, which are used to evaluate source code or binaries for things like vulnerabilities or dataflow. SARIF enables different analysis tools to produce results in a common format that can be easily understood, integrated, and acted upon by software development tools and systems. For example, VS Code, Ghidra, radare2, and GitHub all adopt a common standard for representing these types of information.
SARIF describes the analysis being run and the results of that analysis on an artifact. Results include a description of the artifacts related to a run of the tool, where an artifact is source code, a binary file, or an auxiliary data file. Results also include the invocation, i.e. how the tool was run, including version, command line, and any knobs/parameters. The idea is that you can reconstruct where output data came from for things that depend on parameters on specific input. Results themselves are captured via "rules", where each rule is some type of analysis; one could imagine a single rule identifier for all of capa, but that wouldn't be very useful. For each rule/type of information, there is a single message for the finding as well as a property bag which you can shove anything into.
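As a rough sketch of the pieces described above (tool and rules, invocation, artifacts, results with a message and a property bag), the following builds a minimal SARIF 2.1.0 log as a plain Python dict. The helper name `make_sarif`, the `findings` input shape, and the sample values are assumptions for illustration; this is not the output of scripts/capa2sarif.py.

```python
import json


def make_sarif(tool_name, tool_version, command_line, artifact_uri, findings):
    """Build a minimal SARIF 2.1.0 log with a single run.

    `findings` is a hypothetical list of dicts with keys
    'rule_id', 'message', and 'properties'.
    """
    # One rule entry per distinct rule id, preserving first-seen order.
    rule_ids = list(dict.fromkeys(f["rule_id"] for f in findings))
    run = {
        "tool": {
            "driver": {
                "name": tool_name,
                "version": tool_version,
                "rules": [
                    {"id": rid, "shortDescription": {"text": rid}}
                    for rid in rule_ids
                ],
            }
        },
        # How the tool was run.
        "invocations": [
            {"commandLine": command_line, "executionSuccessful": True}
        ],
        # What was analyzed.
        "artifacts": [{"location": {"uri": artifact_uri}}],
        # One result per finding: a single message plus a free-form property bag.
        "results": [
            {
                "ruleId": f["rule_id"],
                "message": {"text": f["message"]},
                "properties": f["properties"],
            }
            for f in findings
        ],
    }
    return {"version": "2.1.0", "runs": [run]}


log = make_sarif(
    "capa",
    "7.0.0",  # assumed version string for illustration
    "capa --json sample.exe_",
    "sample.exe_",
    [
        {
            "rule_id": "create process",
            "message": "create process",
            "properties": {"namespace": "host-interaction/process/create"},
        }
    ],
)
print(json.dumps(log, indent=2))
```

Using one rule per capa rule name, rather than a single rule for all of capa, keeps each finding individually filterable in a SARIF viewer.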
This PR adds a new script that takes a capa output file (~7.0) and converts the JSON to SARIF (JSON with an additional schema). This is a clean start from a previous PR, to clean up the branch history left over from embedding this as argument flags in capa directly. If this feature gets enough usage and is stable enough, adding a dedicated renderer may be desirable, though that would probably be better done natively rather than with third-party dependencies.
This includes additional features for current Radare-specific and Ghidra-specific requirements. I expect both of these to get fixed over time.
Steps to test functionality
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -e .[dev]
git submodule init
git submodule update
capa --json tests/data/5d7c34b6854d48d3da4f96b71550a221.exe_ > capa_result.json
python3 -m json.tool capa_result.json  # test json compliance
python3 scripts/capa2sarif.py capa_result.json -r > capa_radare.sarif
python3 -m json.tool capa_radare.sarif  # test json compliance
r2 tests/data/5d7c34b6854d48d3da4f96b71550a221.exe_
> sarif -i capa_radare.sarif
> sarif -l
In Ghidra, the steps are similar, but pass -g instead of -r. Enable the SARIF extension from Install Extensions, then use Sarif > Read File > capa_ghidra.sarif
Interactive table spawns
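Beyond the `python3 -m json.tool` compliance check in the steps above, a quick shape check can confirm that the output at least looks like a SARIF log before loading it into radare2 or Ghidra. This is a minimal sketch, not a schema validator: `check_sarif_shape` is a hypothetical helper, and it only tests a few fields (`version`, `runs`, `tool.driver.name`, `message.text`) rather than the full SARIF schema.

```python
import json


def check_sarif_shape(text):
    """Return a list of problems; an empty list means the basic shape looks right."""
    try:
        log = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(log, dict):
        return ["top level is not an object"]

    problems = []
    if "version" not in log:
        problems.append("missing 'version'")
    runs = log.get("runs")
    if not isinstance(runs, list):
        problems.append("'runs' is missing or not a list")
        return problems

    for i, run in enumerate(runs):
        if not isinstance(run, dict):
            problems.append(f"runs[{i}] is not an object")
            continue
        driver = run.get("tool", {}).get("driver", {})
        if "name" not in driver:
            problems.append(f"runs[{i}].tool.driver.name missing")
        for j, result in enumerate(run.get("results", [])):
            # Simplification: only message.text is checked here.
            if "text" not in result.get("message", {}):
                problems.append(f"runs[{i}].results[{j}].message.text missing")
    return problems


sample = '{"version": "2.1.0", "runs": [{"tool": {"driver": {"name": "capa"}}, "results": []}]}'
print(check_sarif_shape(sample))
```

To run it against a generated file, read the file contents (for example `open("capa_radare.sarif").read()`) and pass the string to `check_sarif_shape`.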