Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize cache and extractor interface #1470

Merged
merged 18 commits into from
Jun 13, 2023

Conversation

stevemk14ebr
Copy link
Contributor

@stevemk14ebr stevemk14ebr commented May 3, 2023

These changed were discussed with @mike-hunhoff

Previously the CapaRuleGenFeatureCache accepted a list of function handles and then internally invoked _find_function_and_below_features on that list so that later calls such as get_all_function_features could read out cached features. Unfortunately, knowing the list of functions you want features out of is not always easy to know before hand (input is from UI, exploring recursively, etc), and in those cases you have to construct a new CapaRuleGenFeatureCache each time per function you'd like features from. This was a problem, because _find_global_features and _find_file_features() was also called in the cache constructor - with _find_file_features() in particular having non-trivial overhead. When you must continously extract features out of many functions the combined overhead of calling _find_file_features in the constructor and throwing away the function cache multiple times generates significant overhead in some applications.

The interface has therefore been changed so that functions such as get_all_function_features now use a helper _get_cached_func_node which will return cached function results if they exist OR populate the cache for that function at call time. A user of this interface can now construct CapaRuleGenFeatureCache one time at a higher global scope, where expensive _find_global_features and _find_file_features() will occur and call get_all_function_features as needed later.

This simplifies the tracking of the extractor and cache states as well in the form.py since now there is only one cache instance and it's re-used everywhere.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

@github-actions github-actions bot dismissed their stale review May 3, 2023 17:48

CHANGELOG updated or no update needed, thanks! 😄

Copy link
Collaborator

@mike-hunhoff mike-hunhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @stevemk14ebr - good improvements here! I've left comments for your review.

capa/ida/plugin/form.py Outdated Show resolved Hide resolved
capa/ida/plugin/form.py Show resolved Hide resolved
capa/ida/plugin/form.py Outdated Show resolved Hide resolved
@stevemk14ebr
Copy link
Contributor Author

stevemk14ebr commented May 16, 2023

By delaying construction of the rulegen extractor and cache the one time file analysis is performed everytime the analyze button is clicked by the user in the rulegen dialog. This is moderately slow. Agree that this saves the user from the file analysis overhead if they don't use the rulegen feature, but it does slow down the rulegen feature for every new function it is used for.

@stevemk14ebr
Copy link
Contributor Author

This new commit fixes the issue mentioned above. I construct specifically just the rulegen extractor and cache when the tab is changed to the rulegen feature. This makes analyze on each new feature fast for the rule generator.

@mike-hunhoff
Copy link
Collaborator

By delaying construction of the rulegen extractor and cache the one time file analysis is performed everytime the analyze button is clicked by the user in the rulegen dialog. This is moderately slow. Agree that this saves the user from the file analysis overhead if they don't use the rulegen feature, but it does slow down the rulegen feature for every new function it is used for.

Totally agree with the points you make here. My intent is for the initialization of the rulegen extractor and cache to only execute once at the beginning of load_capa_function_results, similar to how we handle the cached capa rule set:

def load_capa_function_results(self):
""" """
if self.rulegen_ruleset_cache is None:
# only reload rules if cache is empty
self.rulegen_ruleset_cache = self.load_capa_rules()
else:
logger.info("Using cached capa rules, click Clear to load rules from disk.")

So from a UX perspective the one-time final analysis kicks off the first time a user clicks the Anlayze button in the Rule Generator tab. This overhead then doesn't exist for subsequent clicks of the Anlayze button in the Rule Generator tab.

What are you thoughts @stevemk14ebr ?

@stevemk14ebr
Copy link
Contributor Author

Ah yes I see what you want. Done in the latest commit!

Copy link
Collaborator

@mike-hunhoff mike-hunhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you for the updates @stevemk14ebr ! Requested minor changes. Also, looks like formatting is failing. Once you're done w/ the requested changes please run the following:

$ python -m black -l 120 --extend-exclude ".*_pb2.py" --check .

Remove --check to format identified files.

capa/ida/plugin/form.py Outdated Show resolved Hide resolved
capa/ida/plugin/form.py Outdated Show resolved Hide resolved
capa/ida/plugin/form.py Outdated Show resolved Hide resolved
@mike-hunhoff
Copy link
Collaborator

@stevemk14ebr recent changes look good but mypy identified some issues. Please run the following command locally to identify and address the issues:

mypy --config-file .github/mypy/mypy.ini --check-untyped-defs capa/ scripts/ tests/

@stevemk14ebr
Copy link
Contributor Author

done.

@mike-hunhoff
Copy link
Collaborator

@stevemk14ebr our code style checks appear to be failing as well. Please run the following command locally to identify and address the issues:

python -m black -l 120 --extend-exclude ".*_pb2.py" --check .

Copy link
Collaborator

@mike-hunhoff mike-hunhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, thank you @stevemk14ebr !

@mike-hunhoff mike-hunhoff merged commit 7ef78fd into mandiant:master Jun 13, 2023
@stevemk14ebr stevemk14ebr deleted the cache_optimizations branch June 13, 2023 18:12
yelhamer pushed a commit to yelhamer/capa that referenced this pull request Jun 14, 2023
* Optimize cache and extractor interface

* Update changelog

* Run linter formatters

* Implement review feedback

* Move rulegen extractor construction to tab change

* Change rulegen cache construction behavior

* Adjust return values for CR, format

* Fix mypy errors

* Format

* Fix merge

---------

Co-authored-by: Stephen Eckels <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants