Optimize cache and extractor interface #1470

stevemk14ebr · 2023-05-03T17:44:52Z

These changes were discussed with @mike-hunhoff

Previously, CapaRuleGenFeatureCache accepted a list of function handles and internally invoked _find_function_and_below_features on that list so that later calls such as get_all_function_features could read out cached features. Unfortunately, the list of functions you want features from is not always known beforehand (input comes from the UI, functions are explored recursively, etc.), and in those cases you have to construct a new CapaRuleGenFeatureCache for each function you'd like features from. This was a problem because _find_global_features and _find_file_features() were also called in the cache constructor, with _find_file_features() in particular having non-trivial overhead. When you must continuously extract features from many functions, repeatedly calling _find_file_features in the constructor and throwing away the function cache adds up to significant overhead in some applications.

The interface has therefore been changed so that functions such as get_all_function_features now use a helper, _get_cached_func_node, which returns cached function results if they exist or populates the cache for that function at call time. A user of this interface can now construct CapaRuleGenFeatureCache once at a higher (global) scope, where the expensive _find_global_features and _find_file_features() calls run, and then call get_all_function_features as needed later.

This also simplifies tracking of the extractor and cache state in form.py, since there is now only one cache instance and it is reused everywhere.
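As a rough illustration of the populate-on-demand behavior described above, here is a minimal sketch. It is not capa's actual implementation: the class name LazyFeatureCache, the injected extractor callables, and the use of plain integer addresses as function handles are assumptions made for this example. Only the method names _get_cached_func_node and get_all_function_features come from the description.

```python
from typing import Callable, Dict, List


class LazyFeatureCache:
    """Hypothetical stand-in for CapaRuleGenFeatureCache (illustration only)."""

    def __init__(
        self,
        extract_file_features: Callable[[], List[str]],
        extract_function_features: Callable[[int], List[str]],
    ):
        # Expensive, file-wide work happens exactly once, at construction.
        self.file_features = extract_file_features()
        self._extract_function_features = extract_function_features
        self._func_nodes: Dict[int, List[str]] = {}

    def _get_cached_func_node(self, func_addr: int) -> List[str]:
        # Return the cached result if present, otherwise populate it now.
        if func_addr not in self._func_nodes:
            self._func_nodes[func_addr] = self._extract_function_features(func_addr)
        return self._func_nodes[func_addr]

    def get_all_function_features(self, func_addr: int) -> List[str]:
        # Copy so callers cannot mutate the cached list.
        return list(self._get_cached_func_node(func_addr))


# Usage with fake extractors that count how often they are invoked.
calls = {"file": 0, "func": 0}


def fake_file_features() -> List[str]:
    calls["file"] += 1
    return ["file-feature"]


def fake_function_features(addr: int) -> List[str]:
    calls["func"] += 1
    return [f"feature@{addr:#x}"]


cache = LazyFeatureCache(fake_file_features, fake_function_features)
cache.get_all_function_features(0x401000)
cache.get_all_function_features(0x401000)  # served from the cache
cache.get_all_function_features(0x402000)
```

With this shape, the file-level extractor runs once no matter how many functions are queried, and each function is analyzed at most once.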

github-actions

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

CHANGELOG updated or no update needed, thanks! 😄

mike-hunhoff

Thank you @stevemk14ebr - good improvements here! I've left comments for your review.

capa/ida/plugin/form.py

stevemk14ebr · 2023-05-16T14:41:59Z

By delaying construction of the rulegen extractor and cache, the one-time file analysis is performed every time the user clicks the Analyze button in the rulegen dialog. This is moderately slow. I agree that this saves the user from the file analysis overhead if they don't use the rulegen feature, but it does slow down the rulegen feature for every new function it is used for.

stevemk14ebr · 2023-05-16T14:49:27Z

This new commit fixes the issue mentioned above. I now construct only the rulegen extractor and cache when the tab is changed to the rulegen feature. This makes Analyze fast for each new function in the rule generator.

mike-hunhoff · 2023-05-24T16:16:28Z

> By delaying construction of the rulegen extractor and cache, the one-time file analysis is performed every time the user clicks the Analyze button in the rulegen dialog. This is moderately slow. I agree that this saves the user from the file analysis overhead if they don't use the rulegen feature, but it does slow down the rulegen feature for every new function it is used for.

Totally agree with the points you make here. My intent is for the initialization of the rulegen extractor and cache to only execute once at the beginning of load_capa_function_results, similar to how we handle the cached capa rule set:

capa/capa/ida/plugin/form.py

Lines 962 to 968 in 332853e

    
    def load_capa_function_results(self):
        """ """
        if self.rulegen_ruleset_cache is None:
            # only reload rules if cache is empty
            self.rulegen_ruleset_cache = self.load_capa_rules()
        else:
            logger.info("Using cached capa rules, click Clear to load rules from disk.")

So from a UX perspective, the one-time file analysis kicks off the first time a user clicks the Analyze button in the Rule Generator tab. This overhead then doesn't exist for subsequent clicks of the Analyze button in the Rule Generator tab.

What are your thoughts, @stevemk14ebr?
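The suggestion above amounts to guarding the expensive construction with a None check, the same way the rule set cache is handled. A minimal sketch, assuming a hypothetical RuleGenForm class, an injected build_cache callable, a dict standing in for the feature cache, and an attribute named rulegen_feature_cache (none of these are confirmed by this thread):

```python
from typing import Callable, Dict, List, Optional


class RuleGenForm:
    """Hypothetical form; only load_capa_function_results is named in the thread."""

    def __init__(self, build_cache: Callable[[], Dict[int, List[str]]]):
        # Callable that performs the one-time file analysis (assumed for this sketch).
        self._build_cache = build_cache
        self.rulegen_feature_cache: Optional[Dict[int, List[str]]] = None

    def load_capa_function_results(self, func_addr: int) -> List[str]:
        if self.rulegen_feature_cache is None:
            # First Analyze click: pay the file-analysis cost once.
            self.rulegen_feature_cache = self._build_cache()
        # Later clicks reuse the same cache instance.
        return self.rulegen_feature_cache.get(func_addr, [])


# Usage: count how many times the expensive construction runs.
builds = {"count": 0}


def fake_build_cache() -> Dict[int, List[str]]:
    builds["count"] += 1
    return {0x401000: ["api(CreateFileA)"]}


form = RuleGenForm(fake_build_cache)
first = form.load_capa_function_results(0x401000)   # triggers construction
second = form.load_capa_function_results(0x402000)  # reuses the cache
```

Users who never open the Rule Generator tab never trigger the construction, and users who do pay for it only on the first click.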

stevemk14ebr · 2023-05-26T14:18:24Z

Ah yes I see what you want. Done in the latest commit!

mike-hunhoff

Great, thank you for the updates @stevemk14ebr ! Requested minor changes. Also, looks like formatting is failing. Once you're done w/ the requested changes please run the following:

$ python -m black -l 120 --extend-exclude ".*_pb2.py" --check .

Remove --check to format identified files.

mike-hunhoff · 2023-06-07T14:52:40Z

@stevemk14ebr recent changes look good but mypy identified some issues. Please run the following command locally to identify and address the issues:

mypy --config-file .github/mypy/mypy.ini --check-untyped-defs capa/ scripts/ tests/

stevemk14ebr · 2023-06-07T15:03:05Z

done.

mike-hunhoff · 2023-06-13T16:18:02Z

@stevemk14ebr our code style checks appear to be failing as well. Please run the following command locally to identify and address the issues:

python -m black -l 120 --extend-exclude ".*_pb2.py" --check .

mike-hunhoff

great, thank you @stevemk14ebr !

* Optimize cache and extractor interface
* Update changelog
* Run linter formatters
* Implement review feedback
* Move rulegen extractor construction to tab change
* Change rulegen cache construction behavior
* Adjust return values for CR, format
* Fix mypy errors
* Format
* Fix merge

---------

Co-authored-by: Stephen Eckels <[email protected]>

Optimize cache and extractor interface

a9f9859

github-actions bot previously requested changes May 3, 2023

Update changelog

d7a594c

Run linter formatters

1f8319d

mike-hunhoff requested changes May 9, 2023

Implement review feedback

9f29682

Move rulegen extractor construction to tab change

332853e

Change rulegen cache construction behavior

1e02f3a

Merge branch 'master' into cache_optimizations

4d901ba

mike-hunhoff requested changes Jun 6, 2023

Stephen Eckels and others added 3 commits June 6, 2023 11:24

Adjust return values for CR, format

e7e0571

Merge branch 'cache_optimizations' of https://github.com/stevemk14ebr…

bf2a4b3

…/capa into cache_optimizations

Merge branch 'master' into cache_optimizations

685da96

Stephen Eckels added 2 commits June 7, 2023 11:02

Fix mypy errors

ca50708

Merge branch 'cache_optimizations' of https://github.com/stevemk14ebr…

26740ec

…/capa into cache_optimizations

Merge branch 'master' into cache_optimizations

091e335

Stephen Eckels and others added 5 commits June 13, 2023 12:25

Format

62ddc15

Merge branch 'cache_optimizations' of https://github.com/stevemk14ebr…

f82a8a3

…/capa into cache_optimizations

Merge branch 'master' into cache_optimizations

44bc27d

Fix merge

0a9f680

Merge branch 'cache_optimizations' of https://github.com/stevemk14ebr…

51badae

…/capa into cache_optimizations

mike-hunhoff approved these changes Jun 13, 2023

mike-hunhoff merged commit 7ef78fd into mandiant:master Jun 13, 2023

stevemk14ebr deleted the cache_optimizations branch June 13, 2023 18:12
