Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More fixes and improvements to pattern matching #64

Merged
merged 8 commits into from
Aug 19, 2024
Merged

More fixes and improvements to pattern matching #64

merged 8 commits into from
Aug 19, 2024

Conversation

ViRb3
Copy link
Contributor

@ViRb3 ViRb3 commented Aug 17, 2024

  • Support pattern negated byte

This was a very simple one.

  • Optimize pattern sub-matches

With the current implementation of needle search + "truncate right" to handle sub-matches, we end up re-scanning the same regions multiple times. In some cases, this is negligible, in others, it's really bad. There's probably a better way to handle this, but to fix the most basic cases, we now cache each region (start + end address), and skip regex matching if the exact same address was processed before.

  • Fix one-off pattern mismatch

This prevented one of my test cases to match.

  • Return end index

Changes the matching function's signature to also return end indexes, in preparation for unit tests.

  • Deduplicate and sort results

This is workaround for the 2nd issue above.

@stevemk14ebr
Copy link
Collaborator

I am working on reviewing this, I don't quite understand why we need the caches, they come with a very high overhead if the needle is common and the region being scanned contains the needle a lot - we already have OOM issues on large samples so I am weary of including that particular commit TBH.

Our needle scan shouldn't really return duplicate regions (I could be missing something). Let's say the needle is a simple AA BB and the mem is FF AA 00 BB DD AA BB 00 AA BB 11 we'd get needMatches at 5 and 8. Then lets says the regex the needle was picked from is [0-2] AA BB 00 we'd scan start at the most pessimistic -2 == needleOffset or BB DD AA BB 00 for the first range and BB 00 AA BB 11 for the second range. I can't see how these would every land on the exact same start since our searches start from multiple needle locations.

@stevemk14ebr stevemk14ebr merged commit d7d9a98 into mandiant:master Aug 19, 2024
2 checks passed
@stevemk14ebr
Copy link
Collaborator

stevemk14ebr commented Aug 19, 2024

I've reverted the cache for now, I would consider adding it back if you can prove to me (ideally via a unit test in pattern_test.go) that it is necessary. Right now I cannot justify the memory overhead it introduces and it doesn't appear to me it's necessary for the kind of patterns used by GoReSym itself.

Thanks for the continued improvements!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants