new feature: function call arguments #771

Open
mr-tz opened this issue Sep 10, 2021 · 19 comments · Fixed by #1678
Labels
breaking-change introduces a breaking change that should be released in a major version enhancement New feature or request

Comments

@mr-tz
Collaborator

mr-tz commented Sep 10, 2021

Summary

Can we create a way to associate function arguments (mostly for numbers and strings) with calls to known functions?

Possible syntax:

- call:
  - number: 4
  - api: CreateProcess

See discussion in #921 around syntax.

This is easier to understand by humans and we can be a little smarter in the analysis phase.

We should restrict this feature to analysis engines/formats/runtimes for which we can reliably extract the arguments (like .NET). Then, when it's working well, we can try to backport to other engines/formats/runtimes (like x86). TBD if this sort of analysis is expected by all backends, e.g. SMDA.
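
To illustrate why a call scope is tighter than the basic block grouping used today, here is a minimal Python sketch. The `Call` class and both matcher functions are hypothetical names invented for this illustration; they are not part of capa.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """One call site: the API it targets and the constant arguments recovered for it."""

    api: str
    numbers: set[int] = field(default_factory=set)


def match_basic_block(calls: list[Call], api: str, number: int) -> bool:
    # Basic-block scope: the API and the number only need to co-occur somewhere in the block.
    return any(c.api == api for c in calls) and any(number in c.numbers for c in calls)


def match_call(calls: list[Call], api: str, number: int) -> bool:
    # Call scope: the number must be an argument of the very call that targets the API.
    return any(c.api == api and number in c.numbers for c in calls)


# Hypothetical basic block: 0x40A is passed to PostMessage, not to SendMessage.
block = [Call("SendMessage", {0x10}), Call("PostMessage", {0x40A})]

print(match_basic_block(block, "SendMessage", 0x40A))  # True (a false positive)
print(match_call(block, "SendMessage", 0x40A))         # False
```

The sketch assumes the backend can already attribute constants to individual call sites, which is exactly the part that is hard on x86 and easier on .NET.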

Motivation

Looking for examples for #767 reminded me of the other most common use case for basic block subscopes...

Grouping function calls and their arguments, like

      - basic block:
        - and:
          - api: kernel32.QueryInformationJobObject
          - number: 0x3 = JobObjectBasicProcessIdList

or

        - basic block:
          - and:
            - api: SendMessage
            - number: 0x40a = WM_CAP_DRIVER_CONNECT
@Ana06
Member

Ana06 commented Mar 22, 2022

Concerns (from last meeting):

  • parameters with or
  • bitfield, for example for CreateFile
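
One possible reading of these two concerns, sketched in Python: a bitfield argument needs a "these bits are set" test rather than equality, and an "or" needs a set of acceptable values. The helper names are hypothetical; the constants are the documented Windows values for `CreateFile`.

```python
# Documented Windows constants for CreateFile.
GENERIC_READ = 0x80000000
GENERIC_WRITE = 0x40000000
CREATE_ALWAYS = 2
OPEN_ALWAYS = 4


def has_flags(value: int, flags: int) -> bool:
    """True when every bit in `flags` is set in `value` (the bitfield case)."""
    return (value & flags) == flags


def any_of(value: int, candidates: list[int]) -> bool:
    """True when `value` equals one of the candidates (the "parameters with or" case)."""
    return value in candidates


# dwDesiredAccess as it might be observed at a call site.
desired_access = GENERIC_READ | GENERIC_WRITE
print(has_flags(desired_access, GENERIC_WRITE))       # True, although the raw value differs
print(any_of(OPEN_ALWAYS, [CREATE_ALWAYS, OPEN_ALWAYS]))  # True
```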

@williballenthin
Collaborator

williballenthin commented Mar 22, 2022

when referring to an argument, we should be able to refer to its specific index. we should also try to associate the argument with its declared name. so like:

api: CreateFileA
    arg[0]: "foo.exe"

and

api: CreateFileA
    lpName: "foo.exe"

how do we maintain these mappings? we'd need a database of APIs and their canonical argument names (ideally should match MSDN (windows) and man pages (posix)).

for MSDN, we should consider extracting the info we need from M$ provided winmd files: https://github.com/microsoft/win32metadata
alternatives might include using viv's API database or extract one from some sandbox, etc. but the winmd approach is "blessed" and supported.

we should push to have vivisect/vivisect#213 updated and merged.
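
A minimal sketch of what such a mapping could look like, so that both `arg[0]` and a declared name resolve to the same index. The table is a hand-written excerpt for illustration only (MSDN names the first parameter of `CreateFileA` `lpFileName`); `API_PARAMS` and `resolve_index` are hypothetical names, and a real table would presumably be generated, for example from the winmd files.

```python
# Hand-written excerpt for illustration; a real table would be generated from an API database.
API_PARAMS: dict[str, list[str]] = {
    "CreateFileA": [
        "lpFileName",
        "dwDesiredAccess",
        "dwShareMode",
        "lpSecurityAttributes",
        "dwCreationDisposition",
        "dwFlagsAndAttributes",
        "hTemplateFile",
    ],
}


def resolve_index(api: str, key: str) -> int:
    """Resolve 'arg[N]' or a declared parameter name to a zero-based argument index."""
    if key.startswith("arg[") and key.endswith("]"):
        return int(key[4:-1])
    try:
        return API_PARAMS[api].index(key)
    except (KeyError, ValueError):
        raise KeyError(f"unknown argument {key!r} for {api}") from None


print(resolve_index("CreateFileA", "arg[0]"))      # 0
print(resolve_index("CreateFileA", "lpFileName"))  # 0
```

With this shape, the matching engine only ever deals with indices, and names are a convenience for rule authors.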

@williballenthin
Collaborator

williballenthin commented Mar 22, 2022

we'll need to figure out how to handle a subset of types commonly used for arguments, like pointers to strings.

does specifying a value as a string, like lpName: "foo.exe" imply the argument is a string (either ASCII or utf-16) and instruct the matching engine to resolve the data? and/or does the engine use an API database to determine the types of arguments ahead of time?

we should probably not go too far down this rabbit hole; handling structures is likely out of scope.

do we support regex against strings?
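
One way the "a string value implies a string argument" option could work, as a hedged Python sketch: when the rule gives a string, the engine reads the bytes the pointer refers to, tries both ASCII and UTF-16LE, and matches a regex against each candidate. The function names and the flat `bytes` buffer standing in for program memory are assumptions made for this example.

```python
import re


def candidate_strings(data: bytes, offset: int, max_len: int = 256) -> list[str]:
    """Return the strings found at `offset` when decoded as ASCII and as UTF-16LE."""
    candidates = []
    chunk = data[offset:offset + max_len]

    # ASCII: everything up to the first NUL byte.
    ascii_end = chunk.find(b"\x00")
    if ascii_end > 0:
        try:
            candidates.append(chunk[:ascii_end].decode("ascii"))
        except UnicodeDecodeError:
            pass

    # UTF-16LE: everything up to the first two-byte NUL on an even boundary.
    for i in range(0, len(chunk) - 1, 2):
        if chunk[i:i + 2] == b"\x00\x00":
            if i > 0:
                try:
                    candidates.append(chunk[:i].decode("utf-16-le"))
                except UnicodeDecodeError:
                    pass
            break
    return candidates


def string_argument_matches(data: bytes, offset: int, pattern: str) -> bool:
    """True when any decoding of the pointed-to data matches the regex `pattern`."""
    return any(re.search(pattern, s) for s in candidate_strings(data, offset))


wide = "foo.exe".encode("utf-16-le") + b"\x00\x00"
print(string_argument_matches(wide, 0, r"^foo\.exe$"))  # True
```

Trying both encodings avoids needing a type database up front, at the cost of occasional spurious short candidates; consulting an API database for the declared type would be the stricter alternative.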

@williballenthin williballenthin changed the title New scope: call new feature: function call arguments Mar 22, 2022
@williballenthin
Collaborator

williballenthin commented Mar 22, 2022

thought: if we migrate most of our rules to use this feature, then we could probably natively support decompiler backends, like ghidra and hex-rays.

we should consider the fragmentation of our analysis backends though. how do we handle the scenario when some backends do/n't support various features? we already almost see this with SMDA versus viv wrt FLIRT support.

@williballenthin williballenthin added enhancement New feature or request breaking-change introduces a breaking change that should be released in a major version labels Mar 31, 2022
@williballenthin
Collaborator

we could add this as part of capa 4.0 (probably introduces insn scope) or defer for 5.0+ as this will be a breaking change to rule syntax.

@williballenthin
Collaborator

williballenthin commented Mar 31, 2022

via #930 (comment) and above

probably want to support at least the following "types":

- operand[{0,1,n}].number: ...
- operand[{0,1,n}].string: ...
- operand[{0,1,n}].substring: ...
- operand[{0,1,n}].bytes: ...
- operand[{0,1,n}].flag: ...
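
A small Python sketch of how keys of this shape might be parsed and compared. The matching semantics per type (equality, containment, prefix, bit test) are assumptions for illustration; the thread does not define them.

```python
import re

FEATURE_TYPES = {"number", "string", "substring", "bytes", "flag"}
KEY_RE = re.compile(r"^operand\[(\d+)\]\.(\w+)$")


def parse_operand_key(key: str) -> tuple[int, str]:
    """Split 'operand[N].type' into (N, type), rejecting unknown types."""
    m = KEY_RE.match(key)
    if m is None:
        raise ValueError(f"malformed operand key: {key!r}")
    index, kind = int(m.group(1)), m.group(2)
    if kind not in FEATURE_TYPES:
        raise ValueError(f"unsupported operand type: {kind!r}")
    return index, kind


def feature_matches(kind: str, expected, actual) -> bool:
    """Compare an extracted operand value against the rule value (assumed semantics)."""
    if kind in ("number", "string"):
        return actual == expected
    if kind == "substring":
        return expected in actual
    if kind == "bytes":
        return actual.startswith(expected)
    if kind == "flag":
        return (actual & expected) == expected
    raise ValueError(f"unsupported operand type: {kind!r}")


print(parse_operand_key("operand[1].number"))  # (1, 'number')
print(feature_matches("flag", 0x4, 0x6))       # True
```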

@williballenthin
Collaborator

master's thesis https://www.ru.nl/publish/pages/769526/joren_vrancken.pdf by @joren485 describes an IDA/Hex-Rays plugin that uses call-scope features to identify capabilities. they have good success, demonstrating that this is probably a useful addition to capa.

notably they use Hex-Rays decompilation as the source of their features.

@yelhamer
Collaborator

one suggestion for this feature's syntax would be to use a format similar to the strace and ltrace utilities on Linux. Example:

- api: CreateThread(lpThreadAttributes=0x0, dwStackSize=, lpStartAddress=, lpParameter=, dwCreationFlags=0x4, lpThreadID=)

or maybe:

- api: CreateThread(lpThreadAttributes=0x0, dwCreationFlags=0x4) # match just these two arguments

we can also specify return values in this syntax similar to strace/ltrace:

- api: IsDebuggerPresent() == 0

the downsides to this approach are:

  • it seems a bit more cluttered as opposed to the call scope, which I think looks pretty elegant compared to this approach.
  • we would need to find an efficient way to extract the api names and arguments, since otherwise this could introduce performance issues given the large number of api calls that are usually made by a sample.

upsides of this approach:

  • it would make the feature easily sharable between dynamic and static flavors, and should make writing rules that work both statically and dynamically easier.
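
To make the strace-like proposal concrete, here is a rough Python sketch of a parser for it. It only handles numeric values, treats an empty value as "match anything", and supports the optional `== value` return check; the function name and the returned tuple layout are assumptions, not an agreed design.

```python
import re

CALL_RE = re.compile(r"^(?P<api>\w+)\((?P<args>.*)\)(?:\s*==\s*(?P<ret>\S+))?$")


def parse_call(text: str):
    """Parse 'Api(name=value, ...) [== ret]' into (api, {name: int or None}, ret or None)."""
    m = CALL_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a call expression: {text!r}")

    args = {}
    for part in (p.strip() for p in m.group("args").split(",")):
        if not part:
            continue
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got {part!r}")
        value = value.strip()
        # An empty value (e.g. 'dwStackSize=') is treated as a wildcard.
        args[name.strip()] = int(value, 0) if value else None

    ret = m.group("ret")
    return m.group("api"), args, (int(ret, 0) if ret is not None else None)


print(parse_call("CreateThread(lpThreadAttributes=0x0, dwCreationFlags=0x4)"))
print(parse_call("IsDebuggerPresent() == 0"))
```

String arguments containing commas or parentheses would need a real tokenizer; the comma split above is deliberately naive.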

@williballenthin
Collaborator

williballenthin commented Jun 15, 2023

api: CreateThread(lpThreadAttributes=0x0, dwCreationFlags=0x4)

i do like some aspects of this syntax, particularly that it's very human readable. human readability has always been a big goal for capa rule syntax. if we ultimately pick another solution, perhaps we can still support a shorthand like this, since it's probably sufficient for many rules.

some additional considerations:

  • cannot express logic for the arguments, such as this OR that. but i think it's on us to demonstrate if this would be used often. i think maybe it might for bitfield/enum arguments.
  • have to develop a parser for this rule syntax, and also find a way to show the user what went wrong when a rule is invalid
  • how to specify interpretation of the arguments, like 0x4 = CREATE_SUSPENDED? maybe like dwCreationFlags=0x4 (CREATE_SUSPENDED) or something?
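
The last two points can be sketched together in Python: a single `name=value (LABEL)` argument is parsed, and a malformed one produces an error that names the offending argument. `RuleSyntaxError`, `parse_argument`, and the parenthesized label form are hypothetical, following the `dwCreationFlags=0x4 (CREATE_SUSPENDED)` idea above.

```python
import re
from typing import Optional

VALUE_RE = re.compile(r"^(?P<value>0[xX][0-9a-fA-F]+|\d+)(?:\s*\((?P<label>\w+)\))?$")


class RuleSyntaxError(ValueError):
    """Raised with a message meant to be shown to the rule author."""


def parse_argument(text: str) -> tuple[str, int, Optional[str]]:
    """Parse 'name=value' or 'name=value (LABEL)' into (name, value, label)."""
    name, sep, rest = text.partition("=")
    name, rest = name.strip(), rest.strip()
    if not sep or not name:
        raise RuleSyntaxError(f"expected 'name=value' in {text!r}")

    m = VALUE_RE.match(rest)
    if m is None:
        raise RuleSyntaxError(
            f"argument {name!r}: cannot parse {rest!r}; "
            "expected a number with an optional '(LABEL)'"
        )

    raw = m.group("value")
    value = int(raw, 16) if raw.lower().startswith("0x") else int(raw)
    return name, value, m.group("label")


print(parse_argument("dwCreationFlags=0x4 (CREATE_SUSPENDED)"))
```

The label is carried along purely as documentation here; whether the engine should also validate it against a constants database is left open.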

@0x534a

0x534a commented Jun 20, 2023

how do we maintain these mappings? we'd need a database of APIs and their canonical argument names (ideally should match MSDN (windows) and man pages (posix)).

If you are interested and if this is still relevant, I can provide an SQLite database containing API call definitions for Windows including their argument names. I scraped this information from the MSDN Offline Library 2009 back in 2019. So, the data is not the newest but should include the most relevant API calls.

However, this is an important point and should not be underestimated. The API traces differ greatly in terms of conformance to the MSDN. Based on my experience so far, CAPE has its own naming for arguments and the conformance is not the best. VMRay does a better job but I can fully understand that you chose CAPE since it is open source and there is a large data set of API traces available. The example shown below illustrates the differences in terms of conformance. Please note that the two traces do not originate from the same sample.

CAPE (Sample 17beca96e3a7474622f5b23ff015c8783c0868a070cc5331db622de9b78dd45e from the avast repo):

{
    "timestamp": "2021-06-03 21:57:55,843",
    "thread_id": "1688",
    "caller": "0x743c1321",
    "parentcaller": "0x743c13c9",
    "category": "registry",
    "api": "RegOpenKeyExW",
    "status": true,
    "return": "0x00000000",
    "arguments": [
        {
            "name": "Registry",
            "value": "0x80000002",
            "pretty_value": "HKEY_LOCAL_MACHINE"
        },
        {
            "name": "SubKey",
            "value": "system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder"
        },
        {
            "name": "Handle",
            "value": "0x000000e8"
        },
        {
            "name": "FullName",
            "value": "HKEY_LOCAL_MACHINE\\system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder"
        }
    ],
    "repeated": 0,
    "id": 39
}

VMRay (Sample c0832b1008aa0fc828654f9762e37bda019080cbdd92bd2453a05cfb3b79abb3):

[0076.435] RegOpenKeyExW (in: hKey=0x80000001, lpSubKey="Software\\Microsoft\\Windows\\CurrentVersion\\Run", ulOptions=0x0, samDesired=0xf003f, phkResult=0x18ea40 | out: phkResult=0x18ea40*=0x4f0) returned 0x0
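
For comparison with the CAPE JSON, a rough Python sketch that pulls the input arguments out of the VMRay line above. The regexes are inferred from this single log line only, so the format assumptions (timestamp in brackets, an `in:` part followed by `| out:`) may not hold for other entries.

```python
import re

# Patterns inferred from the one VMRay line quoted in this thread.
LINE_RE = re.compile(r"^\[[\d.]+\]\s+(?P<api>\w+)\s+\(in:\s*(?P<inputs>.*?)\s*\|\s*out:")
ARG_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,\s]+)')


def parse_vmray_line(line: str) -> tuple[str, list[tuple[str, str]]]:
    """Return (api name, [(argument name, raw value), ...]) for the input arguments."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("unrecognized log line")
    return m.group("api"), ARG_RE.findall(m.group("inputs"))


LINE = (
    r'[0076.435] RegOpenKeyExW (in: hKey=0x80000001, '
    r'lpSubKey="Software\\Microsoft\\Windows\\CurrentVersion\\Run", ulOptions=0x0, '
    r'samDesired=0xf003f, phkResult=0x18ea40 | out: phkResult=0x18ea40*=0x4f0) returned 0x0'
)

api, args = parse_vmray_line(LINE)
print(api)
print([name for name, _ in args])
```

In this line the argument names and their order follow the MSDN prototype of `RegOpenKeyExW`, so the list position can serve directly as the argument index.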

@mr-tz
Collaborator Author

mr-tz commented Jun 21, 2023

Ouh, that seems like a very important point.

As a rule author I'd like to specify the name instead of a number (which name though? likely the one the sandbox uses which could be different as shown above OR the name from the MSDN documentation).

To match features (using multiple sandboxes) we'd want to focus on the arguments by number (mapped from the name).

So, for now it may be easiest to just use numbered arguments? And then add our own mapping later, potentially based on @0x534a's data.

@williballenthin
Collaborator

note that in the example above from @0x534a, the two sandboxes don't even recover the same number of arguments 🤦🏼

i guess each sandbox needs a database to map argument names back to argument indices. then capa can work with raw indices. capa can optionally also provide its own database of argument index <-> argument name to make rules more readable, such as the one that @0x534a offers.

maintaining these databases will be a bit tedious, but im not sure how we can get around it. i suppose once they're built and tested, updates shouldn't often be needed unless the sandboxes change.


we'll have to inspect the types of data emitted by the sandboxes for the arguments as well. i suspect there'll be some cases where one sandbox resolves a handle into some string (e.g., path) and another sandbox just gives the handle value. fun.
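
A minimal sketch of the per-sandbox database idea, using the CAPE record quoted earlier in this thread. The CAPE-to-index table is an assumption inferred from that single trace (in particular that `Handle` corresponds to `phkResult` and that `FullName` is derived rather than passed); `normalize_cape_call` is a hypothetical helper.

```python
from typing import Optional

# MSDN parameter order for RegOpenKeyExW.
MSDN_PARAMS = {"RegOpenKeyExW": ["hKey", "lpSubKey", "ulOptions", "samDesired", "phkResult"]}

# Assumed mapping from CAPE argument names to MSDN indices, inferred from one trace.
# None marks fields that do not correspond to a real argument.
CAPE_TO_INDEX: dict[str, dict[str, Optional[int]]] = {
    "RegOpenKeyExW": {"Registry": 0, "SubKey": 1, "Handle": 4, "FullName": None},
}


def normalize_cape_call(call: dict) -> dict[int, str]:
    """Map a CAPE call record to {argument index: raw value}, dropping unmapped fields."""
    mapping = CAPE_TO_INDEX[call["api"]]
    normalized = {}
    for arg in call["arguments"]:
        index = mapping.get(arg["name"])
        if index is not None:
            normalized[index] = arg["value"]
    return normalized


record = {
    "api": "RegOpenKeyExW",
    "arguments": [
        {"name": "Registry", "value": "0x80000002"},
        {"name": "SubKey", "value": "system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder"},
        {"name": "Handle", "value": "0x000000e8"},
        {"name": "FullName", "value": "HKEY_LOCAL_MACHINE\\system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder"},
    ],
}

for index, value in sorted(normalize_cape_call(record).items()):
    print(index, MSDN_PARAMS["RegOpenKeyExW"][index], value)
```

Indices 2 and 3 (`ulOptions`, `samDesired`) are simply absent for this sandbox, so a rule that constrains them could not match on CAPE traces.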

@yelhamer
Collaborator

yelhamer commented Jul 3, 2023

regarding the different number of arguments for RegOpenKeyExW, it seems like that's how CAPE was programmed to handle that:

image

If we're going to create and maintain a mapping from CAPE argument names into msdn naming, then I propose we reach out to the CAPE team and see if we could work on updating the CAPE argument names into the msdn format there.

alternatively, perhaps we could add a modifier to the arguments feature to specify which calling convention the rule author has in mind? so something like this:

- call:
  - api: RegOpenKeyExW
  - arguments/cape:
    Registry: HKEY_LOCAL_MACHINE
    SubKey: system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder    

and maybe consequently this?

- call:
  - api: RegOpenKeyExW
  - or:
    - arguments/cape:
        Registry: HKEY_LOCAL_MACHINE
        SubKey: system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder
    - arguments/msdn:
        hKey: 0x80000001
        lpSubKey: Software\\Microsoft\\Windows\\CurrentVersion\\Run

@mr-tz
Collaborator Author

mr-tz commented Jul 3, 2023

we reach out to the CAPE team and see if we could work on updating the CAPE argument names into the msdn format there

+1 on that idea

I'm not a fan of the sandbox specific arguments. I think it would make rule writing and our code more complex and complicated than desired.

@kevoreilly

I am all for updating the argument names to MSDN format within CAPE 👍

@kevoreilly

kevoreilly commented Jul 5, 2023

It might be worth noting that CAPE sometimes enriches the output by adding fields that are technically not API arguments.

For example, the output from the NtReadFile hook includes the file path but this is not included in the arguments, rather is obtained by the hook from the handle argument.

@mr-tz
Collaborator Author

mr-tz commented Jul 5, 2023

@0x534a, would you mind sharing your database? This could help to get the names updated in CAPE.

@0x534a

0x534a commented Jul 5, 2023

I am all for updating the argument names to MSDN format within CAPE 👍

Yeah, that's pretty awesome and very appreciated! 🎉

@0x534a, would you mind sharing your database? This could help to get the names updated in CAPE.

The SQLite database can be downloaded from my OneDrive using the link https://1drv.ms/u/s!AqNdbwsLZ9qwgw7Z5izJe0OZg9t_?e=badlPF. The structure of the database is not too complex and should mostly be self-explanatory. For example, to search for all arguments of a given API call (in this case RegOpenKeyEx) you can use the following SQL statement:

SELECT a.name AS api_function,
       p.name AS argument_name,
       t.name AS argument_type,
       p.is_in,
       p.is_out,
       p.description
FROM   api_calls a,
       api_call_params p,
       types t
WHERE  p.api_call_id = a.id
       AND p.type_id = t.id
       AND a.name = 'RegOpenKeyEx'
       AND a.target_os = 'windows'
ORDER  BY p.id ASC;

Some constraints:

  • The database does not include structs or enums. So, no nested structures of arguments can be found.
  • The position of an argument is not explicitly stated in the data as own column. Nevertheless, it can be deduced from the ID of the argument (primary key of the table api_call_params).
  • The database contains API calls for different platforms. To get the best results simply filter by the OS windows or the calling convention WINAPI.
  • Not all of the API calls are documented in the MSDN. For undocumented API calls (especially NTAPI), I scraped the website http://undocumented.ntinternals.net. The site seems to be offline right now. Based on the naming of parameters on the website, I can not guarantee that the argument names always make sense. This is more like a best-effort approach. ;)
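
Since the argument position has to be deduced from the row ID, here is a short Python sketch of how that could be done with the standard `sqlite3` module. The in-memory tables and rows are stand-ins that mirror only the columns visible in the query above; the real database may differ.

```python
import sqlite3

# Stand-in schema and rows, reduced to the columns used below.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE api_calls (id INTEGER PRIMARY KEY, name TEXT, target_os TEXT);
    CREATE TABLE api_call_params (id INTEGER PRIMARY KEY, api_call_id INTEGER, name TEXT);
    INSERT INTO api_calls VALUES (1, 'RegOpenKeyEx', 'windows');
    INSERT INTO api_call_params VALUES
        (10, 1, 'hKey'), (11, 1, 'lpSubKey'), (12, 1, 'ulOptions'),
        (13, 1, 'samDesired'), (14, 1, 'phkResult');
    """
)


def argument_positions(conn: sqlite3.Connection, api: str, target_os: str = "windows") -> dict[str, int]:
    """Return {argument name: zero-based position}, ordering parameters by primary key."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM api_calls a
        JOIN api_call_params p ON p.api_call_id = a.id
        WHERE a.name = ? AND a.target_os = ?
        ORDER BY p.id ASC
        """,
        (api, target_os),
    ).fetchall()
    return {name: index for index, (name,) in enumerate(rows)}


print(argument_positions(conn, "RegOpenKeyEx"))
```

This relies on parameters having been inserted in declaration order, which is what the constraint above suggests but does not guarantee.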

If there are any questions, I'm happy to help.

@mr-tz
Collaborator Author

mr-tz commented Jul 6, 2023

Great, thank you very much!!

@yelhamer yelhamer moved this from todo to next up in @yelhamer GSoC 2023 Aug 2, 2023
@yelhamer yelhamer linked a pull request Aug 3, 2023 that will close this issue
3 tasks
Projects
Status: next up
6 participants