explore CAPE sandbox report file format #1535

Closed
williballenthin opened this issue Jun 12, 2023 · 8 comments · Fixed by #1546
Labels
documentation Improvements or additions to documentation dynamic related to dynamic analysis flavor question Further information is requested

Comments

@williballenthin
Collaborator

use this issue to describe the interesting parts of the CAPE sandbox report file format. describe how we could extract data into capa-level features.

@williballenthin williballenthin added documentation Improvements or additions to documentation question Further information is requested labels Jun 12, 2023
@williballenthin
Collaborator Author

williballenthin commented Jun 12, 2023

using 0000a65749f5902c4d82ffa701198038f0b4870b00a27cfca109f8f933476d82.json from the avast repo

image

general layout like this:

image

behavior.processes[].calls[] has the API trace:

image

from this we can extract:

  • PID
  • TID
  • return address
  • API trace:
    • API name
    • return value
    • arguments
      • value
      • name
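
A minimal sketch of pulling those fields out of a report. The key names used here (`process_id`, `thread_id`, `caller`, `api`, `return`, `arguments`) are assumptions for illustration and should be checked against an actual report; the sample data is made up.

```python
from typing import Any, Dict, Iterator

# Tiny stand-in for a CAPE report. Key names and values are assumed,
# not copied from a real report.
SAMPLE_REPORT: Dict[str, Any] = {
    "behavior": {
        "processes": [
            {
                "process_id": 2000,
                "calls": [
                    {
                        "thread_id": "2004",
                        "caller": "0x00401000",
                        "api": "RegOpenKeyExW",
                        "return": "0x00000000",
                        "arguments": [
                            {"name": "SubKey", "value": "Software\\Example"},
                        ],
                    }
                ],
            }
        ]
    }
}


def iter_calls(report: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per entry in behavior.processes[].calls[]."""
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            yield {
                "pid": process.get("process_id"),
                "tid": call.get("thread_id"),
                "return_address": call.get("caller"),
                "api": call.get("api"),
                "return_value": call.get("return"),
                # keep arguments as (name, value) pairs, in trace order
                "arguments": [
                    (arg.get("name"), arg.get("value"))
                    for arg in call.get("arguments", [])
                ],
            }


records = list(iter_calls(SAMPLE_REPORT))
```

Flattening into one record per call keeps the PID next to each call, so later steps do not need to walk the nested report again.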

the return address feature could enable a "call stack scope", like "all events found with the same return address".
however, I'm not sure how to interpret the addresses listed there, because the memory map for the process isn't available, so I'm not sure how to restrict these values.
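
One way to picture the "same return address" grouping is the sketch below. It treats each address as an opaque key paired with the PID, which is an assumption made here because addresses are only comparable within one process; the function name and the trace values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (pid, return address, API name); all values below are made up for illustration.
Call = Tuple[int, str, str]


def group_by_return_address(calls: Iterable[Call]) -> Dict[Tuple[int, str], List[str]]:
    """Bucket API names by (pid, return address), preserving trace order."""
    scopes: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for pid, return_address, api in calls:
        scopes[(pid, return_address)].append(api)
    return dict(scopes)


trace = [
    (2000, "0x00401020", "CreateFileW"),
    (2000, "0x00401055", "WriteFile"),
    (2000, "0x00401020", "CreateFileW"),
]
scopes = group_by_return_address(trace)
```

Using the address only as a key sidesteps the missing memory map, but it cannot tell whether an address lies inside the sample or inside a system DLL.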

when the argument is a string, it is parsed as a string, not as a pointer to some memory region:

image

some enums are also parsed into human readable strings:

image

handles are not consistently tracked, such as the hKey referenced here:

image

@williballenthin williballenthin moved this to in progress in @yelhamer GSoC 2023 Jun 12, 2023

@yelhamer
Collaborator

yelhamer commented Jun 13, 2023

Categorizing the report sections by level of utility (redundant, future-use, to-be-used):

to-be-used:

  • static: this report section contains useful information for extracting the file features of a sample, such as: imports and exports, sections, format, as well as other information that can be used in the section scope.

  • strings: this report section contains the strings extracted from the sample as well as files dropped by the sample. this will be useful for extracting string features.

  • network: this section gives a limited overview of the extracted network traffic: protocol, src ip and port, dst ip and port, plus less useful information such as the packets' offset and timestamp. For more in-depth network analysis (content, for example) we'd need to use the extracted pcap files.

  • commands and mutexes: these should be determinable from the call traces, so maybe we'd want to extract them only as string features and not as separate features? If we choose to include a commands feature, then we'd probably want to extract it from the process tree section as well, since the environment variables (including the CommandLine variable) are specified for each process, which means that we could extract commands at a process scope:
    image

  • files, registry keys, and services: I think these should be included for the case where they are manipulated by means of an obfuscated powershell command (which is common), which rules based on api trace matching wouldn't be able to detect. If we choose to include them, I think we should add a member to each feature specifying whether the file/key was created, read, or deleted, or whether the service was created or started.

  • procdump and payloads: these can be used to extract strings, albeit not many. Another use could be to look at the matched rules for each dumped payload/process image and try to extract string/bytes features from those:
    image

  • api calls: this section should yield the api features, as well as number and string features from the arguments.

  • CAPE.config: this section includes the extracted configuration for known malware families. we should return strings from this when available.

  • signatures: can contain several features such as: commands, urls, etc.
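
The "member per feature" idea from the files, registry keys, and services item could look roughly like the sketch below. All class and field names here are hypothetical and are not capa's existing feature classes.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CREATED = "created"
    READ = "read"
    DELETED = "deleted"
    STARTED = "started"  # only meaningful for services


@dataclass(frozen=True)
class File:
    path: str
    action: Action


@dataclass(frozen=True)
class RegistryKey:
    key: str
    action: Action


@dataclass(frozen=True)
class Service:
    name: str
    action: Action

    def __post_init__(self) -> None:
        # services are only reported as created or started
        if self.action not in (Action.CREATED, Action.STARTED):
            raise ValueError(f"invalid service action: {self.action}")


dropped = File(path="C:\\Users\\Public\\a.txt", action=Action.CREATED)
```

Keeping the action as a member rather than a separate feature type would let a rule match either on the path alone or on the path and action together.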

future use:

  • CAPE.payloads: it might be interesting to give users the option in the future to download these payloads and pass them to a static extractor (viv, ida, etc.), which would give capa the ability to unpack/deobfuscate executables.
  • detection2pid: this includes the malware family cape thinks the malware belongs to for each pid.

redundant:

  • behavior.enhanced: this report section contains detected events such as "loads file" or "creates a registry key", all of which should be detectable using capa rules.
  • dropped files: this information can be deduced from the files section as well as api calls.

@yelhamer
Collaborator

extracted features and the associated report locations:

  • api: the call trace for each process
  • strings: strings report section, api arguments, CAPE.config, yara matches (if they include strings), environ section of the process tree, signatures section.
  • numbers: api arguments.
  • bytes: yara matches (if they include bytes).
  • network: network section and pcap files parsing.
  • imports/exports: static section.
  • section names: static section.
  • commands: commands section, the environ field of the process tree section.
  • files: the {created, read, deleted} files section.
  • registry keys: the {created, read, deleted} registry keys section.
  • services: the {created, started} services section.
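
For the "numbers" and "strings" rows, a rough sketch of splitting api arguments into the two feature kinds. It assumes numeric values are serialized either as integers or as decimal / `0x`-prefixed strings, which has not been verified against a real report; the function name is hypothetical.

```python
from typing import Tuple, Union


def classify_argument(value: Union[int, str]) -> Tuple[str, Union[int, str]]:
    """Return ("number", int) when the value parses as an integer, else ("string", str)."""
    if isinstance(value, int):
        return ("number", value)
    text = str(value)
    try:
        # base 0 lets int() accept both "42" and "0x2a"
        return ("number", int(text, 0))
    except ValueError:
        return ("string", text)


features = [
    classify_argument("0x00000010"),
    classify_argument("Software\\Example"),
    classify_argument(5),
]
```

A string argument that happens to contain only digits would be classified as a number here, so a real extractor would probably also need the argument name or the API signature to decide.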

@williballenthin williballenthin added the dynamic related to dynamic analysis flavor label Jun 14, 2023
@yelhamer yelhamer linked a pull request Jun 19, 2023 that will close this issue
@mr-tz
Collaborator

mr-tz commented Jul 6, 2023

The info.version field lists the CAPE version, e.g. 2.2-CAPE, which we currently use from the AVAST database.
We should ensure that this is the version we expect, as I've noticed small differences, e.g. compared to 2.4-CAPE (in how imports are organized).
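
A minimal sketch of such a check, assuming the report is already loaded as a dict. The set of supported versions and the function name are assumptions for illustration only.

```python
from typing import Any, Dict

# Versions this sketch claims to understand; the set itself is an assumption.
SUPPORTED_VERSIONS = {"2.2-CAPE"}


def check_cape_version(report: Dict[str, Any]) -> str:
    """Return info.version, or raise if it is missing or not supported."""
    version = report.get("info", {}).get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported CAPE report version: {version!r}")
    return version


version = check_cape_version({"info": {"version": "2.2-CAPE"}})
```

Failing early on an unexpected version seems safer than silently mis-parsing a section whose layout changed between releases.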

@yelhamer
Collaborator

yelhamer commented Jul 6, 2023

especially once we've added the call scope. once that has been added we should make sure the cape version being used has the msdn names (not the legacy ones).

@doomedraven
Contributor

doomedraven commented Jul 6, 2023

@mr-tz
Collaborator

mr-tz commented Jul 6, 2023

I think the change may have been introduced when you improved the parser reusability (kevoreilly/CAPEv2#763) or before. Maybe I've also made it up when trying to fabricate the data locally 😮
