[filebeat] VirusTotal Livehunt dataset - WIP #21815

Closed
wants to merge 36 commits into master from dcode/virustotal-module

Conversation

@dcode (Contributor) commented Oct 14, 2020

THIS IS CURRENTLY IN DRAFT

What does this PR do?

Adds initial support for streaming VirusTotal Livehunt data via the Filebeat httpjson input (polling the VT API endpoint) or via a Kafka broker input, which allows a multi-stage pipeline (also helpful for testing).
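
For illustration, a minimal sketch of what enabling the module might look like in modules.d (the var names below are hypothetical placeholders, not the PR's actual settings; see the module README for the real configuration):

# modules.d/virustotal.yml (hypothetical sketch; variable names are placeholders)
- module: virustotal
  livehunt:
    enabled: true
    # Poll the Livehunt notification API directly...
    var.input: httpjson
    var.api_key: "${VT_API_KEY}"
    # ...or consume the same events from a Kafka broker instead:
    # var.input: kafka
    # var.kafka_hosts: ["localhost:9092"]
    # var.kafka_topic: "vt-livehunt"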

Why is it important?

Data from VirusTotal (VT) is important for threat research. The Livehunt feature allows organizations to enable one or many YARA rules in one or many rulesets. This module uses the Livehunt Notification API to stream VT file objects into an ECS-compatible mapping, where possible, and an ECS-styled mapping elsewhere.

VirusTotal is just one source of file events, which differ somewhat from other security-related logging. Making this data available and standardized in Elasticsearch will allow analysis that combines existing security event logging from network and endpoints with the file objects that traverse those mediums.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

How to test this PR locally

My current testing procedures are documented in x-pack/filebeat/module/virustotal/README.md. I will attach raw ndjson logs that contain a sample of original events covering the use cases.

Related issues

Use cases

Feature: VirusTotal Livehunt dataset

  Scenario: Poll VirusTotal Livehunt HTTP API
    When Filebeat polls the Livehunt notification API
    Then it receives a set of the most recent notifications of file objects
    And common file and notification metadata is transformed into standardized mappings

  # The second scenario is the same as the first, except events are consumed from Kafka
  Scenario: Subscribe to Kafka topic to consume livehunt events
    When Filebeat consumes livehunt events from a Kafka broker
    Then it receives a set of the most recent notifications of file objects
    And common file and notification metadata is transformed into standardized mappings

  # This scenario picks up after the above 2 scenarios
  Scenario: Process ELF file notifications
    When Filebeat consumes a livehunt event
    And metadata indicates the file is an ELF object
    Then format ELF-related metadata into standardized ELF fields and common file fields

  # This scenario picks up after the above 2 scenarios
  Scenario: Process PE file notifications
    When Filebeat consumes a livehunt event
    And metadata indicates the file is a PE object
    Then format PE-related metadata into standardized PE fields and common file fields

Screenshots

[screenshot]

Logs

TODO

@botelastic bot added the needs_team label Oct 14, 2020
@dcode self-assigned this Oct 14, 2020
@botelastic bot removed the needs_team label Oct 14, 2020
@elasticmachine (Collaborator) commented Oct 14, 2020

💔 Tests Failed



Build stats

  • Build Cause: Pull request #21815 updated
  • Start Time: 2021-01-21T20:16:38.940+0000
  • Duration: 50 min 53 sec
  • Commit: f78e752

Test stats 🧪

Test Results: 1 failed, 5135 passed, 574 skipped, 5710 total

Test errors 1


Build&Test / x-pack/filebeat-build / test_fileset_file_150_virustotal – x-pack.filebeat.tests.system.test_xpack_modules.XPackTest
    Error details:

     Exception: Key 'virustotal.packers' found in event is not documented! 
    

    Stacktrace:

     a = (<test_xpack_modules.XPackTest testMethod=test_fileset_file_150_virustotal>,)
    
        @wraps(func)
        def standalone_func(*a):
    >       return func(*(a + p.args), **p.kwargs)
    
    ../../build/ve/docker/lib/python3.7/site-packages/parameterized/parameterized.py:518: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    ../../filebeat/tests/system/test_modules.py:99: in test_fileset_file
        cfgfile=cfgfile)
    ../../filebeat/tests/system/test_modules.py:183: in run_on_file
        self.assert_fields_are_documented(obj)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <test_xpack_modules.XPackTest testMethod=test_fileset_file_150_virustotal>
    evt = {'@timestamp': '2020-10-03T21:06:13.000Z', 'agent': {'ephemeral_id': 'e514d923-3c34-4a94-af99-a406ce4155b7', 'id': 'a7... '2021-01-21T20:51:49.271Z', 'dataset': 'virustotal.livehunt', 'ingested': '2021-01-21T20:51:50.425388415Z', ...}, ...}
    
        def assert_fields_are_documented(self, evt):
            """
            Assert that all keys present in evt are documented in fields.yml.
            This reads from the global fields.yml, means `make collect` has to be run before the check.
            """
            expected_fields, dict_fields, aliases = self.load_fields()
            flat = self.flatten_object(evt, dict_fields)
        
            def field_pattern_match(pattern, key):
                pattern_fields = pattern.split(".")
                key_fields = key.split(".")
                if len(pattern_fields) != len(key_fields):
                    return False
                for i in range(len(pattern_fields)):
                    if pattern_fields[i] == "*":
                        continue
                    if pattern_fields[i] != key_fields[i]:
                        return False
                return True
        
            def is_documented(key, docs):
                if key in docs:
                    return True
                for pattern in (f for f in docs if "*" in f):
                    if field_pattern_match(pattern, key):
                        return True
                return False
        
            for key in flat.keys():
                metaKey = key.startswith('@metadata.')
                # Range keys as used in 'date_range' etc will not have docs of course
                isRangeKey = key.split('.')[-1] in ['gte', 'gt', 'lte', 'lt']
                if not(is_documented(key, expected_fields) or metaKey or isRangeKey):
    >               raise Exception("Key '{}' found in event is not documented!".format(key))
    E               Exception: Key 'virustotal.packers' found in event is not documented!
    
    ../../libbeat/tests/system/beat/beat.py:729: Exception 
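
For context, this failure means the pipeline emitted a virustotal.packers key with no matching entry in the module's fields.yml. A sketch of the kind of entry that would satisfy the check (the path and exact definition are assumptions; the real fix belongs in the PR):

# x-pack/filebeat/module/virustotal/_meta/fields.yml (hypothetical excerpt)
- name: packers
  type: flattened
  description: >
    Packer identifications reported by the analysis tools VirusTotal runs
    against the file.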
    

Steps errors 4


filebeat-Lint - make -C filebeat check; make -C filebeat update; make check-no-changes;
  • Took 2 min 36 sec. View more details here.
x-pack/filebeat-Lint - make -C x-pack/filebeat check; make -C x-pack/filebeat update; make check-no-changes;
  • Took 2 min 1 sec. View more details here.
x-pack/filebeat-build - mage build test
  • Took 29 min 32 sec. View more details here.
Error signal
  • Took 0 min 0 sec. View more details here.
  • Description: Error 'hudson.AbortException: script returned exit code 2'

Log output (last 100 lines)

[2021-01-21T21:02:20.980Z] FAILED tests/system/test_xpack_modules.py::XPackTest::test_fileset_file_150_virustotal
[2021-01-21T21:02:20.980Z] ================== 1 failed, 305 passed in 1274.19s (0:21:14) ==================
[2021-01-21T21:02:21.238Z] >> python test: Integration Testing Complete
[2021-01-21T21:02:21.238Z] Error: running "/go/src/github.com/elastic/beats/build/ve/docker/bin/pytest --timeout=90 --durations=20 --junit-xml=build/TEST-python-integration.xml tests/system/test_filebeat_xpack.py tests/system/test_http_endpoint.py tests/system/test_xpack_modules.py" failed with exit code 1
[2021-01-21T21:02:24.526Z] Error: running "docker-compose -p filebeat_8_0_0_89772cbfef-snapshot run -e DOCKER_COMPOSE_PROJECT_NAME=filebeat_8_0_0_89772cbfef-snapshot -e BEAT_STRICT_PERMS=false -e STACK_ENVIRONMENT=snapshot -e TESTING_ENVIRONMENT=snapshot -e GOCACHE=/go/src/github.com/elastic/beats/build/docker-gocache -v /var/lib/jenkins/workspace/Beats_beats_PR-21815/pkg/mod/cache/download:/gocache:ro -e GOPROXY=file:///gocache,direct -e EXEC_UID=1158 -e EXEC_GID=1159 -e TEST_COVERAGE=true -e RACE_DETECTOR=true -e TEST_TAGS=null,oracle -e MODULE=virustotal -e BEATS_INSIDE_INTEGRATION_TEST_ENV=true -e GOFLAGS=-mod=readonly beat /go/src/github.com/elastic/beats/x-pack/filebeat/build/mage-linux-amd64 pythonIntegTest" failed with exit code 1
[2021-01-21T21:02:24.871Z] Client: Docker Engine - Community
[2021-01-21T21:02:24.871Z]  Version:           20.10.2
[2021-01-21T21:02:24.871Z]  API version:       1.41
[2021-01-21T21:02:24.871Z]  Go version:        go1.13.15
[2021-01-21T21:02:24.871Z]  Git commit:        2291f61
[2021-01-21T21:02:24.871Z]  Built:             Mon Dec 28 16:17:32 2020
[2021-01-21T21:02:24.871Z]  OS/Arch:           linux/amd64
[2021-01-21T21:02:24.871Z]  Context:           default
[2021-01-21T21:02:24.871Z]  Experimental:      true
[2021-01-21T21:02:24.871Z] 
[2021-01-21T21:02:24.871Z] Server: Docker Engine - Community
[2021-01-21T21:02:24.871Z]  Engine:
[2021-01-21T21:02:24.871Z]   Version:          20.10.2
[2021-01-21T21:02:24.871Z]   API version:      1.41 (minimum version 1.12)
[2021-01-21T21:02:24.871Z]   Go version:       go1.13.15
[2021-01-21T21:02:24.871Z]   Git commit:       8891c58
[2021-01-21T21:02:24.871Z]   Built:            Mon Dec 28 16:15:09 2020
[2021-01-21T21:02:24.871Z]   OS/Arch:          linux/amd64
[2021-01-21T21:02:24.871Z]   Experimental:     false
[2021-01-21T21:02:24.871Z]  containerd:
[2021-01-21T21:02:24.871Z]   Version:          1.4.3
[2021-01-21T21:02:24.871Z]   GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
[2021-01-21T21:02:24.871Z]  runc:
[2021-01-21T21:02:24.871Z]   Version:          1.0.0-rc92
[2021-01-21T21:02:24.871Z]   GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
[2021-01-21T21:02:24.871Z]  docker-init:
[2021-01-21T21:02:24.871Z]   Version:          0.19.0
[2021-01-21T21:02:24.871Z]   GitCommit:        de40ad0
[2021-01-21T21:02:24.871Z] Unable to find image 'alpine:3.4' locally
[2021-01-21T21:02:25.807Z] 3.4: Pulling from library/alpine
[2021-01-21T21:02:26.066Z] c1e54eec4b57: Pulling fs layer
[2021-01-21T21:02:26.325Z] c1e54eec4b57: Verifying Checksum
[2021-01-21T21:02:26.325Z] c1e54eec4b57: Download complete
[2021-01-21T21:02:26.325Z] c1e54eec4b57: Pull complete
[2021-01-21T21:02:26.325Z] Digest: sha256:b733d4a32c4da6a00a84df2ca32791bb03df95400243648d8c539e7b4cce329c
[2021-01-21T21:02:26.325Z] Status: Downloaded newer image for alpine:3.4
[2021-01-21T21:02:28.515Z] + python .ci/scripts/pre_archive_test.py
[2021-01-21T21:02:30.419Z] Copy ./x-pack/filebeat/build into build/x-pack/filebeat/build
[2021-01-21T21:02:30.429Z] Running in /var/lib/jenkins/workspace/Beats_beats_PR-21815/src/github.com/elastic/beats/build
[2021-01-21T21:02:30.729Z] + rm -rf ve
[2021-01-21T21:02:30.729Z] + find . -type d -name vendor -exec rm -r {} ;
[2021-01-21T21:02:30.741Z] Recording test results
[2021-01-21T21:02:31.892Z] [Checks API] No suitable checks publisher found.
[2021-01-21T21:02:32.246Z] + tar --version
[2021-01-21T21:02:32.604Z] + tar --exclude=test-build-artifacts-x-pack/filebeat-build.tgz -czf test-build-artifacts-x-pack/filebeat-build.tgz .
[2021-01-21T21:03:19.431Z] [INFO] Override default googleStorageUpload with some sleep
[2021-01-21T21:03:19.442Z] Sleeping for 1 min 38 sec
[2021-01-21T21:04:57.458Z] [Google Cloud Storage Plugin] Found 1 files to upload from pattern: test-build-artifacts-x-pack/filebeat-build.tgz
[2021-01-21T21:04:57.837Z] [Google Cloud Storage Plugin] Uploading: test-build-artifacts-x-pack/filebeat-build.tgz
[2021-01-21T21:05:09.958Z] + python .ci/scripts/search_system_tests.py
[2021-01-21T21:05:09.974Z] [INFO] system-tests='build/x-pack/filebeat/build/system-tests'. If no empty then let's create a tarball
[2021-01-21T21:05:10.299Z] + tar --version
[2021-01-21T21:05:10.600Z] + tar --exclude=x-pack-filebeat--system-tests-linux.tgz -czf x-pack-filebeat--system-tests-linux.tgz build/x-pack/filebeat/build/system-tests
[2021-01-21T21:05:37.167Z] [INFO] Override default googleStorageUpload with some sleep
[2021-01-21T21:05:37.178Z] Sleeping for 41 sec
[2021-01-21T21:06:18.191Z] [Google Cloud Storage Plugin] Found 1 files to upload from pattern: x-pack-filebeat--system-tests-linux.tgz
[2021-01-21T21:06:18.251Z] [Google Cloud Storage Plugin] Uploading: x-pack-filebeat--system-tests-linux.tgz
[2021-01-21T21:06:24.847Z] Client: Docker Engine - Community
[2021-01-21T21:06:24.847Z]  Version:           20.10.2
[2021-01-21T21:06:24.847Z]  API version:       1.41
[2021-01-21T21:06:24.847Z]  Go version:        go1.13.15
[2021-01-21T21:06:24.847Z]  Git commit:        2291f61
[2021-01-21T21:06:24.847Z]  Built:             Mon Dec 28 16:17:32 2020
[2021-01-21T21:06:24.847Z]  OS/Arch:           linux/amd64
[2021-01-21T21:06:24.847Z]  Context:           default
[2021-01-21T21:06:24.847Z]  Experimental:      true
[2021-01-21T21:06:24.847Z] 
[2021-01-21T21:06:24.847Z] Server: Docker Engine - Community
[2021-01-21T21:06:24.847Z]  Engine:
[2021-01-21T21:06:24.847Z]   Version:          20.10.2
[2021-01-21T21:06:24.847Z]   API version:      1.41 (minimum version 1.12)
[2021-01-21T21:06:24.847Z]   Go version:       go1.13.15
[2021-01-21T21:06:24.847Z]   Git commit:       8891c58
[2021-01-21T21:06:24.847Z]   Built:            Mon Dec 28 16:15:09 2020
[2021-01-21T21:06:24.847Z]   OS/Arch:          linux/amd64
[2021-01-21T21:06:24.847Z]   Experimental:     false
[2021-01-21T21:06:24.847Z]  containerd:
[2021-01-21T21:06:24.847Z]   Version:          1.4.3
[2021-01-21T21:06:24.847Z]   GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
[2021-01-21T21:06:24.847Z]  runc:
[2021-01-21T21:06:24.847Z]   Version:          1.0.0-rc92
[2021-01-21T21:06:24.847Z]   GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
[2021-01-21T21:06:24.847Z]  docker-init:
[2021-01-21T21:06:24.847Z]   Version:          0.19.0
[2021-01-21T21:06:24.847Z]   GitCommit:        de40ad0
[2021-01-21T21:06:30.607Z] Failed in branch x-pack/filebeat-build
[2021-01-21T21:06:30.684Z] Stage "Packaging" skipped due to earlier failure(s)
[2021-01-21T21:06:30.728Z] Running in /var/lib/jenkins/workspace/Beats_beats_PR-21815/src/github.com/elastic/beats
[2021-01-21T21:06:30.983Z] Running on Jenkins in /var/lib/jenkins/workspace/Beats_beats_PR-21815
[2021-01-21T21:06:31.057Z] [INFO] getVaultSecret: Getting secrets
[2021-01-21T21:06:31.138Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2021-01-21T21:06:31.742Z] + chmod 755 generate-build-data.sh
[2021-01-21T21:06:31.742Z] + ./generate-build-data.sh https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-21815/ https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-21815/runs/5 FAILURE 2992541
[2021-01-21T21:06:32.292Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-21815/runs/5/steps/?limit=10000 -o steps-info.json
[2021-01-21T21:06:32.843Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-21815/runs/5/tests/?status=FAILED -o tests-errors.json

🐛 Flaky test report

❕ There are test failures but not known flaky tests.

Test stats 🧪

Test Results: 1 failed, 5135 passed, 574 skipped, 5710 total

Genuine test errors 1

💔 There are test failures but not known flaky tests, most likely a genuine test failure.

  • Name: Build&Test / x-pack/filebeat-build / test_fileset_file_150_virustotal – x-pack.filebeat.tests.system.test_xpack_modules.XPackTest

@dcode force-pushed the dcode/virustotal-module branch from a72c033 to 5a7e08b on October 15, 2020
dcode and others added 4 commits October 16, 2020 10:34
- Provides input directly from VT API using key or via kafka topic
- Implements filebeat transforms for many common [file object
fields](https://developers.virustotal.com/v3.0/reference#files)
- Implements filebeat transforms for many common [PE
fields](https://developers.virustotal.com/v3.0/reference#pe_info)
- Implements filebeat transforms for many common [ELF
fields](https://developers.virustotal.com/v3.0/reference#elf_info)
- Included some notes in README that I used to help develop and test this
@dcode force-pushed the dcode/virustotal-module branch from 23961c8 to 74f346f on October 16, 2020
@peasead (Contributor) commented Oct 16, 2020

VirusTotal ECS RFC
elastic/ecs#1034

@dcode (Contributor, Author) commented Oct 16, 2020

So, I think we're pretty close functionally. Gonna smooth out some documentation and need to implement some tests... but not sure how that works. If anyone wants to play with this and is trying to get started, I can give you a hand. I think we're about ready to bounce the schema off the @elastic/ecs team and working group to negotiate extensions and renaming for fields.

dcode and others added 9 commits October 19, 2020 13:37
- Provides input directly from VT API using key or via kafka topic
- Implements filebeat transforms for many common [file object
fields](https://developers.virustotal.com/v3.0/reference#files)
- Implements filebeat transforms for many common [PE
fields](https://developers.virustotal.com/v3.0/reference#pe_info)
- Implements filebeat transforms for many common [ELF
fields](https://developers.virustotal.com/v3.0/reference#elf_info)
- Included some notes in README that I used to help develop and test this
@dcode force-pushed the dcode/virustotal-module branch from 60f0b3c to da2d1db on October 19, 2020
@elasticmachine (Collaborator) commented Oct 29, 2020

🐛 Flaky test report

❕ There are test failures but not known flaky tests.

Test stats 🧪

Test Results: 1 failed, 1947 passed, 259 skipped, 2207 total

Genuine test errors 1

💔 There are test failures but not known flaky tests, most likely a genuine test failure.

  • Name: Build&Test / x-pack/filebeat-build / test_fileset_file_100_virustotal – x-pack.filebeat.tests.system.test_xpack_modules.XPackTest

@dcode marked this pull request as ready for review October 30, 2020 20:08
@elasticmachine (Collaborator) commented:

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@dcode (Contributor, Author) commented Oct 30, 2020

This isn't perfect, but it passes the local tests now and has docs by @peasead. I would welcome feedback on structure and/or style.

@andrewstucki left a comment:

Took a quick look at some of the field mappings; haven't done a pass over everything yet, but a while ago I took a look at how we'd map some more detailed binary data info (based on an experiment) into ECS-style fields and, accordingly, highlighted some of the PE/ELF info that I had thoughts about in this PR.

Is there a plan to do any Mach-O binaries?

description: >
Number of ELF Section Headers.
type: long
- name: sections

@andrewstucki:

I'm not entirely sure, given that VirusTotal returns information about whether artifacts are malicious to begin with, but I imagine that entropy calculations and/or hashes might be useful to retain here.

@dcode (Contributor, Author):

Yes, the intention is to keep all the info for the time being. If users don't want a particular fieldset, it can be dropped in the Filebeat config or an ingest processor. The section data has chi2 calculations and entropy. VirusTotal doesn't provide an overall malicious-or-benign verdict, but it offers community votes on that, plus individual engine assessments.
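
As a concrete illustration of dropping an unwanted fieldset in the Filebeat config, the standard drop_fields processor would work (the field named here is just an example):

processors:
  - drop_fields:
      fields: ["virustotal.elf"]
      ignore_missing: true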

@dcode (Contributor, Author) commented Nov 5, 2020:

Similar to the comment about imported symbols below, we could normalize section data with something like:

file.*.sections:

{ 
     "virtual_address": 4096,
     "size": 2353664,
     "entropy": 6.37,
     "name": ".text",
     "flags": "rx"
}

@dcode (Contributor, Author):

After working through several examples and reading up on the various binary executable formats, I've come up with this. Thoughts, @andrewstucki?

// Abstract structure for all binary types, missing fields for a given data source will be excluded
{
    name: "[keyword] Name of code section",
    physical_offset: "[keyword] Offset of the section from the beginning of the segment, in hex",
    physical_size: "[long] Size of the code section in the file in bytes",
    virtual_address: "[keyword] relative virtual memory address when loaded",
    virtual_size: "[long] Size of the section in bytes when loaded into memory",
    flags: "[keyword] List of flag values as strings for this section",
    type: "[keyword] Section type as string, if applicable",
    segment_name: "[keyword] Name of segment for this section, if applicable"
}


// Mach-O example
{
    file.macho.sections: [
        {
            name: "__nl_symbol_ptr",
            flags: ["S_8BYTE_LITERALS"],
            type: "S_CSTRING_LITERALS",
            segment_name: "__DATA"
        }, ...
    ]
}
// ELF example
{
    file.elf.sections: [
        {
            name: ".data",
            physical_offset: "0x3000",
            physical_size: 16,
            virtual_address: "0x4000",
            flags: ["WA"], // This is how VT presents the data. Pretty sure this maps to ["WRITE", "ALLOC"], but I don't have an exhaustive mapping
            type: "PROGBITS"
        }, ...
    ]
}

// PE example

{
    file.pe.sections: [
        {
            name: ".data",
            physical_size: 2542592,
            virtual_address: "0x2DE000",
            virtual_size: 2579264,
            flags: ["rw"], // Again, this is how VT presents it. Likely maps to ["MEM_READ", "MEM_WRITE"], but I don't have an exhaustive mapping
            type: ".data",
            entropy: 6.83,
            chi2: 13360996
        }, ...
    ]
}

I'm least pleased by my Mach-O example, but I think that's mostly limited by how VT provides the data currently. It provides offset info for each segment, and then lists the sections that exist within each segment with no info at all. That is the only reason, I think, to even mention the segment name, though it could be omitted and each segment's data could instead carry a list of its included sections.

Finally, I think this at least works as a common fieldset for section data. The flags we can improve over time since it will be a list of keywords; for PE, I think it's hard-coded as an attribute of the section name/type.

@andrewstucki:

@dcode that looks pretty similar to what I was thinking. Thinking through the flags bit trips me up a little too--I'm thinking that eventually we may want some verbiage in the description that says something to the effect of "use whatever constant name is found in the spec/OS headers" >_>. If we wanted to be strict about it, a VT filebeat module could always just normalize the VT payload to whatever we wanted.

Also, for reference, sections do have offset and size info associated with them, so despite the VT API shortcomings, pretty sure the same fields would still be useful. I'd be fine suggesting the entropy and chi2 calculations as fields too, at least as a first pass, in the RFC. Statistical byte calculations seem pretty common on the binary analysis side of security.

@dcode (Contributor, Author):

Agree on all points. On it.

Type of exported symbol
type: keyword
default_field: false
- name: imports

@andrewstucki:

Wondering if it would make sense for this level to describe an actual linked-in library, and for the stuff currently nested here (i.e. name, type, etc.) to specify the symbols imported by that library. Otherwise you get symbols with no context about where they're actually imported from.

@dcode (Contributor, Author):

I agree. Ideally, I'd like to see a common representation across ELF, PE, and Mach-O. Unfortunately, these formats don't work the same way, especially in how they import symbols. I think making exports and imports nested rather than a group makes sense to maintain context. Making these a nested dictionary with common fields for each binary type might be the right answer. Not all binary types will have all fields populated, but the structure would at least be consistent across formats. I'll play with this.

@andrewstucki commented Nov 2, 2020:

So, it's totally possible to try and resolve the libraries that symbols come from in ELF format, see example. There's actually default support for resolving the libraries in Go's standard library. I couldn't tell if the ndjson example that is dropped here actually does that as part of the VirusTotal service for ELF files, but if it does, it would probably make sense to scope these.


Edit: BTW, Go supports this through the GNU symbol versioning tables introduced to support Linux dynamic symbol versioning, so if a symbol isn't versioned, you'll be hard-pressed to get this information from the binary itself.

@dcode (Contributor, Author):

I'd really like to have a common interface across PE, ELF, and Mach-O for this. LIEF actually does this as an analytic framework, but VT doesn't expose this data equally across all binary types. We could implement a common fieldset for imports, and applications can populate it as they are able.

Proposal:

file.*.imported_symbols:

{
  "name": "my_symbol",
  "size": 0,
  "value": 0,
  "type": "function",
  "library_name": "my_library.dll"
}

In the case of PE, the VT data would permit populating the symbol name and library name, and we can derive a type of "function". For ELF, the data provides symbol name and type (in the samples I've seen). For Mach-O, VT doesn't give us any symbols, just a list of linked libraries, which could feasibly go somewhere else as a flat list, say file.*.linked_libraries.

Anything not provided by the source (VT in this case) would be omitted. Another application could feasibly populate this data with much greater detail. The library_name for ELF could be resolved as you say, but it's not coded in the binary specifically (I think).
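
A sketch of how that common fieldset might be declared in fields.yml so each symbol keeps its library context (the nested type and dotted layout here are assumptions, not settled mappings):

- name: imported_symbols
  type: nested
- name: imported_symbols.name
  type: keyword
- name: imported_symbols.type
  type: keyword
- name: imported_symbols.size
  type: long
- name: imported_symbols.value
  type: long
- name: imported_symbols.library_name
  type: keyword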

type: flattened
description: >
If the PE contains resources, some info about them
- name: resource_languages

@andrewstucki:

Just wondering: does VT return language/type information tied to the specific resources it's enumerating? Because I would imagine this and the field below would show up in resource_details, albeit not aggregated.

@dcode (Contributor, Author):

Yes, that's correct. resource_types and resource_languages are summaries of resource_details. If I had an exhaustive list of the keys for languages and details, it'd be great not to flatten them, to provide easy access to this data for aggregations, leaving resource_details as a nested type for more complex analysis and visualization.

Here's an example

                "resource_details": [
                    {
                        "chi2": 40609.63671875,
                        "entropy": 3.079699754714966,
                        "filetype": "Data",
                        "lang": "NEUTRAL",
                        "sha256": "87ab855ab53879e5b1a7e59e7958e22512440c50627115ae5758f5f5f5685e79",
                        "type": "RT_ICON"
                    },
                    {
                        "chi2": 22370.37890625,
                        "entropy": 2.9842348098754883,
                        "filetype": "Data",
                        "lang": "NEUTRAL",
                        "sha256": "60457334b5385635e2d6d5edc75619dd5dcd5b7f015d7653ab5a37520a52f5c4",
                        "type": "RT_ICON"
                    },
                    {
                        "chi2": 27408.888671875,
                        "entropy": 2.968428611755371,
                        "filetype": "ASCII text",
                        "lang": "NEUTRAL",
                        "sha256": "a67c8c551025a684511bd5932b5ad7575b352653135326587054532d5e58ab2b",
                        "type": "RT_STRING"
                    }
                ],
                "resource_langs": {
                    "NEUTRAL": 14
                },
                "resource_types": {
                    "RT_GROUP_ICON": 1,
                    "RT_ICON": 2,
                    "RT_RCDATA": 3,
                    "RT_STRING": 7,
                    "RT_VERSION": 1
                },

Compile timestamp of the PE file.
type: date

- name: packers

@andrewstucki:

Why have this and the flattened field?

@dcode (Contributor, Author):

This can probably be removed, actually. I restructured virustotal.packers because the data returned is consistent for both ELFs and PEs, including the analysis tool name and the resulting value. This isn't what the docs said, though, so this was an attempt to provide a consistent interface with the ELF data. I'll axe it.

type: keyword
description: >
Version of the compiler product.
- name: rich_pe_header_hash

@andrewstucki:

Does it make sense to make this into rich_header.hash.*? I would imagine that some other forensics from rich headers might be useful in other PE parsing implementations.

@dcode (Contributor, Author):

That's a good point. Since it's PE-specific, maybe we treat it like authentihash. We could put them all under file.pe.hash.* with authentihash, rich_header_hash, and imphash. Similarly, ELF would have file.elf.hash.telfhash.

@andrewstucki:

I guess I was thinking more along the lines of making it possible for someone to actually namespace whatever parsing might be done on the rich header itself. Say, if someone wanted to try and actually parse out the artifact ids/counts from the rich header itself, then by doing something like pe.rich_header.hash.* you could allow for someone else to go in and do something like pe.rich_header.entries or something.

Additionally, I believe the hash for a rich header is usually just an MD5 of the bytes in the rich header, correct? In which case pe.rich_header.hash.md5 would make sense to me.
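
Under that naming, the fields.yml entries might look like the following (a sketch of the layout being discussed, not a settled mapping):

- name: pe.rich_header.hash.md5
  type: keyword
  description: >
    MD5 of the bytes in the PE rich header.
# rich_header.* leaves room for sibling fields later, e.g. pe.rich_header.entries
- name: pe.hash.authentihash
  type: keyword
- name: pe.hash.imphash
  type: keyword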

@peasead (Contributor) commented Oct 31, 2020

Thanks for the comments, @andrewstucki

We have opened an issue to extend the PE fieldset and create the ELF fieldset.

We have the Mach-O data, but wanted to wait and see how the other two issues were handled and whether we needed to make an RFC for either of them. Once we know whether a new sub-fieldset (like ELF, and also Mach-O) needs an RFC or not, we planned on opening the Mach-O issue in the proper way.

That said, if you'd prefer we open the Mach-O issue now with our dataset, we certainly can.

@andrewstucki commented:
@peasead thanks for the heads-up about the two issues. This module doesn't necessarily require the ECS extensions prior to getting merged as a module. That said, if we do decide to merge it before the field extensions shore up, then we ought to make sure we don't break ECS (in case any of these fields become official in the future with different types) and potentially consider shoving these fields into a new namespace.

WRT Mach-O format, no need to necessarily figure that out first. More of just a question about where you guys were going to go with this eventually.

@mergify bot (Contributor) commented Aug 3, 2021

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b dcode/virustotal-module upstream/dcode/virustotal-module
git merge upstream/master
git push upstream dcode/virustotal-module

@mergify bot (Contributor) commented Sep 22, 2021

This pull request does not have a backport label. Could you fix it @dcode? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 7.\d branch (\d is the digit)

NOTE: backport-skip has been added to this pull request.

@mergify bot added the backport-skip label Sep 22, 2021
@jlind23 (Collaborator) commented Mar 31, 2022

@dcode - Closing this one as there has been no activity for a while.

@jlind23 closed this Mar 31, 2022
Labels: backport-skip (Skip notification from the automated backport with mergify), enhancement, Filebeat

Successfully merging this pull request may close these issues:
  • [New Module] VirusTotal Intelligence Live Hunt Filebeat Module

5 participants