[filebeat] VirusTotal Livehunt dataset - WIP #21815
Conversation
💔 Tests Failed
Test stats 🧪
Test | Results
---|---
Failed | 1
Passed | 5135
Skipped | 574
Total | 5710
Genuine test errors
💔 There are test failures but no known flaky tests; this is most likely a genuine test failure.
- Name: Build&Test / x-pack/filebeat-build / test_fileset_file_150_virustotal – x-pack.filebeat.tests.system.test_xpack_modules.XPackTest
Force-pushed from a72c033 to 5a7e08b
- Provides input directly from VT API using key or via kafka topic (see the sketch below)
- Implements filebeat transforms for many common [file object fields](https://developers.virustotal.com/v3.0/reference#files)
- Implements filebeat transforms for many common [PE fields](https://developers.virustotal.com/v3.0/reference#pe_info)
- Implements filebeat transforms for many common [ELF fields](https://developers.virustotal.com/v3.0/reference#elf_info)
- Included some notes in README that I used to help develop and test this
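For illustration only, a minimal sketch of what that dual-input configuration might look like in `modules.d`; the variable names here (`var.input`, `var.api_key`, `var.kafka_hosts`, `var.kafka_topic`) are assumptions, not the module's actual settings:

```yaml
# Hypothetical modules.d/virustotal.yml sketch; variable names are
# illustrative, not the module's actual settings.
- module: virustotal
  livehunt:
    enabled: true
    # Option 1: pull Livehunt notifications straight from the VT API
    var.input: httpjson
    var.api_key: "${VT_API_KEY}"
    # Option 2: consume notifications from a Kafka topic instead
    # var.input: kafka
    # var.kafka_hosts: ["localhost:9092"]
    # var.kafka_topic: "vt-livehunt"
```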
Force-pushed from 23961c8 to 74f346f
VirusTotal ECS RFC
So, I think we're pretty close functionally. I'm going to smooth out some documentation and need to implement some tests, but I'm not sure how that works. If anyone wants to play with this and try to get started, I can give you a hand. I think we're about ready to bounce the schema off the @elastic/ecs team and working group to negotiate extensions and renaming for fields.
Force-pushed from 60f0b3c to da2d1db
Pinging @elastic/security-external-integrations (Team:Security-External Integrations)
This isn't perfect, but it passes the local tests now and has docs by @peasead. I would welcome feedback on structure and/or style.
Took a quick look at some of the field mappings (I haven't done a pass over everything yet), but a while ago I looked at how we'd map some more detailed binary data info (based on an experiment) into ECS-style fields, and accordingly I've highlighted some of the PE/ELF info in this PR that I had thoughts about.
Is there a plan to do any Mach-O binaries?
description: >
  Number of ELF Section Headers.
type: long
- name: sections
I'm not entirely sure, given that VirusTotal returns information about whether artifacts are malicious to begin with, but I imagine that entropy calculations and/or hashes might be useful to retain here.
Yes, the intention is to keep all the info for the time being. If users don't want a particular fieldset, it can be dropped in the filebeat config or ingest processor. The section data has chi2 calculations and entropy. VirusTotal doesn't provide an overall status of malicious or benign, but offers community votes on that, plus individual engine assessments.
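As a sketch of dropping an unwanted fieldset in the filebeat config, something like the following `drop_fields` processor would do it; the field name `virustotal.pe.sections` is illustrative only:

```yaml
# filebeat.yml sketch: drop a fieldset the user doesn't want.
# The field name below is hypothetical, not the module's actual field.
processors:
  - drop_fields:
      fields: ["virustotal.pe.sections"]
      ignore_missing: true
```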
Similar to the comment about imported symbols below, we could normalize section data with something like `file.*.sections`:
{
"virtual_address": 4096,
"size": 2353664,
"entropy": 6.37,
"name": ".text",
"flags": "rx"
}
After working through several examples and reading up on the various binary executable formats, I've come up with this. Thoughts, @andrewstucki?
// Abstract structure for all binary types, missing fields for a given data source will be excluded
{
name: "[keyword] Name of code section",
physical_offset: "[keyword] Offset of the section from the beginning of the segment, in hex",
physical_size: "[long] Size of the code section in the file in bytes",
virtual_address: "[keyword] relative virtual memory address when loaded",
virtual_size: "[long] Size of the section in bytes when loaded into memory",
flags: "[keyword] List of flag values as strings for this section",
type: "[keyword] Section type as string, if applicable",
segment_name: "[keyword] Name of segment for this section, if applicable"
}
// Mach-O example
{
file.macho.sections: [
{
name: "__nl_symbol_ptr",
flags: ["S_8BYTE_LITERALS"],
type: "S_CSTRING_LITERALS",
segment_name: "__DATA"
}, ...
]
}
// ELF example
{
file.elf.sections: [
{
name: ".data",
physical_offset: "0x3000",
physical_size: 16,
virtual_address: "0x4000",
flags: ["WA"], // This is how VT presents the data. Pretty sure this maps to ["WRITE", "ALLOC"], but I don't have an exhaustive mapping
type: "PROGBITS"
}, ...
]
}
// PE example
{
file.pe.sections: [
{
name: ".data",
physical_size: 2542592,
virtual_address: "0x2DE000",
virtual_size: 2579264,
flags: ["rw"], // Again, this is how VT presents it. Likely maps to ["MEM_READ", "MEM_WRITE"], but I don't have an exhaustive mapping
type: ".data",
entropy: 6.83,
chi2: 13360996
}, ...
]
}
I'm least pleased by my Mach-O example, but I think that's mostly limited to how VT provides the data currently. It provides offset info for each segment, and then lists the sections that exist within the segment with no info at all. That's the only reason, I think, to even mention the segment name, though it could be omitted and each segment's data could instead carry a list of its included sections.
Finally, I think this at least works for a common fieldset of section data. The flags we can improve over time since it will be a list of keywords, and for PE, I think it's hard-coded as an attribute of the section name/type.
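A minimal `fields.yml` sketch of that abstract structure, assuming a `nested` type; the field names come from the proposal above and the types are assumptions, not a final mapping:

```yaml
# Hypothetical fields.yml sketch of the abstract section structure;
# fields missing from a given data source would simply be omitted.
- name: sections
  type: nested
  description: >
    Section metadata common across PE, ELF, and Mach-O binaries.
  fields:
    - name: name
      type: keyword
    - name: physical_offset
      type: keyword
    - name: physical_size
      type: long
    - name: virtual_address
      type: keyword
    - name: virtual_size
      type: long
    - name: flags
      type: keyword
    - name: type
      type: keyword
    - name: segment_name
      type: keyword
    - name: entropy
      type: double   # as in the PE example above
    - name: chi2
      type: long     # as in the PE example above
```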
@dcode that looks pretty similar to what I was thinking. Thinking through the flags bit trips me up a little too. I'm thinking that eventually we may want some verbiage in the description that says something to the effect of "use whatever constant name is found in the spec/OS headers" >_>. If we wanted to be strict about it, a VT filebeat module could always just normalize the VT payload to whatever we wanted.

Also, for reference, sections do have offset and size info associated with them, so despite the VT API shortcomings, I'm pretty sure the same fields would still be useful. I'd be fine suggesting the entropy and chi2 calculations as fields too, at least as a first pass, in the RFC. Statistical byte calculations seem pretty common on the binary analysis side of security.
Agree on all points. On it.
Type of exported symbol
type: keyword
default_field: false
- name: imports
Wondering if it would make sense for this level to describe an actual linked-in library, and for the stuff currently nested here (i.e. name, type, etc.) to specify the symbols imported by that library. Otherwise you get symbols free of context from where they're actually being imported.
I agree. Ideally, I'd like to see a common representation across ELF, PE, and Mach-O. Unfortunately, these formats don't work the same, especially in the way they import symbols. I think making exports and imports nested rather than a group makes sense to maintain context (see the sketch below). Making these a nested dictionary with common fields for each binary type might be the right answer. Not all binary types will have all fields populated, but at least it's consistent across formats. I'll play with this.
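As a hypothetical `fields.yml` fragment illustrating the distinction (not the actual mapping): `nested` keeps each import's fields correlated as one object, whereas `group` would index them as independent flat arrays and lose that correlation.

```yaml
# Hypothetical sketch: nested preserves per-symbol context,
# group would flatten name/type/library into parallel arrays.
- name: imports
  type: nested
  fields:
    - name: name
      type: keyword
    - name: type
      type: keyword
    - name: library_name
      type: keyword
```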
So, it's totally possible to try and resolve the libraries that symbols come from in ELF format, see example. There's actually default support for resolving the libraries in Go's standard library. I couldn't tell if the ndjson example that is dropped here actually does that as part of the VirusTotal service for ELF files, but if it does, it would probably make sense to scope these.

Edit: BTW, Go supports this through the GNU symbol versioning tables introduced to support Linux dynamic symbol versioning, so if a symbol isn't versioned, you'll be hard-pressed to get this information from the binary itself.
I'd really like to have a common interface across PE, ELF, and Mach-O for this. LIEF actually does this as an analytic framework, but VT doesn't expose this data equally across all binary types. We could implement a common fieldset for imports, and applications can populate it as they are able to.

Proposal: `file.*.imported_symbols`:
{
"name": "my_symbol", "size": 0, "value": 0, "type": "function", "library_name": "my_library.dll"
}
In the case of PE, the VT data would permit populating the symbol name and library name, and we can derive a type of "function". For ELF, the data provides symbol name and type (in the samples I've seen). For Mach-O... VT doesn't give us any symbols, just a list of linked libraries, which could feasibly go somewhere else as a flat list, say file.*.linked_libraries.

Anything not provided by the source (VT in this case) would be omitted. Another application could feasibly populate this data with much greater detail. The library_name for ELF could be resolved as you say, but it's not coded in the binary specifically (I think).
type: flattened
description: >
  If the PE contains resources, some info about them
- name: resource_languages
Just wondering, does VT return language/type information tied to the specific resources it's enumerating? Because I would imagine this and the field below would show up in resource_details, albeit not aggregated.
Yes, that's correct. resource_types and resource_languages are summaries of resource_details. If I had an exhaustive list of the keys for languages and details, it'd be great not to flatten them, providing easy access to this data for aggregations and leaving resource_details as a nested type for more complex analysis and visualization.

Here's an example:
"resource_details": [
{
"chi2": 40609.63671875,
"entropy": 3.079699754714966,
"filetype": "Data",
"lang": "NEUTRAL",
"sha256": "87ab855ab53879e5b1a7e59e7958e22512440c50627115ae5758f5f5f5685e79",
"type": "RT_ICON"
},
{
"chi2": 22370.37890625,
"entropy": 2.9842348098754883,
"filetype": "Data",
"lang": "NEUTRAL",
"sha256": "60457334b5385635e2d6d5edc75619dd5dcd5b7f015d7653ab5a37520a52f5c4",
"type": "RT_ICON"
},
{
"chi2": 27408.888671875,
"entropy": 2.968428611755371,
"filetype": "ASCII text",
"lang": "NEUTRAL",
"sha256": "a67c8c551025a684511bd5932b5ad7575b352653135326587054532d5e58ab2b",
"type": "RT_STRING"
}
],
"resource_langs": {
"NEUTRAL": 14
},
"resource_types": {
"RT_GROUP_ICON": 1,
"RT_ICON": 2,
"RT_RCDATA": 3,
"RT_STRING": 7,
"RT_VERSION": 1
},
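A hypothetical `fields.yml` sketch of the approach described above (nested details, flattened summaries); the names follow this discussion, not a final schema:

```yaml
# Hypothetical sketch: detailed records stay nested for analysis,
# the per-type/per-language counts stay flattened for aggregations.
- name: resource_details
  type: nested
  description: >
    Per-resource metadata (chi2, entropy, filetype, lang, sha256, type).
- name: resource_types
  type: flattened
  description: >
    Counts of resources by type, summarizing resource_details.
- name: resource_languages
  type: flattened
  description: >
    Counts of resources by language, summarizing resource_details.
```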
Compile timestamp of the PE file.
type: date
- name: packers
Why have this and the flattened field?
This can probably actually be removed. I restructured virustotal.packers because the data returned is consistent for both ELFs and PEs, including the analysis tool name and the resulting value. This isn't what the docs said, though, so this was an attempt to provide a consistent interface with the ELF data. I'll axe it.
type: keyword
description: >
  Version of the compiler product.
- name: rich_pe_header_hash
Does it make sense to make this into rich_header.hash.*? I would imagine that some other forensics from rich headers might be useful in other PE parsing implementations.
That's a good point. Since it's PE specific, maybe we treat it like authentihash. We could put them all under file.pe.hash.* with authentihash, rich_header_hash, and imphash. Similarly, ELF would have file.elf.hash.telfhash (see the sketch below).
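A hypothetical `fields.yml` sketch of that `file.pe.hash.*` grouping; names taken from this comment, types assumed:

```yaml
# Hypothetical sketch: all PE-specific hashes under one group
- name: hash
  type: group
  fields:
    - name: authentihash
      type: keyword
    - name: rich_header_hash
      type: keyword
    - name: imphash
      type: keyword
```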
I guess I was thinking more along the lines of making it possible for someone to actually namespace whatever parsing might be done on the rich header itself. Say, if someone wanted to try and actually parse out the artifact ids/counts from the rich header itself, then by doing something like pe.rich_header.hash.* you could allow for someone else to go in and do something like pe.rich_header.entries.

Additionally, I believe that most of the time the hash for a rich header is usually just an MD5 of the bytes in the rich header, correct? In which case pe.rich_header.hash.md5 would make sense to me.
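A hypothetical sketch of that namespacing, leaving room for future rich-header parsing alongside the hash; this is an assumption about the eventual shape, not a settled mapping:

```yaml
# Hypothetical sketch: pe.rich_header.hash.md5 now, with room for
# future fields such as rich_header.entries (parsed ids/counts).
- name: rich_header
  type: group
  fields:
    - name: hash
      type: group
      fields:
        - name: md5
          type: keyword
```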
Thanks for the comments, @andrewstucki. We have opened an issue to extend the PE fieldset and create the ELF fieldset. We have the Mach-O data, but wanted to wait and see how the other two issues were handled and whether we needed to make an RFC for either of them. Once we know whether a new sub-fieldset (like ELF, and also Mach-O) needs an RFC or not, we planned on opening the Mach-O issue in the proper way. That said, if you'd prefer we open the Mach-O issue now with our dataset, we certainly can.
@peasead thanks for the heads up about the two issues. This module doesn't necessarily require the ECS extensions prior to getting merged as a module. That said, if we do decide to merge it before the field extensions firm up, then we ought to make sure we don't break ECS (if any of these fields become official in the future with different types) and potentially consider shoving these fields into a new namespace. WRT the Mach-O format, no need to necessarily figure that out first; it was more of a question about where you were going to go with this eventually.
Since elastic#23183 was merged, `fields.yml` can now properly specify types for nested object properties.
This pull request is now in conflict. Could you fix it? 🙏
This pull request does not have a backport label. Could you fix it @dcode? 🙏
@dcode - Closing this one as there was no activity for a while.
THIS IS CURRENTLY IN DRAFT

What does this PR do?

Adds initial support for streaming VirusTotal Livehunt data via the Filebeat `httpjson` input from the VT API endpoint, or via a `kafka` broker input, allowing a multi-stage pipeline (also helpful for testing).

Why is it important?

Data from VirusTotal (VT) is important for threat research. The Livehunt feature allows organizations to enable one or many YARA rules in one or many rulesets. This module uses the Livehunt Notification API to stream VT `file` objects into an ECS-compatible mapping, where possible, and an ECS-styled mapping elsewhere.

VirusTotal is just one source of `file` events, which are a bit different than other security-related logging. Making this data available and standardized in Elasticsearch will allow analysis that combines the existing security event logging from network and endpoints with the file objects that traverse those mediums.

Checklist

- CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

How to test this PR locally

My current testing procedures are documented in `x-pack/filebeat/module/virustotal/README.md`. I will attach raw ndjson logs that contain a sample of original events covering the use cases.

Related issues

Use cases

Screenshots

Logs

TODO