Define the information represented on the malware vector? #88

kevin3567 · 2022-07-22T16:27:42Z

Hi,

Is there a place where I can find out what malware information is represented in the final 2381-length vector, and how it is laid out? For example:
byte_histogram = malware_vector[:256]
byte_entropy = malware_vector[256:512]
...
I ask because I am trying to train a DNN, but the model performs very poorly and the loss keeps becoming NaN. I am therefore trying to find the features that cause the issue.

Thanks in advance.

lkurlandski · 2022-09-21T21:10:44Z

This is a good question.

Reading the source code, we can see that 9 different types of features make up the entire vector. The length of each feature type can easily be determined from its class definition. Unfortunately, the order in which these feature types are arranged within the final vector depends on which version of Python you are using. Python 3.7+ specifies that iterating through a dict must follow insertion order, so the order can be determined by reading the source code. In earlier versions of Python, dict ordering was an implementation detail, so the order cannot be relied upon.

Here is what I get from reading the code:

ByteHistogram ---------- [0000, 0256)
ByteEntropyHistogram --- [0256, 0512)
StringExtractor -------- [0512, 0616)
GeneralFileInfo -------- [0616, 0626)
HeaderFileInfo --------- [0626, 0688)
SectionInfo ------------ [0688, 0943)
ImportsInfo ------------ [0943, 2223)
ExportsInfo ------------ [2223, 2351)
DataDirectories -------- [2351, 2381)

Note that SectionInfo has 255 dimensions, so the total comes to 2381.
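
As a minimal sketch of how such ranges can be derived rather than typed by hand, the snippet below accumulates per-group dimensions into half-open ranges. The dimensions are assumptions taken from reading the feature classes (with SectionInfo taken as 255, which makes the total match the 2381 length in the question), and `GROUP_DIMS`, `group_ranges`, and `split_vector` are hypothetical names, not part of EMBER.

```python
from itertools import accumulate

# Assumed per-group dimensions, listed in the assumed concatenation order.
GROUP_DIMS = [
    ("ByteHistogram", 256),
    ("ByteEntropyHistogram", 256),
    ("StringExtractor", 104),
    ("GeneralFileInfo", 10),
    ("HeaderFileInfo", 62),
    ("SectionInfo", 255),
    ("ImportsInfo", 1280),
    ("ExportsInfo", 128),
    ("DataDirectories", 30),
]


def group_ranges(group_dims):
    """Return {name: (start, stop)} half-open ranges from ordered (name, dim) pairs."""
    stops = list(accumulate(dim for _, dim in group_dims))
    starts = [0] + stops[:-1]
    return {
        name: (start, stop)
        for (name, _), start, stop in zip(group_dims, starts, stops)
    }


RANGES = group_ranges(GROUP_DIMS)


def split_vector(vector, ranges=RANGES):
    """Slice a flat feature vector into its named groups."""
    return {name: vector[start:stop] for name, (start, stop) in ranges.items()}
```

For example, `split_vector(malware_vector)["SectionInfo"]` would return the slice for that group. If the library's dimensions differ from the assumed ones, only `GROUP_DIMS` needs to change.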

naveennamani · 2022-11-11T05:44:26Z

> Unfortunately, the order of how these different features are arranged within the end vector depends on what version of Python you are using. Python 3.7+ specifies that the order iterating through dict must be insertion order, so the order can be determined by reading the source code.

If that is the case, there should be a note on this point in the README.

The following simple code can be used for separating the raw_features into their constituent parts.

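from ember.features import (
    ByteHistogram, ByteEntropyHistogram, StringExtractor, GeneralFileInfo,
    HeaderFileInfo, SectionInfo, ImportsInfo, ExportsInfo, DataDirectories,
)

# Feature groups in concatenation order (dicts keep insertion order on Python 3.7+).
features = {
    'ByteHistogram': ByteHistogram(),
    'ByteEntropyHistogram': ByteEntropyHistogram(),
    'StringExtractor': StringExtractor(),
    'GeneralFileInfo': GeneralFileInfo(),
    'HeaderFileInfo': HeaderFileInfo(),
    'SectionInfo': SectionInfo(),
    'ImportsInfo': ImportsInfo(),
    'ExportsInfo': ExportsInfo(),
    'DataDirectories': DataDirectories(),  # needed to cover the full 2381-length vector
}
features_mapping = {}
feature_vector = []  # <-- load your feature vector here
for k, v in features.items():
    features_mapping[k] = feature_vector[:v.dim]  # this group's slice
    feature_vector = feature_vector[v.dim:]  # drop it from the remainder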
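
Once the vector is split by group, one possible way to narrow down which group is behind a NaN loss is to check each group for non-finite values and very large magnitudes, since unscaled inputs are a common (though not the only) cause of diverging losses. This is a minimal sketch: `summarize_groups` and the toy data are hypothetical and not part of EMBER.

```python
import math


def summarize_groups(groups):
    """For each named group, count non-finite entries and find the largest finite magnitude."""
    report = {}
    for name, values in groups.items():
        finite = [abs(float(v)) for v in values if math.isfinite(v)]
        report[name] = {
            "non_finite": len(values) - len(finite),
            "max_abs": max(finite, default=0.0),
        }
    return report


# Toy data standing in for a real split feature vector (hypothetical values).
example = {
    "ByteHistogram": [0.1, 0.2, 0.0],
    "GeneralFileInfo": [1.5e9, 3.0, float("nan")],
}
report = summarize_groups(example)
for name, stats in report.items():
    print(name, stats)
```

Groups that report non-finite entries or magnitudes many orders above the rest are reasonable first candidates for scaling or clipping before training.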

lkurlandski · 2022-11-11T13:05:33Z

Agreed, it has the potential to be problematic. Upon further research, it appears that CPython >= 3.6 maintains dict insertion order, although the language itself only guarantees it from 3.7 onward. Don't quote me.
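
For code that splits the vector by iterating over a dict, a small guard can make the ordering assumption explicit. This is only a sketch: `dict_order_is_guaranteed` is a hypothetical helper, and the three names are just a sample of the feature groups.

```python
import sys


def dict_order_is_guaranteed(version_info=sys.version_info):
    """True when the language itself guarantees dict insertion order (Python 3.7+)."""
    return tuple(version_info[:2]) >= (3, 7)


names = ["ByteHistogram", "ByteEntropyHistogram", "StringExtractor"]
as_dict = {name: index for index, name in enumerate(names)}

if dict_order_is_guaranteed():
    # On 3.7+ the keys come back in the order they were inserted.
    assert list(as_dict) == names
else:
    print("dict order is not guaranteed here; iterate over an explicit list instead")
```

Keeping the group order in a list of names sidesteps the question on any Python version.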
