Define the information represented on the malware vector? #88

kevin3567 opened this issue Jul 22, 2022 · 3 comments

@kevin3567

Hi,

Is there a place where I can find which malware information is represented in the final 2381-length vector, and how it is laid out? For example:
byte_histogram = malware_vector[:256]
byte_entropy = malware_vector[256:512]
...
I ask because I am trying to train a DNN, but the model performs very poorly (and the loss keeps becoming NaN), so I am trying to find the features causing the issue.

Thanks in advance.

@lkurlandski

lkurlandski commented Sep 21, 2022

This is a good question.

Reading the source code, we can see that 9 different types of features make up the entire vector. The length of each category can be determined from the class definition of the corresponding feature. Unfortunately, the order in which these features are arranged within the final vector depends on the version of Python you are using. Python 3.7+ specifies that iterating through a dict must follow insertion order, so the order can be determined by reading the source code. In earlier versions of Python, dict order was an implementation detail, so the order cannot be relied on.

Here is what I get from reading the code:

ByteHistogram ------------- [0000, 0256)
ByteEntropyHistogram --- [0256, 0512)
StringExtractor ------------- [0512, 0616)
GeneralFileInfo ------------ [0616, 0626)
HeaderFileInfo ------------- [0626, 0688)
SectionInfo ------------------ [0688, 0943)
ImportsInfo ------------------ [0943, 2223)
ExportsInfo ------------------ [2223, 2351)
DataDirectories ------------ [2351, 2381)
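The boundaries can also be derived from the group lengths instead of being typed in by hand. A minimal sketch, assuming the lengths below (hypothetical constants copied manually, not imported from the library; `SectionInfo` is taken as 255 so that the total matches the 2381-length vector from the question) and assuming the groups are concatenated in this order:

```python
from itertools import accumulate

# Assumed group lengths in assumed concatenation order. Verify them against
# the `dim` attribute of each feature class in your installed version.
FEATURE_DIMS = [
    ("ByteHistogram", 256),
    ("ByteEntropyHistogram", 256),
    ("StringExtractor", 104),
    ("GeneralFileInfo", 10),
    ("HeaderFileInfo", 62),
    ("SectionInfo", 255),
    ("ImportsInfo", 1280),
    ("ExportsInfo", 128),
    ("DataDirectories", 30),
]


def feature_slices(dims=FEATURE_DIMS):
    """Map each group name to the slice it occupies in the flat vector."""
    ends = list(accumulate(length for _, length in dims))
    starts = [0] + ends[:-1]
    return {name: slice(s, e) for (name, _), s, e in zip(dims, starts, ends)}


SLICES = feature_slices()
TOTAL_DIM = SLICES["DataDirectories"].stop  # end of the last group
```

With this in place, `malware_vector[SLICES["ByteHistogram"]]` picks out one group, and comparing `TOTAL_DIM` with `len(malware_vector)` is a cheap sanity check that the assumed lengths fit your data.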

@naveennamani

> Unfortunately, the order of how these different features are arranged within the end vector depends on what version of Python you are using. Python 3.7+ specifies that the order iterating through dict must be insertion order, so the order can be determined by reading the source code.

If that is the case, there should be a note on this point in the README.

The following simple code can be used to split the feature vector into its constituent parts.

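# Assumes the feature classes are importable from ember.features.
from ember.features import (
    ByteHistogram, ByteEntropyHistogram, StringExtractor, GeneralFileInfo,
    HeaderFileInfo, SectionInfo, ImportsInfo, ExportsInfo, DataDirectories,
)

# Insertion order must match the order used to build the vector.
features = {
    'ByteHistogram': ByteHistogram(),
    'ByteEntropyHistogram': ByteEntropyHistogram(),
    'StringExtractor': StringExtractor(),
    'GeneralFileInfo': GeneralFileInfo(),
    'HeaderFileInfo': HeaderFileInfo(),
    'SectionInfo': SectionInfo(),
    'ImportsInfo': ImportsInfo(),
    'ExportsInfo': ExportsInfo(),
    'DataDirectories': DataDirectories(),  # ninth group, needed for 2381-length vectors
}
features_mapping = {}
feature_vector = []  # <-- load your feature vector here
for k, v in features.items():
    # Take the next v.dim values for this group, then drop them from the front.
    features_mapping[k] = feature_vector[:v.dim]
    feature_vector = feature_vector[v.dim:]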
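Once the vector is split into groups, each group can be scanned for values that might destabilize training. A minimal sketch with a hypothetical helper, `summarize_groups`, that counts non-finite entries and reports the largest finite magnitude per group. Unscaled features with very large magnitudes are one common cause of NaN losses, though nothing in this thread confirms that is what is happening here:

```python
import math


def summarize_groups(vector, slices):
    """Return {group: (non_finite_count, max_abs_finite_value)} per slice."""
    report = {}
    for name, sl in slices.items():
        values = [float(v) for v in vector[sl]]
        finite = [abs(v) for v in values if math.isfinite(v)]
        report[name] = (len(values) - len(finite), max(finite, default=0.0))
    return report


# Toy data with two hypothetical groups; these are not real EMBER values.
toy_slices = {"a": slice(0, 3), "b": slice(3, 5)}
toy_vector = [0.1, 0.2, 0.3, float("nan"), 1e9]
toy_report = summarize_groups(toy_vector, toy_slices)
```

In the toy data, group `b` contains one NaN and one very large value, so it is the group that would be flagged. Running the same helper over real vectors, group by group, narrows down where to look.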

@lkurlandski

lkurlandski commented Nov 11, 2022

Agreed, it has the potential to be problematic. Upon further research, it appears that CPython 3.6 already maintains dict insertion order as an implementation detail, although it only became a language-level guarantee in Python 3.7. Don't quote me.
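One way to sidestep the question is to avoid depending on dict iteration order at all. A minimal sketch with hypothetical names, using an explicit list of `(name, length)` pairs, since lists are ordered in every Python version (`collections.OrderedDict` would be another option):

```python
import sys

# Hypothetical groups; an ordered list of pairs replaces the dict.
ORDERED_GROUPS = [("first", 2), ("second", 3)]


def split_in_order(vector, groups):
    """Split `vector` into named chunks, following the order of `groups`."""
    parts, start = [], 0
    for name, length in groups:
        parts.append((name, vector[start:start + length]))
        start += length
    if start != len(vector):
        raise ValueError(f"expected {start} values, got {len(vector)}")
    return parts


# Insertion-ordered dicts are a language guarantee only from Python 3.7 onward.
DICT_ORDER_GUARANTEED = sys.version_info >= (3, 7)

parts = split_in_order([1, 2, 3, 4, 5], ORDERED_GROUPS)
```

The length check at the end raises an error if the declared lengths and the vector disagree, so a mismatch is reported instead of being silently sliced around.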
