Define the information represented on the malware vector? #88
Comments
This is a good question. Reading the source code, we can see there are 9 different types of features that make up the entire vector. The length of each feature category can easily be determined by reading its class definition. Unfortunately, the order in which these features are arranged within the final vector depends on which version of Python you are using. Python 3.7+ specifies that iterating through a dict must follow insertion order, so the order can be determined by reading the source code. In previous versions of Python, this was an implementation-level detail, so the order is not guaranteed. Here is what I get from reading the code:
ByteHistogram ------------- [0000, 0256)
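To make the layout explicit without relying on dict ordering, the offsets can be derived from an ordered list of (name, dim) pairs. This is a minimal sketch: `feature_ranges` is a hypothetical helper, not part of the library, and only two widths are filled in (256 for ByteHistogram as stated above, and 256 for ByteEntropyHistogram as assumed from the example in the question). The remaining groups would need their `dim` read from the class definitions.

```python
def feature_ranges(groups):
    """Return (name, start, end) triples with half-open [start, end) ranges.

    `groups` is an ordered sequence of (name, dim) pairs; its order is the
    order in which the groups are concatenated into the final vector.
    """
    ranges = []
    start = 0
    for name, dim in groups:
        ranges.append((name, start, start + dim))
        start += dim
    return ranges


# Only these two widths come from this thread (the second is an assumption
# based on the question's example). Append the other groups, in the
# extractor's order, once their dims are known.
known_groups = [
    ("ByteHistogram", 256),
    ("ByteEntropyHistogram", 256),
]

for name, start, end in feature_ranges(known_groups):
    print(f"{name:<22} [{start:04d}, {end:04d})")
```

Returning a list rather than a dict keeps the result independent of the interpreter's dict behaviour.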
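If that is the case, there should be a note on this point in the README. The following simple code can be used to separate the feature vector into its constituent parts:
# Assumption: the feature classes are importable from ember.features;
# adjust the import if your installation differs.
from ember.features import (
    ByteHistogram, ByteEntropyHistogram, StringExtractor, GeneralFileInfo,
    HeaderFileInfo, SectionInfo, ImportsInfo, ExportsInfo,
)

# The insertion order here must match the order used by the extractor.
# Add any further feature groups your feature version defines.
features = {
    'ByteHistogram': ByteHistogram(),
    'ByteEntropyHistogram': ByteEntropyHistogram(),
    'StringExtractor': StringExtractor(),
    'GeneralFileInfo': GeneralFileInfo(),
    'HeaderFileInfo': HeaderFileInfo(),
    'SectionInfo': SectionInfo(),
    'ImportsInfo': ImportsInfo(),
    'ExportsInfo': ExportsInfo()
}
features_mapping = {}
feature_vector = []  # <-- load your feature vector here
for k, v in features.items():
    # Take the next v.dim values for this group, then drop them from the front.
    features_mapping[k] = feature_vector[:v.dim]
    feature_vector = feature_vector[v.dim:]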
Agreed, it has the potential to be problematic. Upon further research, it appears that CPython >= 3.6 maintains dict insertion order, although this is an implementation detail in 3.6 and is only guaranteed by the language from 3.7. Don't quote me.
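Since the language-level guarantee only exists from Python 3.7 onwards, code that relies on dict order can check the interpreter version up front. A minimal sketch; `dict_order_guaranteed` is a hypothetical helper name, not something from the project:

```python
import sys
import warnings


def dict_order_guaranteed(version_info=sys.version_info):
    """Return True if the language guarantees dict insertion order (3.7+)."""
    return tuple(version_info[:2]) >= (3, 7)


if not dict_order_guaranteed():
    # On older interpreters, offsets derived from dict iteration are unreliable.
    warnings.warn(
        "dict iteration order is not guaranteed on this Python version; "
        "feature offsets derived from a dict may be wrong."
    )
```

A warning rather than an error keeps the code usable on CPython 3.6, where the order happens to be preserved in practice.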
Hi,
I am wondering if there is a place where I can find what malware information is represented in the final 2381-length vector, and how. For example:
byte_histogram = malware_vector[:256]
byte_entropy = malware_vector[256:512]
...
The reason for this is that I am trying to train a DNN, but the performance of the model is very poor (and the loss keeps becoming NaN). Thus, I am trying to find the features causing the issue.
Thanks in advance.
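One way to narrow down which feature group is behind a NaN loss is to summarise each slice of the vector separately. This is a minimal sketch, assuming the slice boundaries are already known (only the two ranges from the example above are used) and that the vector is a plain sequence of floats. `summarize_groups` is a hypothetical helper, and the toy vector is made up for illustration.

```python
import math


def summarize_groups(vector, ranges):
    """Report non-finite counts and the finite value range of each group.

    `ranges` maps a group name to a half-open (start, end) index pair.
    """
    report = {}
    for name, (start, end) in ranges.items():
        chunk = vector[start:end]
        finite = [x for x in chunk if math.isfinite(x)]
        report[name] = {
            "non_finite": len(chunk) - len(finite),
            "min": min(finite) if finite else None,
            "max": max(finite) if finite else None,
        }
    return report


# Toy vector: the second group contains a NaN and one very large value.
malware_vector = [0.0] * 512
malware_vector[300] = float("nan")
malware_vector[400] = 1e9

ranges = {
    "byte_histogram": (0, 256),
    "byte_entropy": (256, 512),
}
report = summarize_groups(malware_vector, ranges)
for name, stats in report.items():
    print(name, stats)
```

Groups that show non-finite entries or a very wide value range are reasonable first candidates to inspect or rescale before training.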