Skip to content

Train a set of machine-learning models on known goodware/malware. Using the models, predict the likelihood that an unknown sample is malware.

License

Notifications You must be signed in to change notification settings

disruptive-solutions/arbiter

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Arbiter

Train and use a set of machine-learning models on known clean PE files and known malware. Once trained, the models are used to predict the likelihood that an unknown sample is malware.

It should be noted that this is only a library used to create and use models. Arbiter does not come with premade models, malware, or goodware. It is up to the user to find malware/goodware. See below for some recommended repositories.

Recommended Data Sets

Malware

Finding malware is pretty easy these days.

Goodware

Our strategy for collecting clean samples was to use the PE files that come in Windows as well as install some common software. We used Ninite to create a mega-installer and then pulled all PE files from %ProgramFiles% and %ProgramFiles(x86)%. If you think of better ways, let us know.

Research

Usage

To install Arbiter, simply run python setup.py install (preferrably in a virtual environment)

Train

arbiter train -m /path/to/malware/* -g /path/to/goodware/*

Using 7 data-points from the PE structure (using pefile):

  • Debug size
  • Image version
  • Import relative virtual address
  • Export size
  • Resource size
  • Number of sections
  • Virtual size of the second section

The train module creates the following models:

  • sklearn.linear_model.LogisticRegression
  • sklearn.linear_model.LogisticRegressionCV
  • sklearn.linear_model.SGDClassifier
  • sklearn.naive_bayes.GaussianNB
  • sklearn.svm.LinearSVC
  • xgboost.sklearn.XGBClassifier

Finally, pickle and write the models (defaults ./arbiter_models.pickle).

Predict

arbiter /path/to/unknown/samples/*

Using the models from the train subcommand, predict the liklihood that the unknown samples are malicious. By default, this step writes a JSON file: ./arbiter_results.json. The output of the predict looks like this:

{
 "LogisticRegression": {"0": 0.6189994176326301}, 
 "LogisticRegressionCV": {"0": 0.35073582481230403}, 
 "XGBClassifier": {"0": 0.5281684994697571}, 
 "GaussianNB": {"0": 0.6666666666666666}, 
 "SGDClassifier": {"0": 0.0}, 
 "LinearSVC": {"0": 1}
}

The keys of the above dictionaries refer to the position of each sample in the input list. This is due to change because we feel that it is not intuitive.

Future Goals

  • Group similar samples
  • Ability to analyze more than just Windows PE

About

Train a set of machine-learning models on known goodware/malware. Using the models, predict the likelihood that an unknown sample is malware.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages