diff --git a/README.md b/README.md
index 29cfea26..91eeb9c3 100644
--- a/README.md
+++ b/README.md
@@ -6,83 +6,88 @@
 Introduction
 ------------
 MultiScanner is a file analysis framework that assists the user in evaluating a set of files by automatically running a suite of tools for the user and aggregating the output.
-Tools can be custom built python scripts, web APIs, software running on another machine, etc.
+Tools can be custom built Python scripts, web APIs, software running on another machine, etc.
 Tools are incorporated by creating modules that run in the MultiScanner framework.
 
 Modules are designed to be quickly written and easily incorporated into the framework.
 Currently written and maintained modules are related to malware analytics, but the framework is not limited to that
-scope. For a list of modules you can look in [modules](modules), descriptions and config
-options can be found in [docs/modules.md](docs/modules.md)
+scope. For a list of modules, look in [modules/](modules). Descriptions and config
+options can be found on the [Analysis Modules](http://multiscanner.readthedocs.io/en/latest/use/use-analysis-mods.html) page.
 
-Requirements
-------------
-Python 3.6 is recommended. Compatibility with 2.7+ and
-3.4+ is supported but not as thoroughly maintained and tested. Please submit an issue
-or a pull request fixing any issues found with other versions of Python.
+MultiScanner also supports a distributed workflow for sample storage, analysis, and
+report viewing. This functionality includes a web interface, a REST API, a distributed
+file system (GlusterFS), distributed report storage / searching (Elasticsearch), and
+distributed task management (Celery / RabbitMQ). Please see [Architecture](http://multiscanner.readthedocs.io/en/latest/arch.html) for more details.
 
+Usage
+-----
-An installer script is included in the project [install.sh](), which
-installs the prerequisites on most systems.
+MultiScanner can be used via a command-line interface, a Python API, or a
+distributed system with a web interface. See the documentation for more detailed
+information on [installation](http://multiscanner.readthedocs.io/en/latest/install.html) and [usage](http://multiscanner.readthedocs.io/en/latest/use/index.html).
 
-Installation
-------------
-### MultiScanner ###
-If you're running on a RedHat or Debian based linux distribution you should try and run
-[install.sh](). Otherwise the required python packages are defined in
-[requirements.txt]().
-
-MultiScanner must have a configuration file to run. Generate the MultiScanner default
-configuration by running `python multiscanner.py init` after cloning the repository.
-This command can be used to rewrite the configuration file to its default state or,
-if new modules have been written, to add their configuration to the configuration
-file.
-
-### Analytic Machine ###
-Default modules have the option to be run locally or via SSH. The development team
-runs MultiScanner on a Linux host and hosts the majority of analytical tools on
-a separate Windows machine. The SSH server used in this environment is freeSSHd
-from .
-
-A network share accessible to both the MultiScanner and the Analytic Machines is
-required for the multi-machine setup. Once configured, the network share path must
-be identified in the configuration file, config.ini. To do this, set the `copyfilesto`
-option under `[main]` to be the mount point on the system running MultiScanner.
-Modules can have a `replacement path` option, which is the network share mount point
-on the analytic machine.
-
-Module Writing
---------------
-Modules are intended to be quickly written and incorporated into the framework.
-A finished module must be placed in the modules folder before it can be used. The
-configuration file does not need to be manually updated. See [docs/module\_writing.md]()
-for more information.
-
-Module Configuration
--------------------- 
-Modules are configured within the configuration file, config.ini. See
-[docs/modules.md]() for more information.
-
-Python API
-----------
-MultiScanner can be incorporated as a module in another projects. Below is a simple
-example of how to import MultiScanner into a Python script.
+### Command-Line ###
+
+Install Python (2.7 or 3.4+) if you haven't already.
+
+Then run the following:
+
+``` bash
+$ git clone https://github.com/mitre/multiscanner.git
+$ cd multiscanner
+$ sudo -HE ./install.sh
+$ python multiscanner.py init
+```
+
+This will generate a default configuration for you. Check `config.ini` to see what
+modules are enabled. See [Configuration](http://multiscanner.readthedocs.io/en/latest/install.html#configuration) for more information.
+
+Now you can scan a file (substituting the actual file you want to scan for `<file>`):
+
+``` bash
+$ python multiscanner.py <file>
+```
+
+You can run the following to get a list of all of MultiScanner's command-line options:
+
+``` bash
+$ python multiscanner.py --help
+```
+
+**Note**: If you are not on a RedHat- or Debian-based Linux distribution, instead of
+running the `install.sh` script, install pip (if you haven't already) and run the
+following:
+
+``` bash
+$ pip install -r requirements.txt
+```
+
+### Python API ###
 
 ``` python
 import multiscanner
-output = multiscanner.multiscan(FileList)
-Results = multiscanner.parse_reports(output, python=True)
+multiscanner.config_init(filepath)
+output = multiscanner.multiscan(file_list)
+results = multiscanner.parse_reports(output, python=True)
 ```
 
-Results is a dictionary object where each key is a filename of a scanned file.
+`multiscanner.config_init(filepath)` creates a default configuration file at the
+location given by `filepath`; `results` is a dictionary where each key is the
+filename of a scanned file.
+
+### Web Interface ###
+
+Install the latest versions of [Docker](https://docs.docker.com/engine/installation/)
+and [Docker Compose](https://docs.docker.com/compose/install/) if you haven't already.
+
+``` bash
+$ git clone https://github.com/mitre/multiscanner.git
+$ cd multiscanner
+$ docker-compose up
+```
-
-`multiscanner.config_init(filepath)` will create a default configuration file at
-the location defined by filepath.
+
+You may have to wait a while until all the services are up and running, but then you
+can use the web interface by going to `http://localhost:8000` in your web browser.
-
-Distributed MultiScanner
------------------------- 
-MultiScanner is also part of a distributed, scalable file analysis framework, complete with distributed task management, web interface, REST API, and report storage. Please set [Distributed Multiscanner]() for more details. Additionally, we distribute a standalone Docker container with the base set of features (web UI, REST API, ElasticSearch node) as an introduction to the capabilities of this Distributed MultiScanner. See [here]() for more details. (*Note*: this standalone container should not be used in production, it is simply a primer on what a full installation would look like).
+
+*Note*: this should not be used in production; it is simply an introduction to what a
+full installation would look like.
See [here](http://multiscanner.readthedocs.io/en/latest/install.html#standalone-docker-installation) for more details. -Other Reading +Documentation ------------- -For more information on module configuration or writing modules check the -[docs]() folder. +For more information, see the [full documentation](http://multiscanner.readthedocs.io/) on ReadTheDocs. diff --git a/docs/Makefile b/docs/Makefile new file mode 100644 index 00000000..645e3342 --- /dev/null +++ b/docs/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line. +SPHINXOPTS = +SPHINXBUILD = sphinx-build +SPHINXPROJ = MultiScanner +SOURCEDIR = . +BUILDDIR = _build + +# Put it first so that "make" without argument is like "make help". +help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) \ No newline at end of file diff --git a/docs/_static/img/Selection_001.png b/docs/_static/img/Selection_001.png new file mode 100644 index 00000000..8034bbec Binary files /dev/null and b/docs/_static/img/Selection_001.png differ diff --git a/docs/_static/img/Selection_002.png b/docs/_static/img/Selection_002.png new file mode 100644 index 00000000..2e39036c Binary files /dev/null and b/docs/_static/img/Selection_002.png differ diff --git a/docs/_static/img/Selection_003.png b/docs/_static/img/Selection_003.png new file mode 100644 index 00000000..3c2a379c Binary files /dev/null and b/docs/_static/img/Selection_003.png differ diff --git a/docs/_static/img/Selection_004.png b/docs/_static/img/Selection_004.png new file mode 100644 index 00000000..7e2a1694 Binary files /dev/null and b/docs/_static/img/Selection_004.png differ diff --git a/docs/_static/img/Selection_005.png b/docs/_static/img/Selection_005.png new file mode 100644 index 00000000..68b418bf Binary files /dev/null and b/docs/_static/img/Selection_005.png differ diff --git a/docs/_static/img/Selection_006.png b/docs/_static/img/Selection_006.png new file mode 100644 index 00000000..f7120ce1 Binary files /dev/null and b/docs/_static/img/Selection_006.png differ diff --git a/docs/_static/img/Selection_007.png b/docs/_static/img/Selection_007.png new file mode 100644 index 00000000..71d0b651 Binary files /dev/null and b/docs/_static/img/Selection_007.png differ diff --git a/docs/_static/img/Selection_008.png b/docs/_static/img/Selection_008.png new file mode 100644 index 00000000..e9689555 Binary files /dev/null and b/docs/_static/img/Selection_008.png differ diff --git a/docs/_static/img/Selection_009.png b/docs/_static/img/Selection_009.png new file mode 100644 index 00000000..a6c48562 Binary files /dev/null and b/docs/_static/img/Selection_009.png differ diff --git a/docs/_static/img/Selection_010.png b/docs/_static/img/Selection_010.png new file mode 100644 index 00000000..fc0deef9 Binary files /dev/null and b/docs/_static/img/Selection_010.png differ diff --git a/docs/_static/img/Selection_011.png b/docs/_static/img/Selection_011.png new file mode 100644 index 00000000..f5fb3ba1 Binary files /dev/null and b/docs/_static/img/Selection_011.png differ diff --git a/docs/_static/img/Selection_012.png b/docs/_static/img/Selection_012.png new file mode 100644 index 00000000..ca33a0af Binary files /dev/null and 
b/docs/_static/img/Selection_012.png differ diff --git a/docs/_static/img/Selection_013.png b/docs/_static/img/Selection_013.png new file mode 100644 index 00000000..070fd454 Binary files /dev/null and b/docs/_static/img/Selection_013.png differ diff --git a/docs/_static/img/Selection_014.png b/docs/_static/img/Selection_014.png new file mode 100644 index 00000000..279a1ce9 Binary files /dev/null and b/docs/_static/img/Selection_014.png differ diff --git a/docs/_static/img/Selection_015.png b/docs/_static/img/Selection_015.png new file mode 100644 index 00000000..8efb7f1d Binary files /dev/null and b/docs/_static/img/Selection_015.png differ diff --git a/docs/_static/img/Selection_016.png b/docs/_static/img/Selection_016.png new file mode 100644 index 00000000..34313206 Binary files /dev/null and b/docs/_static/img/Selection_016.png differ diff --git a/docs/_static/img/Selection_017.png b/docs/_static/img/Selection_017.png new file mode 100644 index 00000000..9f09cceb Binary files /dev/null and b/docs/_static/img/Selection_017.png differ diff --git a/docs/_static/img/Selection_018.png b/docs/_static/img/Selection_018.png new file mode 100644 index 00000000..1d822a9d Binary files /dev/null and b/docs/_static/img/Selection_018.png differ diff --git a/docs/_static/img/Selection_019.png b/docs/_static/img/Selection_019.png new file mode 100644 index 00000000..4b11929c Binary files /dev/null and b/docs/_static/img/Selection_019.png differ diff --git a/docs/_static/img/Selection_020.png b/docs/_static/img/Selection_020.png new file mode 100644 index 00000000..66b5c5fc Binary files /dev/null and b/docs/_static/img/Selection_020.png differ diff --git a/docs/_static/img/Selection_021.png b/docs/_static/img/Selection_021.png new file mode 100644 index 00000000..1b469bdd Binary files /dev/null and b/docs/_static/img/Selection_021.png differ diff --git a/docs/_static/img/Selection_022.png b/docs/_static/img/Selection_022.png new file mode 100644 index 00000000..16baa175 Binary files /dev/null and b/docs/_static/img/Selection_022.png differ diff --git a/docs/_static/img/Selection_023.png b/docs/_static/img/Selection_023.png new file mode 100644 index 00000000..76e37bd8 Binary files /dev/null and b/docs/_static/img/Selection_023.png differ diff --git a/docs/_static/img/Selection_024.png b/docs/_static/img/Selection_024.png new file mode 100644 index 00000000..2a5c7889 Binary files /dev/null and b/docs/_static/img/Selection_024.png differ diff --git a/docs/_static/img/arch1.png b/docs/_static/img/arch1.png new file mode 100644 index 00000000..bbf52ea1 Binary files /dev/null and b/docs/_static/img/arch1.png differ diff --git a/docs/_static/img/arch2.png b/docs/_static/img/arch2.png new file mode 100644 index 00000000..e1568c97 Binary files /dev/null and b/docs/_static/img/arch2.png differ diff --git a/docs/_static/img/overview.png b/docs/_static/img/overview.png new file mode 100644 index 00000000..56c7c3a7 Binary files /dev/null and b/docs/_static/img/overview.png differ diff --git a/docs/_static/theme_overrides.css b/docs/_static/theme_overrides.css new file mode 100644 index 00000000..63ee6cc7 --- /dev/null +++ b/docs/_static/theme_overrides.css @@ -0,0 +1,13 @@ +/* override table width restrictions */ +@media screen and (min-width: 767px) { + + .wy-table-responsive table td { + /* !important prevents the common CSS stylesheets from overriding + this as on RTD they are loaded after this stylesheet */ + white-space: normal !important; + } + + .wy-table-responsive { + overflow: visible !important; + } +} 
diff --git a/docs/analytics.md b/docs/analytics.md deleted file mode 100644 index 5eb3a446..00000000 --- a/docs/analytics.md +++ /dev/null @@ -1,47 +0,0 @@ -# Analytics # -Enabling analytics and advanced queries is the primary advantage of running -several tools against a sample, extracting as much information as possible, and -storing the output in a common datastore. - -The following are some example types of analytics and queries that may be of -interest: - -- cluster samples -- outlier samples -- samples for deep-dive analysis -- gaps in current toolset -- machine learning analytics on tool outputs -- others - -## ssdeep Comparison ## -Fuzzy hashing is an effective method to identify similar files based on common -byte strings despite changes in the byte order and strcuture of the files. -[ssdeep](https://ssdeep-project.github.io/ssdeep/index.html) provides a fuzzy -hash implementation and provides the capability to compare hashes. - -Comparing ssdeep hashes at scale is a challenge. [[1]](https://www.virusbulletin.com/virusbulletin/2015/11/optimizing-ssdeep-use-scale/) -originally described a method for comparing ssdeep hashes at scale. - -The ssdeep analytic computes ```ssdeep.compare``` for all samples where the -result is non-zero and provides the capability to return all samples clustered -based on the ssdeep hash. - -### Elasticsearch ### -When possible, it can be effective to push work to the Elasticsearch cluster -which support horizontal scaling. For the ssdeep comparison, Elasticsearch -[NGram Tokenizers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html) -are used to compute 7-grams of the chunk and double-chunk portions -of the ssdeep hash as described here [[2]](http://www.intezer.com/intezer-community-tip-ssdeep-comparisons-with-elasticsearch/). -This prevents ever comparing two ssdeep hashes where the result will be zero. - -### Python ### -Because we need to compute ```ssdeep.compare```, the ssdeep analytic cannot be -done entirely in Elasticsearch. Python is used to query Elasicsearch, compute -```ssdeep.compare``` on the results, and update the documents in Elasticsearch. - -### Deployment ### -[celery beat](http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html) -is used to schedule and kick off the ssdeep comparison task nightly at 2am -local time, when the system is experiencing less load from users. This ensures -that the analytic will be run on all samples without adding an exorbinant load -to the system. diff --git a/docs/arch.rst b/docs/arch.rst new file mode 100644 index 00000000..abda96bf --- /dev/null +++ b/docs/arch.rst @@ -0,0 +1,104 @@ +Architecture +============ + +High-level Architecture +----------------------- +There are seven primary components of the MultiScanner architecture, as described below and illustrated in the associated diagram. + +.. figure:: _static/img/arch1.png + :align: center + :scale: 45 % + :alt: MultiScanner Architecture + + MultiScanner Architecture +.. + +**Web Frontend** + +The web application runs on `Flask `_, uses `Bootstrap `_ and `jQuery `_, and is served via Apache. It is essentially an aesthetic wrapper around the REST API. All data and services provided are also available by querying the REST API. + + +**REST API** + +The REST API is also powered by Flask and served via Apache. It has an underlying PostgreSQL database to facilitate task tracking. Additionally, it acts as a gateway to the backend Elasticsearch document store. 
Searches entered into the web UI will be routed through the REST API and passed to the Elasticsearch cluster. This abstracts the complexity of querying Elasticsearch and gives the user a simple web interface to work with.
+
+**Task Queue**
+
+We use Celery as our distributed task queue.
+
+**Task Tracking**
+
+PostgreSQL is our task management database. It is here that we keep track of scan times, samples, and the status of tasks (pending, complete, failed).
+
+**Distributed File System**
+
+GlusterFS is our distributed file system. Each component that needs access to the raw samples mounts the share via FUSE. We selected GlusterFS because it is more performant in our use case -- storing a large number of small samples -- than a technology like HDFS would be.
+
+**Worker Nodes**
+
+The worker nodes are Celery clients running the MultiScanner Python application. Additionally, we implemented some batching within Celery to improve the performance of our worker nodes (which operate better at scale).
+
+A worker node will wait until there are 100 samples in its queue or 60 seconds have passed (whichever happens first) before kicking off its scan (these values are configurable). All worker nodes have the GlusterFS mounted, which gives access to the samples for scanning. In our setup, we co-locate the worker nodes with the GlusterFS nodes in order to reduce the network load of workers pulling samples from GlusterFS.
+
+**Report Storage**
+
+We use Elasticsearch to store the results of our file scans. This is where the true power of this system lies. Elasticsearch allows for performant, full-text searching across all our reports and modules. This allows fast access to interesting details from your malware analysis tools, pivoting between samples, and powerful analytics on report output.
+
+.. _complete-workflow:
+
+Complete Workflow
+-----------------
+Each step of the MultiScanner workflow is described below the diagram.
+
+.. figure:: _static/img/arch2.png
+   :align: center
+   :scale: 50 %
+   :alt: MultiScanner Workflow
+
+   MultiScanner Workflow
+..
+
+1. The user submits a sample file through the Web UI (or REST API)
+
+2. The Web UI (or REST API):
+
+   a. Stores the file in the distributed file system (GlusterFS)
+   b. Places the task on the task queue (Celery)
+   c. Adds an entry to the task management database (PostgreSQL)
+
+3. A worker node:
+
+   a. Pulls the task from the Celery task queue
+   b. Retrieves the corresponding sample file from the GlusterFS via its SHA256 value
+   c. Analyzes the file
+   d. Generates a JSON blob and indexes it into Elasticsearch
+   e. Updates the task management database with the task status ("complete")
+
+4. The Web UI (or REST API):
+
+   a. Gets the report ID associated with the Task ID
+   b. Pulls the analysis report from the Elasticsearch datastore
+
+Analysis
+--------
+Analysis tools are integrated into MultiScanner via modules running in the MultiScanner framework. Tools can be custom built Python scripts, web APIs, or software applications running on different machines. Categories of existing modules include AV scanning, sandbox detonation, metadata extraction, and signature scanning. Modules can be enabled/disabled via a configuration file. Details are provided in the :ref:`analysis-modules` section.
+
+Analytics
+---------
+Enabling analytics and advanced queries is the primary advantage of running several tools against a sample, extracting as much information as possible, and storing the output in a common datastore.
For example, the following types of analytics and queries are possible: + +* cluster samples +* outlier samples +* samples for deep-dive analysis +* gaps in current toolset +* machine learning analytics on tool outputs + +Reporting +--------- +Analysis data captured or generated by MultiScanner is accessible in three ways: + +* MultiScanner Web User Interface – Content in the Elasticsearch database is viewable through the Web UI. See :ref:`web-ui` section for details. + +* MultiScanner Reports – MultiScanner reports reflect the content of the MultiScanner database and are provided in raw JSON and PDF formats. These reports capture all content associated with a sample. + +* STIX-based reports *will soon be* available in multiple formats: JSON, PDF, HTML, and text. diff --git a/docs/conf.py b/docs/conf.py new file mode 100644 index 00000000..216e6993 --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,172 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# +# MultiScanner documentation build configuration file, created by +# sphinx-quickstart on Fri Dec 22 13:35:06 2017. +# +# This file is execfile()d with the current directory set to its +# containing dir. +# +# Note that not all possible configuration values are present in this +# autogenerated file. +# +# All configuration values have a default; values that are commented out +# serve to show the default. + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +# import os +# import sys +# sys.path.insert(0, os.path.abspath('.')) + + +# -- General configuration ------------------------------------------------ + +# If your documentation needs a minimal Sphinx version, state it here. +# +# needs_sphinx = '1.0' + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [] + +# Add any paths that contain templates here, relative to this directory. +templates_path = ['_templates'] + +# The suffix(es) of source filenames. +# You can specify multiple suffix as a list of string: +# +# source_suffix = ['.rst', '.md'] +source_suffix = '.rst' + +# The master toctree document. +master_doc = 'index' + +# General information about the project. +project = 'MultiScanner' +copyright = '2017, MITRE' +author = 'MITRE' + +# The version info for the project you're documenting, acts as replacement for +# |version| and |release|, also used in various other places throughout the +# built documents. +# +# The short X.Y version. +version = '1.0' +# The full version, including alpha/beta/rc tags. +release = '1.0.0' + +# The language for content autogenerated by Sphinx. Refer to documentation +# for a list of supported languages. +# +# This is also used if you do content translation via gettext catalogs. +# Usually you set "language" from the command line for these cases. +language = None + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This patterns also effect to html_static_path and html_extra_path +exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] + +# The name of the Pygments (syntax highlighting) style to use. +pygments_style = 'sphinx' + +# If true, `todo` and `todoList` produce output, else they produce nothing. 
+todo_include_todos = False + + +# -- Options for HTML output ---------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +# html_theme = 'alabaster' +html_theme = 'sphinx_rtd_theme' + +# Theme options are theme-specific and customize the look and feel of a theme +# further. For a list of options available for each theme, see the +# documentation. +# +# html_theme_options = {} + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". +html_static_path = ['_static'] +html_context = { + 'css_files': [ + '_static/theme_overrides.css', # override wide tables in RTD theme + ], +} + +# Custom sidebar templates, must be a dictionary that maps document names +# to template names. +# +# This is required for the alabaster theme +# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars +# html_sidebars = { +# '**': [ +# 'relations.html', # needs 'show_related': True theme option to display +# 'searchbox.html', +# ] +# } + + +# -- Options for HTMLHelp output ------------------------------------------ + +# Output file base name for HTML help builder. +htmlhelp_basename = 'MultiScannerdoc' + + +# -- Options for LaTeX output --------------------------------------------- + +latex_elements = { + # The paper size ('letterpaper' or 'a4paper'). + # + # 'papersize': 'letterpaper', + + # The font size ('10pt', '11pt' or '12pt'). + # + # 'pointsize': '10pt', + + # Additional stuff for the LaTeX preamble. + # + # 'preamble': '', + + # Latex figure (float) alignment + # + # 'figure_align': 'htbp', +} + +# Grouping the document tree into LaTeX files. List of tuples +# (source start file, target name, title, +# author, documentclass [howto, manual, or own class]). +latex_documents = [ + (master_doc, 'MultiScanner.tex', 'MultiScanner Documentation', + 'MITRE', 'manual'), +] + + +# -- Options for manual page output --------------------------------------- + +# One entry per manual page. List of tuples +# (source start file, name, description, authors, manual section). +man_pages = [ + (master_doc, 'multiscanner', 'MultiScanner Documentation', + [author], 1) +] + + +# -- Options for Texinfo output ------------------------------------------- + +# Grouping the document tree into Texinfo files. List of tuples +# (source start file, target name, title, author, +# dir menu entry, description, category) +texinfo_documents = [ + (master_doc, 'MultiScanner', 'MultiScanner Documentation', + author, 'MultiScanner', 'One line description of project.', + 'Miscellaneous'), +] diff --git a/docs/custom/analysis-module.rst b/docs/custom/analysis-module.rst new file mode 100644 index 00000000..dd11a8dd --- /dev/null +++ b/docs/custom/analysis-module.rst @@ -0,0 +1,67 @@ +Developing an Analysis Module +============================= + +Modules are intended to be quickly written and incorporated into the MultiScanner framework. A module must be in the modules folder for it to be used on the next run. The configuration file does not need to be manually updated. + +See this :ref:`example`. + +Mandatory Functions +------------------- + +When writing a new module, two mandatory functions must be defined: check() and scan(). Additional functions can be written as required. 
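+
+As a sketch, a bare-bones module has this shape (illustrative only; the required behavior of each function is described below, and the full :ref:`example` shows a complete module with REQUIRES handling):
+
+.. code-block:: python
+
+    TYPE = "Example"
+    NAME = "skeleton"
+    DEFAULTCONF = {'ENABLED': True}
+
+    def check(conf=DEFAULTCONF):
+        # Tell the framework whether scan() should run
+        return conf['ENABLED']
+
+    def scan(filelist, conf=DEFAULTCONF):
+        # One (filename, result) tuple per file the module has results for
+        results = [(fname, "example result") for fname in filelist]
+        # Name and Type are required metadata
+        metadata = {'Name': NAME, 'Type': TYPE}
+        return (results, metadata)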
+
+check()
+^^^^^^^
+
+The check() function tests whether or not the scan function should be run.
+
+**Inputs:** There are two supported argument sets with this function: ``check()`` and ``check(conf=DEFAULTCONF)``. If a module has a global variable DEFAULTCONF, the second argument set is required.
+
+**Outputs:** The return value of the check() function is a boolean (True or False). A True return value indicates the scan() function should be run; a False return value indicates the module should no longer be run.
+
+scan()
+^^^^^^
+The scan() function performs the analytic and returns the results.
+
+**Inputs:** There are two supported argument sets with this function: ``scan(filelist)`` and ``scan(filelist, conf=DEFAULTCONF)``. If a module has a global variable DEFAULTCONF, the second argument set is required.
+
+**Outputs:** There are two return values of the scan() function: Results and Metadata (i.e., ``return (Results, Metadata)``).
+
+- **Results** is a list of tuples, the tuple values being the filename and the corresponding scan results (e.g., ``[("file1.exe", "Executable"), ("file2.jpg", "Picture")]``).
+
+- **Metadata** is a dictionary of metadata information from the module. There are two required pieces of metadata: ``Name`` and ``Type``. ``Name`` is the name of the module and will be used in the report. ``Type`` is what type of module it is (e.g., Antivirus, content detonation). This information is used for a grouping feature in the report generation and provides context to a newly written module. Optionally, metadata information can be disabled and not be included in the report by setting ``metadata["Include"] = False``.
+
+Special Globals
+---------------
+
+There are two global variables that, when present, affect the way the module is called.
+
+**DEFAULTCONF** - This is a dictionary of configuration settings. When set, the settings will be written to the configuration file, making it user-editable. The configuration object will be passed to the module's check() and scan() functions and must be an argument in both functions.
+
+**REQUIRES** - This is a list of analysis results required by a module. For example, ``REQUIRES = ['MD5']`` will be set to the output from the module MD5.py. An :ref:`example` is provided.
+
+Module Interface
+----------------
+
+The module interface is a class that is put into each module as it is run. This allows for several features to be added for interacting with the framework at runtime. It is injected as `multiscanner` in the global namespace.
+
+Variables
+^^^^^^^^^
+
+* ``write_dir`` - This is a directory path that the module can write to. This will be unique for each run.
+* ``run_count`` - This is an integer that increments for each subscan that is called. It is useful for preventing infinite recursion.
+
+Functions
+^^^^^^^^^
+
+* ``apply_async(func, args=(), kwds={}, callback=None)`` - This mirrors multiprocessing.Pool.apply_async and returns a `multiprocessing.pool.AsyncResult `_. The pool is shared by all modules.
+* ``scan_file(file_path, from_filename)`` - This will scan a file that was found inside another file. `file_path` should be the extracted file on the filesystem (you can write it to a path under `multiscanner.write_dir`). `from_filename` is the file it was extracted from.
+
+Configuration
+-------------
+
+If a module requires configuration, the DEFAULTCONF global variable must be defined. This variable is passed to both check() and scan(). The configuration will be read from the configuration file if it is present; if the file is not present, it will be written into the configuration file.
+
+If ``replacement path`` is set in the configuration, the module will receive file names, with the folder path replaced with the variable's value. This is useful for analytics which are run on a remote machine.
+
+By default, ConfigParser reads everything in as a string. Before options are passed to the module, ``ast.literal_eval()`` is run on each option; if a string is not returned when expected, this is why. This does mean that the correct Python type will be returned instead of all strings.
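+
+For illustration, the conversion works roughly like this (a sketch of the mechanism, not MultiScanner's actual parsing code):
+
+.. code-block:: python
+
+    import ast
+
+    # ConfigParser hands every option back as a string
+    raw = "['.exe', '.dll']"
+    value = ast.literal_eval(raw)
+    print(type(value), value)  # <class 'list'> ['.exe', '.dll']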
diff --git a/docs/custom/analytics.rst b/docs/custom/analytics.rst
new file mode 100644
index 00000000..a6bf4def
--- /dev/null
+++ b/docs/custom/analytics.rst
@@ -0,0 +1,14 @@
+Developing an Analytic
+======================
+
+Enabling analytics and advanced queries is the primary advantage of running several tools against a sample, extracting as much information as possible, and storing the output in a common datastore. For example, the following types of analytics and queries might be of interest:
+
+- cluster samples
+- outlier samples
+- samples for deep-dive analysis
+- gaps in current toolset
+- machine learning analytics on tool outputs
+
+Analytic development is currently ad hoc. Until interfaces are created to standardize development, the :ref:`analytics` section might prove useful; it contains development details of the **ssdeep** analytic.
+
+Here's the `ssdeep code `_ to use as a reference for how one might implement an analytic.
\ No newline at end of file
diff --git a/docs/custom/example.rst b/docs/custom/example.rst
new file mode 100644
index 00000000..107a0821
--- /dev/null
+++ b/docs/custom/example.rst
@@ -0,0 +1,55 @@
+.. _example:
+
+Example Module
+==============
+
+.. code-block:: python
+
+    from __future__ import (division, absolute_import, with_statement,
+                            print_function, unicode_literals)
+
+    TYPE = "Example"
+    NAME = "include example"
+    REQUIRES = ["libmagic", "MD5"]
+    DEFAULTCONF = {
+        'ENABLED': True,
+    }
+
+    def check(conf=DEFAULTCONF):
+        # If the config disabled the module don't run
+        if not conf['ENABLED']:
+            return False
+        # If one of the required modules failed, don't run
+        if None in REQUIRES:
+            return False
+        return True
+
+
+    def scan(filelist, conf=DEFAULTCONF):
+        # Define our results array
+        results = []
+        # Pull out the libmagic results and metadata
+        libmagicresults, libmagicmeta = REQUIRES[0]
+
+        # Pull out the md5 results and metadata
+        md5results, md5meta = REQUIRES[1]
+        # Make the md5 results a dictionary
+        md5dict = dict(md5results)
+
+        # Run through each value in the libmagic results
+        for filename, libmagicresult in libmagicresults:
+            if libmagicresult.startswith('PDF document'):
+                # If the file's md5 is present we will use that in the results
+                if filename in md5dict:
+                    results.append((filename, md5dict[filename] + " is a pdf"))
+                # If we don't know the md5 use the filename instead
+                else:
+                    results.append((filename, "is a pdf"))
+
+        # Create our metadata dictionary
+        metadata = {}
+        metadata["Name"] = NAME
+        metadata["Type"] = TYPE
+
+        # Return our super awesome results
+        return (results, metadata)
diff --git a/docs/custom/index.rst b/docs/custom/index.rst
new file mode 100644
index 00000000..a2a60672
--- /dev/null
+++ b/docs/custom/index.rst
@@ -0,0 +1,9 @@
+Custom Development
+==================
+
+.. toctree::
+
+   analysis-module
+   analytics
+   storage-module
+   example
diff --git a/docs/custom/storage-module.rst b/docs/custom/storage-module.rst
new file mode 100644
index 00000000..0517f681
--- /dev/null
+++ b/docs/custom/storage-module.rst
@@ -0,0 +1,41 @@
+.. _writing-a-storage-module:
+
+Writing a Storage Module
+========================
+
+Each storage object is a class which needs to be derived from ``storage.Storage``. You can have more than one storage object per Python file.
+
+Required components
+-------------------
+You will need to override ``store(self, results)``. ``results`` is a Python dictionary in one of two formats. It is either:
+
+.. code-block:: json
+
+    {
+        "Files": {
+            "file1": {},
+            "file2": {}
+        },
+        "Metadata": {
+            "module1": {},
+            "module2": {}
+        }
+    }
+
+or
+
+.. code-block:: json
+
+    {
+        "file1": {},
+        "file2": {}
+    }
+
+A storage module should support both, even if the metadata is discarded.
+
+Optional components
+-------------------
+
+- You can override ``DEFAULTCONF`` in your storage module. This is a dictionary of config options which will appear in the storage config file.
+- You can override ``setup(self)``. This should be anything that can be done once to prepare for multiple calls to ``store``, e.g. opening a network connection or file handle.
+- You can override ``teardown(self)``. This will be called when no more ``store`` calls are going to be made.
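+
+A minimal storage module following this interface might look like the following (a sketch only; ``PrintStorage`` is a hypothetical example that prints each report and discards the metadata):
+
+.. code-block:: python
+
+    import storage
+
+    class PrintStorage(storage.Storage):
+        DEFAULTCONF = {'ENABLED': True}
+
+        def store(self, results):
+            # Support both formats: unwrap "Files" if the full report was passed
+            files = results.get('Files', results)
+            for filename, report in files.items():
+                print(filename, report)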
diff --git a/docs/distributed_multiscanner.md b/docs/distributed_multiscanner.md
deleted file mode 100644
index 14e2f458..00000000
--- a/docs/distributed_multiscanner.md
+++ /dev/null
@@ -1,65 +0,0 @@
-# Distributed MultiScanner #
-MultiScanner is a file analysis framework that assists the user in evaluating a set of files by automatically running a suite of tools for the user and aggregating the output. Tools can be custom built python scripts, web APIs, software running on another machine, etc. Tools are incorporated by creating modules that run in the MultiScanner framework.
-
-Modules are designed to be quickly written and easily incorporated into the framework. Currently written and maintained modules are related to malware analytics, but the framework is not limited to that scope. For a list of modules you can look in [modules](../modules), descriptions and config options can be found in [modules.md](modules.md).
-
-MultiScanner also supports a distributed workflow for sample storage, analysis, and report viewing. This functionality includes a web interface, a REST API, a distributed file system (GlusterFS), distributed report storage / searching (ElasticSearch), and distributed task management (Celery / RabbitMQ).
-
-## Intended Use case ##
-Distributed MultiScanner is intended to solve any combination of these problems / use cases:
-
-* Malware repository (i.e, long term storage of binaries and metadata)
-* Scalable analysis capabilities
-    * Every component of the Distributed MultiScanner is designed with scale in mind
-    * Note this does not include the following:
-        * The scaling of external malware analysis tools such as Cuckoo
-        * Does not perform auto-scaling (e.g.
auto-provisioning of VM’s, etc) - * New nodes must be deployed manually and added to the Ansible playbook to receive the proper configurations -* Enable analytics on malware samples - * Either by interacting with the ElasticSearch backend or plugging into the web / REST UI - * Cyber Threat Intelligence (CTI) integration / storage -* Export CTI - * Intend to output reports in multiple formats: STIX, MAEC, PDF, HTML, and JSON - * Currently support JSON, MAEC 5.0, and HTML - * Enables sharing of malware analysis results -* Support file submission types: - * Currently support all file formats (e.g. PE, PDF, Office, etc…) - * Currently doesn’t support extraction of files from PCAP / memory dumps / other data streams (but this is in the dev plan) -* Intended users: - * Security Operations Centers (SOCs) - * Malware analysis centers - * CTI sharing organizations - -## Architecture ## -This is the current architecture: - -![Distributed MultiScanner Architecture](imgs/distributed_ms_diagram.PNG) - -When a sample is submitted (either via the web UI or the REST API), the sample is saved to the distributed file system (GlusterFS), a task is added to the distributed task queue (Celery), and an entry is added to the task management database (PostgreSQL). The worker nodes (Celery clients) all have the GlusterFS mounted, which gives them access to the samples for scanning. In our setup, we colocate the worker nodes with the GlusterFS nodes in order to reduce the network load of workers pulling samples from GlusterFS. When a new task is added to the Celery task queue, one of the worker nodes will pull the task and retrieve the corresponding sample from the GlusterFS via its SHA256 value. The worker node then performs the scanning work. Modules can be enabled / disabled via a configuration file. This configuration file is distributed to the workers by Ansible at setup time (details on this process later). When the worker finishes its scans, it will generate a JSON blob and index it into ElasticSearch for permanent storage. It will then update the task management database with a status of "Complete". The user will then be able view the report via the web interface or retrieve the raw JSON. - -## Setup ## -Currently, we deploy this system with Ansible. More information about that process can be found [here](https://github.com/mitre/multiscanner-ansible). We are also currently working to support deploying the distributed architecture via Docker. If you wish to get an idea of how the system works without having to go through the full process of setting up the distributed architecture, look into our docker containers for a standalone [system](docker_standalone.md). Obviously, the standalone system will be far less scalable / robust / feature-rich. However, it will stand up the web UI, the REST API, and an ElasticSearch node for you to see how the system works. The standalone container is intended as an introduction to the system and its capabilities, but not designed for use in production. - -## Architecture Details ## -What follows is a brief discussion of the tools and design choices we made in the creation of this system. - -### Web Frontend ### -The web application runs on [Flask](http://flask.pocoo.org/), uses [Bootstrap](https://getbootstrap.com/) and [jQuery](https://jquery.com/), and served via Apache. It is essentially an aesthetic wrapper around the REST API; all data and services provided are also available by querying the REST API. 
- -### REST API ### -The REST API is also powered by Flask and served via Apache. It has an underlying PostgreSQL database in order to facilitate task tracking. Additionally, it acts as a gateway to the backend ElasticSearch document store. Searches entered into the web UI will be routed through the REST API and passed to the ElasticSearch cluster. This abstracts the complexity of querying ElasticSearch and gives the user a simple web interface to work with. - -### Task Queue ### -We use Celery as our distributed task queue. - -### Task Tracking ### -PostgreSQL is our task management database. It is here that we keep track of scan times, samples, and the status of tasks (pending, complete, failed). - -### Distributed File System ### -GlusterFS is our distributed file system. Each component that needs access to the raw samples mounts the share via FUSE. We selected GlusterFS because it is much more performant in our use case of storing a large number of small samples than a technology like HDFS would be. - -### Worker Nodes ### -The worker nodes are simply Celery clients running the MultiScanner Python application. Addtionally, we implemented some batching within Celery to improve the performance of our worker nodes (which operate better at scale). Worker nodes will wait until there are 100 samples in its queue or 60 seconds have passed (whichever happens first) before kicking off its scan. These figures are configurable.OB - -### Report Storage ### -We use ElasticSearch to store the results of our file scans. This is where the true power of this system comes in. ElasticSearch allows for performant, full text searching across all our reports and modules. This allows fast access to interesting details from your malware analysis tools, pivoting between samples, and powerful analytics on report output. diff --git a/docs/docker_standalone.md b/docs/docker_standalone.md deleted file mode 100644 index bd118416..00000000 --- a/docs/docker_standalone.md +++ /dev/null @@ -1,21 +0,0 @@ -# Standalone Docker Container Notes # -In order to introduce new users to the power of the MultiScanner framework, web UI, and REST API, we have built a standalone docker application that is simple to run in new environments. Simply clone the top level directory and run: -``` -$ docker-compose up -``` -This will build the 3 necessary containers (one for the web application, one for the REST API, and one for the ElasticSearch backend). - -Running this command will generate a lot of output and take some time. The system is not ready until you see the following output in your terminal: -``` -api_1 | * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit) -``` - -**_Note 1_: We are assuming that you are already running latest version of docker and have the latest version of docker-compose installed on your machine. Guides on how to do that are here: https://docs.docker.com/engine/installation/ and here: https://docs.docker.com/compose/install/** - -**_Note 2_: Since this docker container runs two web applications and an ElasticSearch node, there is a fairly high requirement for RAM / computing power. We'd recommend running this on a machine with at least 4GB of RAM.** - -**_Note 3_: THIS CONTAINER IS NOT DESIGNED FOR PRODUCTION USE. This is simply a primer for using MultiScanner's web interface. Users should not run this in production or at scale. The MultiScanner framework is highly scalable and distributed, but that requires a full install. Currently, we support installing the distributed system via ansible. 
More information about that process can be found here: https://github.com/mitre/multiscanner-ansible** - -**_Note 4_: This container will only be reachable / functioning on localhost.** - -**_Note 5_: Additionally, if you are installing this system behind a proxy, you must edit the docker-compose.yml file in four places. First, uncomment [lines 18-20](<../docker-compose.yml#L18>) and [lines 35-37](<../docker-compose.yml#L35>). Next, uncomment [lines 25-28](../docker-compose.yml#L25>) and set the correct proxy variables there. Finally, do the same thing in [lines 42-45](../docker-compose.yml#L42>). The docker-compose.yml file has comments to make clear where to make these changes.** diff --git a/docs/elasticsearch.md b/docs/elasticsearch.md deleted file mode 100644 index 8b10bed7..00000000 --- a/docs/elasticsearch.md +++ /dev/null @@ -1,16 +0,0 @@ -# ElasticSearch Usage Notes # -Starting with ElasticSearch 2.X, field names may no longer contain '.' (dot) characters. Thus, the `elasticsearch_storage` module adds a pipeline called 'dedot' with a processor to replace dots in field names with underscores. - -## Setup ## -Add the following to your elasticsearch.yml config for the dedot processor to work: - -``` -script.painless.regex.enabled: true -``` - -If planning to use the Multiscanner web UI, also add the following: - -``` -http.cors.enabled: true -http.cors.allow-origin: "" -``` diff --git a/docs/examples/include_module.py b/docs/examples/include_module.py deleted file mode 100644 index 40b3d63f..00000000 --- a/docs/examples/include_module.py +++ /dev/null @@ -1,49 +0,0 @@ -from __future__ import (division, absolute_import, with_statement, - print_function, unicode_literals) - -TYPE = "Example" -NAME = "include example" -REQUIRES = ["libmagic", "MD5"] -DEFAULTCONF = { - 'ENABLED': True, -} - - -def check(conf=DEFAULTCONF): - # If the config disabled the module don't run - if not conf['ENABLED']: - return False - # If one of the required modules failed, don't run - if None in REQUIRES: - return False - return True - - -def scan(filelist, conf=DEFAULTCONF): - # Define our results array - results = [] - # Pull out the libmagic results and metadata - libmagicresults, libmagicmeta = REQUIRES[0] - - # Pull out the md5 results and metadata - md5results, md5meta = REQUIRES[1] - # Make the md5 results a dictionary - md5dict = dict(md5results) - - # Run through each value in the libmagic results - for filename, libmagicresult in libmagicresults: - if libmagicresult.startswith('PDF document'): - # If the file's md5 is present we will use that in the results - if filename in md5dict: - results.append((filename, md5dict[filename] + " is a pdf")) - # If we don't know the md5 use the filename instead - else: - results.append((filename, "is a pdf")) - - # Create out metadata dictionary - metadata = {} - metadata["Name"] = NAME - metadata["Type"] = TYPE - - # Return our super awesome results - return (results, metadata) diff --git a/docs/imgs/Selection_001.png b/docs/imgs/Selection_001.png deleted file mode 100644 index 6d46951f..00000000 Binary files a/docs/imgs/Selection_001.png and /dev/null differ diff --git a/docs/imgs/Selection_002.png b/docs/imgs/Selection_002.png deleted file mode 100644 index a725e029..00000000 Binary files a/docs/imgs/Selection_002.png and /dev/null differ diff --git a/docs/imgs/Selection_003.png b/docs/imgs/Selection_003.png deleted file mode 100644 index 9609fcc4..00000000 Binary files a/docs/imgs/Selection_003.png and /dev/null differ diff --git 
a/docs/imgs/Selection_004.png b/docs/imgs/Selection_004.png deleted file mode 100644 index 03e3984d..00000000 Binary files a/docs/imgs/Selection_004.png and /dev/null differ diff --git a/docs/imgs/Selection_005.png b/docs/imgs/Selection_005.png deleted file mode 100644 index a72df539..00000000 Binary files a/docs/imgs/Selection_005.png and /dev/null differ diff --git a/docs/imgs/Selection_006.png b/docs/imgs/Selection_006.png deleted file mode 100644 index 6c210073..00000000 Binary files a/docs/imgs/Selection_006.png and /dev/null differ diff --git a/docs/imgs/Selection_007.png b/docs/imgs/Selection_007.png deleted file mode 100644 index aaa346da..00000000 Binary files a/docs/imgs/Selection_007.png and /dev/null differ diff --git a/docs/imgs/Selection_008.png b/docs/imgs/Selection_008.png deleted file mode 100644 index 5bfc60eb..00000000 Binary files a/docs/imgs/Selection_008.png and /dev/null differ diff --git a/docs/imgs/Selection_009.png b/docs/imgs/Selection_009.png deleted file mode 100644 index 5bb6ffc1..00000000 Binary files a/docs/imgs/Selection_009.png and /dev/null differ diff --git a/docs/imgs/Selection_010.png b/docs/imgs/Selection_010.png deleted file mode 100644 index 05406dbd..00000000 Binary files a/docs/imgs/Selection_010.png and /dev/null differ diff --git a/docs/imgs/Selection_011.png b/docs/imgs/Selection_011.png deleted file mode 100644 index 73241164..00000000 Binary files a/docs/imgs/Selection_011.png and /dev/null differ diff --git a/docs/imgs/Selection_012.png b/docs/imgs/Selection_012.png deleted file mode 100644 index d5c3ea74..00000000 Binary files a/docs/imgs/Selection_012.png and /dev/null differ diff --git a/docs/imgs/Selection_013.png b/docs/imgs/Selection_013.png deleted file mode 100644 index 3f0134dd..00000000 Binary files a/docs/imgs/Selection_013.png and /dev/null differ diff --git a/docs/imgs/Selection_014.png b/docs/imgs/Selection_014.png deleted file mode 100644 index fea2e6a2..00000000 Binary files a/docs/imgs/Selection_014.png and /dev/null differ diff --git a/docs/imgs/Selection_015.png b/docs/imgs/Selection_015.png deleted file mode 100644 index 03e0276e..00000000 Binary files a/docs/imgs/Selection_015.png and /dev/null differ diff --git a/docs/imgs/Selection_016.png b/docs/imgs/Selection_016.png deleted file mode 100644 index 7aaff1dc..00000000 Binary files a/docs/imgs/Selection_016.png and /dev/null differ diff --git a/docs/imgs/Selection_017.png b/docs/imgs/Selection_017.png deleted file mode 100644 index 62f0a8a0..00000000 Binary files a/docs/imgs/Selection_017.png and /dev/null differ diff --git a/docs/imgs/Selection_018.png b/docs/imgs/Selection_018.png deleted file mode 100644 index 0596e71e..00000000 Binary files a/docs/imgs/Selection_018.png and /dev/null differ diff --git a/docs/imgs/Selection_019.png b/docs/imgs/Selection_019.png deleted file mode 100644 index 98320fd6..00000000 Binary files a/docs/imgs/Selection_019.png and /dev/null differ diff --git a/docs/imgs/Selection_020.png b/docs/imgs/Selection_020.png deleted file mode 100644 index 0de26558..00000000 Binary files a/docs/imgs/Selection_020.png and /dev/null differ diff --git a/docs/imgs/Selection_021.png b/docs/imgs/Selection_021.png deleted file mode 100644 index 1380d0ae..00000000 Binary files a/docs/imgs/Selection_021.png and /dev/null differ diff --git a/docs/imgs/Selection_022.png b/docs/imgs/Selection_022.png deleted file mode 100644 index a281a02e..00000000 Binary files a/docs/imgs/Selection_022.png and /dev/null differ diff --git a/docs/imgs/Selection_023.png 
b/docs/imgs/Selection_023.png
deleted file mode 100644
index 5c728fbe..00000000
Binary files a/docs/imgs/Selection_023.png and /dev/null differ
diff --git a/docs/imgs/Selection_024.png b/docs/imgs/Selection_024.png
deleted file mode 100644
index 31d5ad83..00000000
Binary files a/docs/imgs/Selection_024.png and /dev/null differ
diff --git a/docs/imgs/distributed_ms_diagram.PNG b/docs/imgs/distributed_ms_diagram.PNG
deleted file mode 100755
index 9a821765..00000000
Binary files a/docs/imgs/distributed_ms_diagram.PNG and /dev/null differ
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 00000000..fd94e20e
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,18 @@
+.. MultiScanner documentation master file, created by
+   sphinx-quickstart on Fri Dec 22 13:35:06 2017.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+MultiScanner
+============
+
+.. toctree::
+   :maxdepth: 2
+
+   overview
+   arch
+   use-cases
+   install
+   use/index
+   custom/index
+   testing
diff --git a/docs/install.rst b/docs/install.rst
new file mode 100644
index 00000000..a7ebedb6
--- /dev/null
+++ b/docs/install.rst
@@ -0,0 +1,122 @@
+Installation
+============
+
+Installation information for the different components of MultiScanner is provided below. To get an idea of how the system works without going through the full process of setting up the distributed architecture, refer to the section on :ref:`standalone-docker-installation`.
+
+The Docker standalone system is less scalable, robust, and feature-rich, but it makes it easy to stand up the web UI, the REST API, and an Elasticsearch node, allowing users to quickly see how the system works. The standalone container is intended as an introduction to the system and its capabilities, but is not designed for operational use.
+
+System Requirements
+-------------------
+
+Python 3.6 is recommended. Compatibility with Python 2.7+ and 3.4+ is supported but not thoroughly maintained and tested. Please submit an issue or a pull request fixing any issues found with other versions of Python.
+
+An installer script is included in the project (`install.sh `_), which installs the prerequisites on most systems.
+
+Currently, MultiScanner is deployed with Ansible, and we're working to support distributed architecture deployment via Docker.
+
+Installing MultiScanner
+-----------------------
+
+The `installer script `_ should install the required Python packages for users of RedHat- or Debian-based Linux distributions. Users of other distributions should refer to `requirements.txt `_.
+
+MultiScanner requires a configuration file to run. After cloning the repository, generate the MultiScanner default configuration by running ``python multiscanner.py init``. The command can be used to rewrite the configuration file to its default state or, if new modules have been written, to add their configuration details to the configuration file.
+
+.. _installing-analytic-machines:
+
+Installing Analytic Machines
+----------------------------
+
+Default modules have the option of being run locally or via SSH. The development team runs MultiScanner on a Linux host and hosts the majority of analytical tools on a separate Windows machine. The SSH server used in this environment is `freeSSHd `_.
+
+A network share accessible to both the MultiScanner and the analytic machines is required for the multi-machine setup.
+Once configured, the network share path must be identified in the configuration file, config.ini (an example can be found `here `_). To do this, set the ``copyfilesto`` option under ``[main]`` to be the mount point on the system running MultiScanner. Modules can have a ``replacement path`` option, which is the network share mount point on the analytic machine.
+
+Installing Elasticsearch
+------------------------
+
+Starting with Elasticsearch 2.x, field names can no longer contain '.' (dot) characters. Thus, the MultiScanner elasticsearch_storage module adds a pipeline called "dedot" with a processor to replace dots in field names with underscores.
+
+Add the following to the elasticsearch.yml configuration file for the dedot processor to work::
+
+    script.painless.regex.enabled: true
+
+To use the MultiScanner web UI, also add the following::
+
+    http.cors.enabled: true
+    http.cors.allow-origin: ""
+
+Configuration
+-------------
+
+MultiScanner and its modules are configured within the configuration file, config.ini. An example can be found `here `_.
+
+The following parameters configure MultiScanner itself, and go in the ``[main]`` section of the config file.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**copyfilesto**      This is where the script will copy each file that is to be scanned. This can be removed or set to False to disable this feature.
+**group-types**      This is the type of analytics to group into sections for the report. This can be removed or set to False to disable this feature.
+**storage-config**   Path to the storage config file.
+**api-config**       Path to the API config file.
+**web-config**       Path to the Web UI config file.
+==================== =============================
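+
+For illustration, a ``[main]`` section using these parameters might look like the following (paths and file names are examples only, not defaults):
+
+.. code-block:: ini
+
+    [main]
+    copyfilesto = /mnt/samples
+    group-types = ["Antivirus", "Metadata"]
+    storage-config = storage.ini
+    api-config = api_config.ini
+    web-config = web_config.ini
+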
+
+Configuration
+-------------
+
+MultiScanner and its modules are configured within the configuration file, config.ini. An example can be found
+`here `_.
+
+The following parameters configure MultiScanner itself, and go in the ``[main]``
+section of the config file.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**copyfilesto**      This is where the script will copy each file that is to be scanned. This can be removed or set to False to disable this feature.
+**group-types**      This is the type of analytics to group into sections for the report. This can be removed or set to False to disable this feature.
+**storage-config**   Path to the storage config file.
+**api-config**       Path to the API config file.
+**web-config**       Path to the Web UI config file.
+==================== =============================
+
+Modules are intended to be quickly written and incorporated into the framework. Note that:
+
+* A finished module must be placed in the modules folder before it can be used.
+
+* The configuration file does not need to be manually updated.
+
+* Modules are configured within the same configuration file, config.ini.
+
+See :ref:`analysis-modules` for information about all current modules and their configuration parameters.
+
+.. _standalone-docker-installation:
+
+Standalone Docker Installation
+------------------------------
+
+To introduce new users to the power of the MultiScanner framework, web UI, and REST API, we have built a standalone Docker application that is simple to run in new environments. Simply clone the top-level directory and run::
+
+    $ docker-compose up
+
+This will build the three necessary containers (one for the web application, one for the REST API, and one for the Elasticsearch backend).
+
+Running this command will generate a lot of output and take some time. The system is not ready until you see the following output in your terminal::
+
+    api_1  | * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
+
+Now you can go to the web interface at ``http://localhost:8000``.
+
+.. note::
+
+   We are assuming that you are already running the latest version of Docker and have the latest version of Docker Compose installed on your machine. Guides on how to do that can be found `here `__ and `here `__.
+
+.. note::
+
+   Since this Docker container runs two web applications and an Elasticsearch node, there is a fairly high requirement for RAM / computing power. We'd recommend running this on a machine with at least 4GB of RAM.
+
+.. warning::
+
+   THIS CONTAINER IS NOT DESIGNED FOR PRODUCTION USE. This is simply a primer for using MultiScanner's web interface. Users should not run this in production or at scale. The MultiScanner framework is highly scalable and distributed, but that requires a full install. Currently, we support installing the distributed system via Ansible. More information about that process can be found in `this repo `_.
+
+.. note::
+
+   This container will only be reachable and functional on localhost.
+
+.. note::
+
+   Additionally, if you are installing this system behind a proxy, you must edit the docker-compose.yml file in four places. First, uncomment `lines 18-20 <../docker-compose.yml#L18>`_ and `lines 35-37 <../docker-compose.yml#L35>`_. Next, uncomment `lines 25-28 <../docker-compose.yml#L25>`_ and set the correct proxy variables there. Finally, do the same thing in `lines 42-45 <../docker-compose.yml#L42>`_. The docker-compose.yml file has comments to make clear where to make these changes.
diff --git a/docs/make.bat b/docs/make.bat
new file mode 100644
index 00000000..c9276a9f
--- /dev/null
+++ b/docs/make.bat
@@ -0,0 +1,36 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+set SPHINXPROJ=MultiScanner
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.http://sphinx-doc.org/
+	exit /b 1
+)
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
+
+:end
+popd
diff --git a/docs/module_writing.md b/docs/module_writing.md
deleted file mode 100644
index 144290ac..00000000
--- a/docs/module_writing.md
+++ /dev/null
@@ -1,49 +0,0 @@
-Module Writing
---------------
-Modules are intended to be easily written and incorporated into the MultiScanner framework. A finished module must be in the modules folder for it to be used on the next run.
-
-## Functions ##
-When writing a new module, there are two mandatory functions that must be defined: check() and scan(). Additional functions can be written if required.
-
-### check() ###
-The check() function tests whether or not the scan function should be run.
-
-**Inputs:** There are two supported argument sets with this function: `check()` and `check(conf=DEFAULTCONF)`. If a module has a global variable DEFAULTCONF, the second argument set is required.
-
-**Outputs:** The return value of the check() function is a boolean (True or False). A True return value indicated the scan() function should be run; a False return value indicates the module should no longer be run.
-
-### scan() ###
-The scan() function performs the analytic and returns the results.
-
-**Inputs:** There are two supported argument sets with this function: `scan(filelist)` and `scan(filelist, conf=DEFAULTCONF)`. If a module has a global variable DEFAULTCONF, the second argument set is required.
-
-**Outputs:** There are two return values of the scan() function: Results and Metadata (i.e., `return (Results, Metadata)`).
-
-- **Results** is a list of tuples, the tuples values being the filename and the corresponding scan results (i.e.,`[("file1.exe", "Executable"), ("file2.jpg", "Picture")]`)
-
-- **Metadata** is a dictionary of metadata information from the module. There are two required pieces of metadata `Name` and `Type`. Name is the name in the module and will be used in the report. Type is what type of module it is (e.g., Antivirus, content detonation). This information is used for a grouping feature in the report generation and is helpful to provide context to a newly written module. Optionally, metadata information can be disabled and not be included in the report by setting `metadata["Include"] = False`.
-
-## Special Globals ##
-There are two global variables that when present, affect the way the module is called.
-
-**DEFAULTCONF** - This is a dictionary of configuration settings. When set, the settings will be written to the configuration file, making it user editable. The configuration object will be passed to the module's check and scan function and must be an argument in both functions.
-
-**REQUIRES** - This is a list of the module results needed for a module. For example, `REQUIRES = ['MD5']` will be set to the output from the module MD5.py. A code sample is provided in [examples/include_module.py](examples/include_module.py)
-
-## Module Interface ##
-The module interface is a class that is put into each module as it is run. This allows for several features to be added for interacting with the framework at runtime. It is injected as `multiscanner` in the global namespace.
-
-### Variables ###
-* `write_dir` - This is a directory path that your module can write to. This will be unique for each run.
-* `run_count` - This is an integer that increments for each subscan that is called. It is useful for preventing infinite recurring
-
-### Functions ###
-* `apply_async(func, args=(), kwds={}, callback=None)` - This mirrors multiprocessing.Pool.apply_async and returns a [multiprocessing.pool.AsyncResult](https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.AsyncResult). The pool is shared by all modules.
-* `scan_file(file_path, from_filename)` - This will scan a file that was found inside another file. `file_path` should be the extracted file on the filesystem (you can write it in path at `multiscanner.write_dir`). `from_filename` is the file it was extracted from.
-
-## Config ##
-If a module requires configuration, the DEFAULTCONF global variable must be defined. This variable is passed to both check() and scan(). The configuration will be read from the configuration file if it is present. If the file is not present, it will be written into the configuration file.
-
-If `replacement path` is set in the configuration, the module will receive file names, with the folder path replaced with the variable's value. This is useful for analytics which are run on a remote machine.
-
-By default, ConfigParser reads everything in as a string, before options are passed to the module `ast.literal_eval()` is ran on each option. If a string is not returned when expected, this is why. This does mean that the correct python type will be returned instead of all strings.
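For reference, a minimal sketch of a module implementing the check()/scan() interface documented above (the module name and its toy "analytic" are hypothetical):

``` python
# modules/filesize.py -- hypothetical example module
import os

DEFAULTCONF = {
    'ENABLED': True,
}

def check(conf=DEFAULTCONF):
    # Only run when the module is enabled in config.ini
    return conf['ENABLED']

def scan(filelist, conf=DEFAULTCONF):
    # Results: a list of (filename, result) tuples
    results = [(fname, os.path.getsize(fname)) for fname in filelist]
    # Metadata: Name appears in the report; Type drives report grouping
    metadata = {'Name': 'FileSize', 'Type': 'Metadata'}
    return results, metadata
```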
diff --git a/docs/modules.md b/docs/modules.md
deleted file mode 100644
index f29f9fed..00000000
--- a/docs/modules.md
+++ /dev/null
@@ -1,107 +0,0 @@
-### General ###
-- **path** - This is where the executable is located
-- **cmdline** - This is an array of command line options be to passed to the executable
-- **host** - This is the hostname, port, and username of the machine that will be SSHed into to run the analytic if the executable is not present on the local machine.
-- **key** - This is the SSH key to be used to SSH into the host.
-- **replacement path** - If the main config is set to copy the scanned files this will be what it replaces the path with. It should be where the network share is mounted
-- **ENABLED** - When set to false the module will not run
-
-### [main] ###
-This is the configuration for the main script
-
-- **copyfilesto** - This is where the script will copy each file that is to be scanned. This can be removed or set to False to disable this feature
-- **group-types** - This is the type of analytics to group into sections for the report. This can be removed or set to False to disable this feature
-
-### [AVGScan] ###
-This module scans a file with AVG 2014 anti-virus.
-
-### [ClamAVScan] ###
-This module scans a file with ClamAV.
-
-### [Cuckoo] ###
-This module submits a file to a Cuckoo Sandbox cluster for analysis
-
-- **API URL** - This is the URL to the API server
-- **timeout** - This is max time a sample with run for
-- **running timeout** - This is an additional timeout, if a task is in the running state this many seconds past **timeout** we will consider the task failed.
-- **delete tasks** - When set to True, tasks will be deleted from cuckoo after detonation. This is to prevent filling up the Cuckoo machine's disk with reports.
-- **maec** - When set to True, [MAEC](https://maecproject.github.io) JSON report is added to Cuckoo JSON report. *NOTE*: Cuckoo needs MAEC reporting enabled to produce results.
-
-### [ExifToolsScan] ###
-This module scans the file with Exif tools and returns the results.
-
-- **remove-entry** - A python list of ExifTool results that should not be included in the report. File system level attributes are not useful and stripped out
-
-### [FireeyeScan] ###
-This module uses a FireEye AX to scan the files. It uses the Malware Repository feature to automatically scan files. This may not be the best way but it does work. It will copy the files to be scanned to the mounted share folders.
-*NOTE*: This module is suuuuuper slow
-
-- **base path** - The mount point where the fireeye images folders are
-- **src folder** - The folder name where input files are put
-- **fireeye images** - A python list of the VMs in fireeye. These are used to generate where to copy the files.
-- **enabled** - True or False
-- **good path** - The folder name where good files are put
-- **cheatsheet** - Not implemented yet
-
-### [flarefloss] ###
-This module extracts ASCII, UTF-8, stack and obfuscated strings from executable files. More information about module configuration can be found at the [flare-floss](https://github.com/fireeye/flare-floss/blob/master/doc/usage.md) documentation.
-
-### [impfuzzy] ###
-This module calculates a fuzzy hash using ssdeep where Windows PE imports is the input. This strategy was originally described in a [blog post](http://blog.jpcert.or.jp/2016/05/classifying-mal-a988.html) from JPCERT/CC.
-
-### [libmagic] ###
-This module runs libmagic against the files.
-
-- **magicfile** - The path to the compiled magic file you wish to use. If None it will use the default one.
-
-### [MD5] ###
-This module generates the MD5 hash of the files.
-
-### [McAfeeScan] ###
-This module scans the files with McAfee AntiVirus Command Line.
-
-### [officemeta] ###
-This module extracts metadata from Microsoft Office documents.
-
-*Note*: This module does not support [OOXML](https://en.wikipedia.org/wiki/Office_Open_XML) documents (e.g., docx, pptx, xlsx).
-
-### [pdfinfo] ###
-This module extracts out feature information from PDF files. It uses [pdf-parser](http://blog.didierstevens.com/programs/pdf-tools/)
-
-### [PEFile] ###
-This module extracts out feature information from EXE files. It uses [pefile](https://code.google.com/p/pefile/) which is currently not available for python 3.
-
-### [SHA256] ###
-This module generates the SHA256 hash of the files.
-
-### [ssdeeper] ###
-This module generates context triggered piecewise hashes (CTPH) for the files. More information can be found on the [ssdeep website](http://ssdeep.sourceforge.net/).
-
-### [Tika] ###
-This module extracts metadata from the file using [Tika](https://tika.apache.org/). For configuration of the module see the [tika-python](https://github.com/chrismattmann/tika-python/blob/master/README.md) documentation.
-
-- **remove-entry** - A python list of Tika results that should not be included in the report.
-
-### [TrID] ###
-This module runs [TrID](http://mark0.net/soft-trid-e.html) against the files. The definition file should be in the same folder as the executable
-
-### [vtsearch] ###
-This module searches [virustotal](https://www.virustotal.com/) for the files hash and download the report if available.
-
-- **apikey** - This is your public/private api key. You can optionally make it a list and the requests will be distributed across them. This is useful when two groups with private api keys want to share the load and reports
-
-### [VxStream] ###
-This module submits a file to a VxStream Sandbox cluster for analysis
-
-- **API URL** - This is the URL to the API server (include the /api/ in this URL)
-- **API Key** - This is the user's API key to the API server
-- **API Secret** - This is the user's secret to the API server
-- **timeout** - This is max time a sample with run for
-- **running timeout** - This is an additional timeout, if a task is in the running state this many seconds past **timeout** we will consider the task failed.
-
-### [YaraScan] ###
-This module scans the files with yara and returns the results. You will need yara-python installed for this module.
-
-- **ruledir** - The directory to look for rule files in
-- **fileextensions** - A python array of all valid rule file extensions. Files not ending in one of these will be ignored.
-- **ignore-tags** - A python array of yara rule tags that will not be included in the report.
diff --git a/docs/overview.rst b/docs/overview.rst
new file mode 100644
index 00000000..addd4456
--- /dev/null
+++ b/docs/overview.rst
@@ -0,0 +1,25 @@
+Overview
+========
+MultiScanner is a distributed file analysis framework that assists the user in evaluating a set
+of files by automatically running a suite of tools and aggregating the output.
+Tools can be custom Python scripts, web APIs, software running on another machine, etc.
+Tools are incorporated by creating modules that run in the MultiScanner framework.
+
+By design, modules can be quickly written and easily incorporated into the framework.
+While current modules are related to malware analysis, the MultiScanner framework is not limited in
+scope. For descriptions of current modules, see :ref:`analysis-modules`.
+
+MultiScanner supports a distributed workflow for sample storage, analysis, and report viewing. This functionality includes a web interface, a REST API, a distributed file system (GlusterFS), distributed report storage / searching (Elasticsearch), and distributed task management (Celery / RabbitMQ). See the :ref:`complete-workflow` section for details.
+
+MultiScanner is available as open source on `GitHub `_.
+
+Key Capabilities
+----------------
+As illustrated in the diagram below, MultiScanner helps the malware analyst, enabling analysis with both automated and manual tools, providing integration and scaling capabilities, and correlating analysis results. It allows analysts to associate metadata with samples and also allows integration of data from external sources. MultiScanner is particularly useful because data is linked across tools and samples, allowing pivoting and analytics.
+
+.. figure:: _static/img/overview.png
+   :align: center
+   :scale: 40 %
+   :alt: Overview
+
+   Key Capabilities
diff --git a/docs/requirements.txt b/docs/requirements.txt
new file mode 100644
index 00000000..82133027
--- /dev/null
+++ b/docs/requirements.txt
@@ -0,0 +1,2 @@
+sphinx
+sphinx_rtd_theme
diff --git a/docs/storage_module.md b/docs/storage_module.md
deleted file mode 100644
index bd2f7796..00000000
--- a/docs/storage_module.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# Writing a storage module #
-Each storage object is a class which needs to be derived from `storage.Storage`. You can have more than one storage object
-per python file.
-
-## Required components ##
-You will need to override `store(self, results)` results is a python dictionary that is one of two formats. It is either
-```json
-{
-    'Files': {
-        'file1': {},
-        'file2': {}
-    }
-    'Metadata': {
-        'module1': {},
-        'module2': {}
-    }
-}
-```
-or
-```json
-{
-    'file1': {},
-    'file2': {}
-}
-```
-A storage module should support both, even if the metadata is discarded.
-
-## Optional components ##
-* You can override `DEFAULTCONF` in your storage module which will appear in the storage config file. This is a dictionary
-of config options.
-* You can override `setup(self)`. This should be anything that can be done once to prepare for mutliple calls to `store`
-IE opening a network connection or file handle.
-* You can override `teardown(self)`. This will be called when no more `store` calls are going to be made.
diff --git a/docs/testing.md b/docs/testing.md
deleted file mode 100644
index 423c38c5..00000000
--- a/docs/testing.md
+++ /dev/null
@@ -1,55 +0,0 @@
-# Testing #
-Running the MultiScanner test suite is fairly straight forward. We use the [pytest framework](https://docs.pytest.org/en/latest/), which you can install by running:
-```
-$ pip install pytest
-```
-
-After that, simply cd into the top level multiscanner directory and run the command:
-```
-$ pytest
-```
-
-This will automatically find all the tests in the tests/ directory and run them. We encourage developers of new modules and users to contribute to our testing suite!
-
-## Front-end Tests with Selenium ##
-Running front-end tests with Selenium requires installation and configuration outside of the Python environment, namely
-the installation of Firefox and geckodriver.
-
-1. Install Firefox.
-1. Download latest geckodriver release from [GitHub](https://github.com/mozilla/geckodriver/releases).
-1. Add geckodriver to system path.
-
-Additional information about geckodriver setup can be found
-[here](https://developer.mozilla.org/en-US/docs/Mozilla/QA/Marionette/WebDriver#Setting_up_the_geckodriver_executable).
-
-If pytest is unable to find Firefox or geckodriver, the front-end tests will be skipped. This is indicated by a
-'s' in the pytest output.
-
-Tests have been run successfullly with Firefox 58 and geckodriver 0.19.1 on macOS and Ubuntu 14.04, 16.04.
-
-### CentOS ###
-The Firefox version available in the base repo is too far out-of-date to be compatible with the tests. Manually update
-Firefox to the latest version.
-
-1. Remove old version of Firefox.
-
-    ```
-    # yum remove firefox
-    ```
-
-2. You may need to install these dependencies for Firefox:
-    ```
-    # yum install -y gtk3 glib-devel glib pango pango-devel
-    ```
-3. Download latest version of Firefox.
-
-    ```
-    # cd /usr/local
-    # curl -L http://ftp.mozilla.org/pub/firefox/releases/58.0/linux-x86_64/en-US/firefox-58.0.tar.bz2 | tar -xjf
-    ```
-
-4. Add symlink to bin dir.
-
-    ```
-    # ln -s /usr/local/firefox/firefox /usr/bin/firefox
-    ```
diff --git a/docs/testing.rst b/docs/testing.rst
new file mode 100644
index 00000000..b40a26b3
--- /dev/null
+++ b/docs/testing.rst
@@ -0,0 +1,52 @@
+Testing
+=======
+
+Running the MultiScanner test suite is fairly straightforward. We use the `pytest framework `_, which you can install by running::
+
+    $ pip install pytest
+
+After that, simply cd into the top-level multiscanner directory and run the command::
+
+    $ pytest
+
+This will automatically find all the tests in the tests/ directory and run them. We encourage developers of new modules and users to contribute to our testing suite!
+
+Front-end Tests with Selenium
+-----------------------------
+
+Running front-end tests with Selenium requires installation and configuration outside of the Python environment, namely
+the installation of Firefox and geckodriver.
+
+1. Install Firefox.
+2. Download the latest geckodriver release from `GitHub `_.
+3. Add geckodriver to the system path.
+
+Additional information about geckodriver setup can be found
+`here `_.
+
+If pytest is unable to find Firefox or geckodriver, the front-end tests will be skipped. This is indicated by an
+'s' in the pytest output.
+
+Tests have been run successfully with Firefox 58 and geckodriver 0.19.1 on macOS and Ubuntu 14.04, 16.04.
+
+CentOS
+^^^^^^
+The Firefox version available in the base repo is too far out-of-date to be compatible with the tests. Manually update
+Firefox to the latest version.
+
+1. Remove the old version of Firefox::
+
+    $ yum remove firefox
+
+2. You may need to install these dependencies for Firefox::
+
+    $ yum install -y gtk3 glib-devel glib pango pango-devel
+
+3. Download the latest version of Firefox::
+
+    $ cd /usr/local
+    $ curl -L http://ftp.mozilla.org/pub/firefox/releases/58.0/linux-x86_64/en-US/firefox-58.0.tar.bz2 | tar -xjf -
+
+4. Add a symlink to the bin dir::
+
+    $ ln -s /usr/local/firefox/firefox /usr/bin/firefox
diff --git a/docs/use-cases.rst b/docs/use-cases.rst
new file mode 100644
index 00000000..c105ff39
--- /dev/null
+++ b/docs/use-cases.rst
@@ -0,0 +1,26 @@
+Use Cases
+=========
+
+MultiScanner is intended to be used by security operations centers, malware analysis centers, and other organizations involved with cyber threat intelligence (CTI) sharing. This section outlines associated use cases.
+
+Scalable Malware Analysis
+-------------------------
+Every component of MultiScanner is designed with scaling in mind, enabling analysis of large malware data sets.
+
+Note that scaling required for external analysis tools such as Cuckoo Sandbox is beyond the scope of MultiScanner code, as is auto-scaling (e.g., scaling required to auto-provision virtual machines). New worker nodes must be deployed manually and added to the Ansible playbook for proper configuration (see :ref:`installing-analytic-machines`).
+
+Manual Malware Analysis
+-----------------------
+MultiScanner can support manual malware analysis via modules that enable analyst interaction. For example, a module could be developed to allow an analyst to interact with IDA Pro to disassemble and analyze a binary file.
+
+Analysis-Oriented Malware Repository
+------------------------------------
+MultiScanner enables long-term storage of binaries and metadata associated with malware analysis.
+
+Data Enrichment
+---------------
+Malware analysis results can be enriched in support of CTI sharing objectives. In addition to data derived from analysis of submitted samples, other CTI sources can be integrated with MultiScanner, such as TAXII feeds, commercial CTI providers (FireEye, Proofpoint, CrowdStrike, etc.), and closed-source CTI providers.
+
+Data Analytics
+--------------
+Data analytics can be performed on malware samples either by interacting with the Elasticsearch datastore or via the Web/REST UI.
diff --git a/docs/use/index.rst b/docs/use/index.rst
new file mode 100644
index 00000000..fabc8511
--- /dev/null
+++ b/docs/use/index.rst
@@ -0,0 +1,11 @@
+Using MultiScanner
+==================
+
+.. toctree::
+   :maxdepth: 1
+
+   web-ui
+   python-api
+   rest-api
+   use-analysis-mods
+   use-analytics
diff --git a/docs/use/python-api.rst b/docs/use/python-api.rst
new file mode 100644
index 00000000..2286fe3f
--- /dev/null
+++ b/docs/use/python-api.rst
@@ -0,0 +1,16 @@
+.. _python-api:
+
+Python API
+==========
+
+MultiScanner can be incorporated as a module in another project. Below is a simple example of how to import MultiScanner into a Python script.
+
+.. code-block:: python
+
+    import multiscanner
+    output = multiscanner.multiscan(file_list)
+    results = multiscanner.parse_reports(output, python=True)
+
+``results`` is a dictionary object where each key is a filename of a scanned file.
+
+``multiscanner.config_init(filepath)`` will create a default configuration file at the location defined by ``filepath``.
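+
+As a slightly fuller sketch, here is a hypothetical end-to-end run (the file paths are placeholders, and the per-file report layout can vary with your module configuration):
+
+.. code-block:: python
+
+    import multiscanner
+
+    file_list = ['/tmp/sample1.exe', '/tmp/sample2.pdf']  # hypothetical samples
+    output = multiscanner.multiscan(file_list)
+    results = multiscanner.parse_reports(output, python=True)
+
+    for filename, report in results.items():
+        # Each value collects the per-module results for that file
+        print(filename, sorted(report))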
diff --git a/docs/use/rest-api.rst b/docs/use/rest-api.rst
new file mode 100644
index 00000000..60c71b50
--- /dev/null
+++ b/docs/use/rest-api.rst
@@ -0,0 +1,35 @@
+RESTful API
+===========
+
+The RESTful API is provided by a Flask app that supports the following operations:
+
+====== ========================================== =======================================
+Method URI                                        Description
+====== ========================================== =======================================
+GET    /                                          Test functionality. Should produce: ``{'Message': 'True'}``
+GET    /api/v1/files/<sha256>?raw={t|f}           Download sample, defaults to passwd protected zip
+GET    /api/v1/modules                            Receive list of modules available
+GET    /api/v1/tags                               Receive list of all tags in use
+GET    /api/v1/tasks                              Receive list of tasks in MultiScanner
+POST   /api/v1/tasks                              POST file and receive report id.
+                                                  Sample POST usage:
+                                                  ``curl -i -X POST http://localhost:8080/api/v1/tasks -F file=@/bin/ls``
+GET    /api/v1/tasks/<task_id>                    Receive task in JSON format
+DELETE /api/v1/tasks/<task_id>                    Delete task_id
+GET    /api/v1/tasks/search/                      Receive list of most recent report for matching samples
+GET    /api/v1/tasks/search/history               Receive list of all reports for matching samples
+GET    /api/v1/tasks/<task_id>/file?raw={t|f}     Download sample, defaults to passwd protected zip
+GET    /api/v1/tasks/<task_id>/maec               Download the Cuckoo MAEC 5.0 report, if it exists
+GET    /api/v1/tasks/<task_id>/notes              Receive list of this task's notes
+POST   /api/v1/tasks/<task_id>/notes              Add a note to task
+PUT    /api/v1/tasks/<task_id>/notes/<note_id>    Edit a note
+DELETE /api/v1/tasks/<task_id>/notes/<note_id>    Delete a note
+GET    /api/v1/tasks/<task_id>/report?d={t|f}     Receive report in JSON, set d=t to download
+GET    /api/v1/tasks/<task_id>/pdf                Receive PDF report
+POST   /api/v1/tasks/<task_id>/tags               Add tags to task
+DELETE /api/v1/tasks/<task_id>/tags               Remove tags from task
+GET    /api/v1/analytics/ssdeep_compare           Run ssdeep.compare analytic
+GET    /api/v1/analytics/ssdeep_group             Receive list of sample hashes grouped by ssdeep hash
+====== ========================================== =======================================
+
+The API endpoints all have Cross Origin Resource Sharing (CORS) enabled. By default it will allow requests from any port on localhost. Change this setting by modifying the ``cors`` setting in the ``api`` section of the api config file.
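+
+For illustration, the same submit-then-fetch flow from Python, using the third-party ``requests`` library (host, port, and the response layout are assumptions; adjust them to your deployment):
+
+.. code-block:: python
+
+    import requests
+
+    base = 'http://localhost:8080'
+
+    # POST a sample; the JSON response identifies the newly created task
+    with open('/bin/ls', 'rb') as f:
+        print(requests.post(base + '/api/v1/tasks', files={'file': f}).json())
+
+    # Once analysis completes, fetch the report for a given task id (1 here)
+    print(requests.get(base + '/api/v1/tasks/1/report').json())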
diff --git a/docs/use/use-analysis-mods.rst b/docs/use/use-analysis-mods.rst
new file mode 100644
index 00000000..bb282b27
--- /dev/null
+++ b/docs/use/use-analysis-mods.rst
@@ -0,0 +1,252 @@
+.. _analysis-modules:
+
+Analysis Modules
+================
+
+The analysis modules currently available in MultiScanner are listed by category below.
+
+
+================================= ========================================
+AV Scans
+================================= ========================================
+AVG 2014                          Scans sample with AVG 2014 anti-virus
+ClamAVScan                        Scans sample with ClamAV
+McAfeeScan                        Scans sample with McAfee AntiVirus Command Line
+Microsoft Security Essentials     Scans sample with Microsoft Security Essentials
+`Metadefender <#metadefender>`__  Interacts with OPSWAT Metadefender Core 4 Version 3.x, polling Metadefender for scan results.
+`vtsearch <#vtsearch>`__          Searches VirusTotal for sample’s hash and downloads the report if available
+VFind                             Runs the CyberSoft VFind anti-malware scanner, part of the `VFind Security Toolkit `_.
+================================= ========================================
+
+
+============================= ========================================
+Database
+============================= ========================================
+`NSRL <#nsrl>`__              Looks up a hash in the `National Software Reference Library `_.
+============================= ========================================
+
+
+=================================== ========================================
+Sandbox Detonation
+=================================== ========================================
+`Cuckoo Sandbox <#cuckoo>`__        Submits a sample to Cuckoo Sandbox cluster for analysis.
+`FireEye API <#fireeyeapi>`__       Detonates the sample in FireEye AX via FireEye's API.
+`VxStream <#vxstream>`__            Submits a file to a VxStream Sandbox cluster for analysis.
+=================================== ========================================
+
+
+============================= ========================================
+Machine Learning
+============================= ========================================
+MaliciousMacroBot             Triage office files with `MaliciousMacroBot `_.
+============================= ========================================
+
+
+==================================== ========================================
+Metadata
+==================================== ========================================
+entropy                              Calculates the Shannon entropy of a file.
+`ExifToolsScan <#exiftoolsscan>`__   Scans sample with Exif tools and returns the results.
+fileextensions                       Determines possible file extensions for a file.
+`floss <#floss>`__                   FireEye Labs Obfuscated String Solver uses static analysis techniques to deobfuscate strings from malware binaries.
+`impfuzzy <#impfuzzy>`__             Calculates a fuzzy hash using ssdeep on Windows PE imports.
+`libmagic <#libmagic>`__             Runs libmagic against the files to identify filetype.
+MD5                                  Generates the MD5 hash of the sample.
+`officemeta <#officemeta>`__         Extracts metadata from Microsoft Office documents.
+`pdfinfo <#pdfinfo>`__               Extracts feature information from PDF files using `pdf-parser `_.
+`PEFile <#pefile>`__                 Extracts features from EXE files.
+pehasher                             Computes pehash values using a variety of algorithms: totalhash, anymaster, anymaster_v1_0_1, endgame, crits, and pehashng.
+SHA1                                 Generates the SHA1 hash of the sample.
+SHA256                               Generates the SHA256 hash of the sample.
+`ssdeep <#ssdeeper>`__               Generates context triggered piecewise hashes (CTPH) for files. More information can be found on the `ssdeep website `_.
+`Tika <#tika>`__                     Extracts metadata from the sample using `Tika `__.
+`TrID <#trid>`__                     Runs `TrID `__ against a file.
+UAD                                  Runs the CyberSoft Universal Atomic Disintegrator (UAD) tool, part of the `VFind Security Toolkit `_.
+==================================== ========================================
+
+
+============================= ========================================
+Signatures
+============================= ========================================
+`YaraScan <#yarascan>`__      Scans the sample with Yara and returns the results.
+============================= ========================================
+
+Configuration Options
+---------------------
+
+Parameters common to all modules are listed in the next section, followed by notes and module-specific parameters for those that have them.
+
+Common Parameters
+^^^^^^^^^^^^^^^^^
+
+The parameters below may be used by all modules.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**path**             Location of the executable.
+**cmdline**          An array of command line options to be passed to the executable.
+**host**             The hostname, port, and username of the machine that will be SSH’d into to run the analytic if the executable is not present on the local machine.
+**key**              The SSH key to be used to SSH into the host.
+**replacement path** If the main config is set to copy the scanned files this will be what it replaces the path with. It should be where the network share is mounted.
+**ENABLED**          When set to false, the module will not run.
+==================== =============================
+
+[Cuckoo]
+^^^^^^^^
+This module submits a file to a Cuckoo Sandbox cluster for analysis.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**API URL**          The URL to the API server.
+**WEB URL**          The URL to the Web server.
+**timeout**          The maximum time a sample will run.
+**running timeout**  An additional timeout, if a task is in the running state this many seconds past ``timeout``, the task is considered failed.
+**delete tasks**     When set to True, tasks will be deleted from Cuckoo after detonation. This is to prevent filling up the Cuckoo machine's disk with reports.
+**maec**             When set to True, a `MAEC `_ JSON-based report is added to the Cuckoo JSON report. **NOTE**: Cuckoo needs MAEC reporting enabled to produce results.
+==================== =============================
+
+[ExifToolsScan]
+^^^^^^^^^^^^^^^
+This module scans the file with Exif tools and returns the results.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**remove-entry**     A Python list of ExifTool results that should not be included in the report. File system level attributes are not useful and stripped out.
+==================== =============================
+
+[FireEyeAPI]
+^^^^^^^^^^^^
+This module detonates the sample in FireEye AX via FireEye's API. This "API" version replaces the "FireEye Scan" module.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**API URL**          The URL to the API server.
+**fireeye images**   A Python list of the VMs in FireEye. These are used to generate where to copy the files.
+**username**         Username on the FireEye AX.
+**password**         Password for the FireEye AX.
+**info level**       Options are concise, normal, and extended.
+**timeout**          The maximum time a sample will run.
+**force**            If set to True, will rescan if the sample matches a previous scan.
+**analysis type**    0 = sandbox, 1 = live.
+**application id**   For AX Series appliances (7.7 and higher) and CM Series appliances that manage AX Series appliances (7.7 and higher), setting the application value to -1 allows the AX Series appliance to choose the application. For other appliances, setting the application value to 0 allows the AX Series appliance to choose the application.
+==================== =============================
+
+[floss]
+^^^^^^^
+This module extracts ASCII, UTF-8, stack and obfuscated strings from executable files. More information about module configuration can be found at the `flare-floss `_ documentation.
+
+[impfuzzy]
+^^^^^^^^^^
+This module calculates a fuzzy hash using ssdeep where Windows PE imports is the input. This strategy was originally described in a `blog post `_ from JPCERT/CC.
+
+[libmagic]
+^^^^^^^^^^
+This module runs libmagic against the files.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**magicfile**        The path to the compiled magic file you wish to use. If None, it will use the default one.
+==================== =============================
+
+[Metadefender]
+^^^^^^^^^^^^^^
+
+This module runs Metadefender against the files.
+
+======================= =============================
+Parameter               Description
+======================= =============================
+**timeout**             The maximum time a sample will run.
+**running timeout**     An additional timeout, if a task is in the running state this many seconds past ``timeout``, the task is considered failed.
+**fetch delay seconds** The number of seconds for the module to wait between submitting all samples and polling for scan results. Increase this value if Metadefender is taking a long time to store the samples.
+**poll interval**       The number of seconds between successive queries to Metadefender for scan results. Default is 5 seconds.
+**user agent**          Metadefender user agent string, refer to your Metadefender server configuration for this value. Default is "user agent".
+======================= =============================
+
+[NSRL]
+^^^^^^
+
+This module looks up hashes in the NSRL database. These two parameters are automatically generated. Users must run the nsrl_parse.py tool in the utils/ directory before using this module.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**hash_list**        The path to the NSRL database on the local filesystem, containing the MD5 hash, SHA1 hash, and original file name.
+**offsets**          A file that contains the pointers into the hash_list file. This is necessary to speed up searching of the NSRL database file.
+==================== =============================
+
+[officemeta]
+^^^^^^^^^^^^
+This module extracts metadata from Microsoft Office documents.
+
+**Note**: This module does not support `OOXML `_ documents (e.g., docx, pptx, xlsx).
+
+[pdfinfo]
+^^^^^^^^^
+This module extracts feature information from PDF files. It uses `pdf-parser `_.
+
+[PEFile]
+^^^^^^^^
+This module extracts feature information from EXE files. It uses `pefile `_, which is currently not available for Python 3.
+
+[ssdeeper]
+^^^^^^^^^^
+This module generates context triggered piecewise hashes (CTPH) for the files. More information can be found on the `ssdeep website `_.
+
+[Tika]
+^^^^^^
+This module extracts metadata from the file using `Tika `_. For configuration of the module see the `tika-python `_ documentation.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**remove-entry**     A Python list of Tika results that should not be included in the report.
+==================== =============================
+
+[TrID]
+^^^^^^
+This module runs `TrID `_ against the files. The definition file should be in the same folder as the executable.
+
+[vtsearch]
+^^^^^^^^^^
+This module searches `virustotal `_ for the file's hash and downloads the report if available.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**apikey**           Public/private api key. You can optionally make it a list and the requests will be distributed across them. This is useful when two groups with private api keys want to share the load and reports.
+==================== =============================
+
+[VxStream]
+^^^^^^^^^^
+This module submits a file to a VxStream Sandbox cluster for analysis.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**BASE URL**         The base URL of the VxStream server.
+**API URL**          The URL to the API server (include the /api/ in this URL).
+**API Key**          The user's API key to the API server.
+**API Secret**       The user's secret to the API server.
+**Environment ID**   The environment in which to execute the sample (if you have different sandboxes configured).
+**Verify**           Set to false to ignore TLS certificate errors when querying the VxStream server.
+**timeout**          The maximum time a sample will run.
+**running timeout**  An additional timeout, if a task is in the running state this many seconds past ``timeout``, the task is considered failed.
+==================== =============================
+
+[YaraScan]
+^^^^^^^^^^
+This module scans the files with Yara and returns the results. You will need yara-python installed for this module.
+
+==================== =============================
+Parameter            Description
+==================== =============================
+**ruledir**          The directory to look for rule files in.
+**fileextensions**   A Python array of all valid rule file extensions. Files not ending in one of these will be ignored.
+**ignore-tags**      A Python array of Yara rule tags that will not be included in the report.
+==================== =============================
diff --git a/docs/use/use-analytics.rst b/docs/use/use-analytics.rst
new file mode 100644
index 00000000..88b70378
--- /dev/null
+++ b/docs/use/use-analytics.rst
@@ -0,0 +1,25 @@
+.. _analytics:
+
+Analytics
+=========
+
+Currently, one analytic is available.
+
+ssdeep Comparison
+-----------------
+
+Fuzzy hashing is an effective method to identify similar files based on common byte strings despite changes in the byte order and structure of the files. `ssdeep `_ provides a fuzzy hash implementation and provides the capability to compare hashes. `Virus Bulletin `_ originally described a method for comparing ssdeep hashes at scale.
+
+Comparing ssdeep hashes at scale is a challenge. Therefore, the ssdeep analytic computes ``ssdeep.compare`` for all samples where the result is non-zero and provides the capability to return all samples clustered based on the ssdeep hash.
+
+Elasticsearch
+^^^^^^^^^^^^^
+When possible, it can be effective to push work to the Elasticsearch cluster, which supports horizontal scaling. For the ssdeep comparison, Elasticsearch `NGram Tokenizers `_ are used to compute 7-grams of the chunk and double-chunk portions of the ssdeep hash, as described `here `_. This prevents the comparison of two ssdeep hashes where the result will be zero.
+
+Python
+^^^^^^
+Because we need to compute ``ssdeep.compare``, the ssdeep analytic cannot be done entirely in Elasticsearch. Python is used to query Elasticsearch, compute ``ssdeep.compare`` on the results, and update the documents in Elasticsearch.
+
+Deployment
+^^^^^^^^^^
+`celery beat `_ is used to schedule and kick off the ssdeep comparison task nightly at 2am local time, when the system is experiencing less load from users. This ensures that the analytic will be run on all samples without adding an exorbitant load to the system.
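+
+For illustration, the core of the comparison step described above is sketched below (Elasticsearch querying and document updates are omitted; the inputs are throwaway byte strings):
+
+.. code-block:: python
+
+    import ssdeep  # the ssdeep Python bindings used by this analytic
+
+    h1 = ssdeep.hash(b'The quick brown fox jumps over the lazy dog. ' * 200)
+    h2 = ssdeep.hash(b'The quick brown fox jumps over the lazy cat. ' * 200)
+
+    score = ssdeep.compare(h1, h2)  # 0 (no similarity) through 100 (near-identical)
+    if score > 0:
+        print('cluster these samples together:', score)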
diff --git a/docs/web.md b/docs/use/web-ui.rst
similarity index 51%
rename from docs/web.md
rename to docs/use/web-ui.rst
index 71f3bfee..533fa4ee 100644
--- a/docs/web.md
+++ b/docs/use/web-ui.rst
@@ -1,108 +1,198 @@
-# Web Interface #
+.. _web-ui:

-Submit Files for Analysis
--------------------------
+Web UI
+======

-![MultiScanner Web Interface](imgs/Selection_001.png)
+Submitting Files for Analysis
+------------------------------

When you visit MultiScanner's web interface in a web browser, you'll be greeted by the file submission page. Drag files onto the large drop area in the middle of the page or click it or the "Select File(s)..." button to select one or more files to be uploaded and analyzed.

+.. image:: ../_static/img/Selection_001.png
+   :align: center
+   :scale: 50 %
+   :alt: MultiScanner Web Interface
+
Click on the "Advanced Options" button to change default options and set metadata fields to be added to the scan results.

-![Advanced Options](imgs/Selection_003.png)
+.. image:: ../_static/img/Selection_003.png
+   :align: center
+   :scale: 50 %
+   :alt: Advanced Options

Metadata fields can be added or removed by editing web_config.ini. Metadata field values can be set for individual files by clicking the small black button below and to the right of that filename in the staging area.

-![File Options](imgs/Selection_004.png)
+.. image:: ../_static/img/Selection_004.png
+   :align: center
+   :scale: 60 %
+   :alt: File Options

Change from "Scan" to "Import" to import JSON analysis reports into MultiScanner. This is intended only to be used with the JSON reports you can download from a report page in MultiScanner.

-![Import](imgs/Selection_005.png)
+.. image:: ../_static/img/Selection_005.png
+   :align: center
+   :scale: 60 %
+   :alt: Import

By default, if you resubmit a sample that has already been submitted, MultiScanner will pull the latest report of that sample. If you want MultiScanner to re-scan the sample, set that option in Advanced Options.

-![Re-scan](imgs/Selection_006.png)
+.. image:: ../_static/img/Selection_006.png
+   :align: center
+   :scale: 60 %
+   :alt: Re-scan

If you have a directory of samples you wish to scan at once, we recommend zipping them and uploading the archive with the option to extract archives enabled. You can also specify a password, if the archive file is password-protected. Alternatively you can use the REST API for bulk uploads.

-![Archive Files](imgs/Selection_007.png)
+.. image:: ../_static/img/Selection_007.png
+   :align: center
+   :scale: 60 %
+   :alt: Archive Files

Click the "Scan it!" button to submit the sample to MultiScanner.

-![Scan It!](imgs/Selection_008.png)
+.. image:: ../_static/img/Selection_008.png
+   :align: center
+   :scale: 50 %
+   :alt: Scan It

The progress bars that appear in the file staging area do not indicate the progress of the scan; a full bar merely indicates that the file has been uploaded to MultiScanner. Click on the file to go to its report page.

-![Submission Progress Bar](imgs/Selection_009.png)
+.. image:: ../_static/img/Selection_009.png
+   :align: center
+   :scale: 50 %
+   :alt: Submission Progress Bar

If the analysis has not completed yet, you'll see a "Pending" message.

-![Pending](imgs/Selection_010.png)
+.. image:: ../_static/img/Selection_010.png
+   :align: center
+   :scale: 50 %
+   :alt: Pending

-Analyses and History Pages
---------------------------
+Viewing Analyses
+----------------

Reports can be listed and searched in two different ways. The Analyses page lists the most recent report per sample.

-![Analyses Page](imgs/Selection_011.png)
+.. image:: ../_static/img/Selection_011.png
+   :align: center
+   :scale: 50 %
+   :alt: Analyses Page

The History page lists every report of each sample. So if a file is scanned multiple times, it will only show up once on the Analyses page, but all of the reports will show up on the History page.

-![History Page](imgs/Selection_012.png)
+.. image:: ../_static/img/Selection_012.png
+   :align: center
+   :scale: 50 %
+   :alt: History Page

Both pages display the list of reports and allow you to search them. Click the blue button in the middle to refresh the list of reports.

-![Refresh Button](imgs/Selection_013.png)
+.. image:: ../_static/img/Selection_013.png
+   :align: center
+   :scale: 50 %
+   :alt: Refresh Button

Click on a row in the list to go to that report, and click the red "X" button to delete that report from MultiScanner's Elasticsearch database.

-![Delete Button](imgs/Selection_014.png)
+.. image:: ../_static/img/Selection_014.png
+   :align: center
+   :scale: 50 %
+   :alt: Delete Button

Searching
---------

-![Navbar Search](imgs/Selection_015.png)
+Reports can be searched from any page, with a few options. You can search Analyses to get the most recent scan per file, or search History to get all scans recorded for each file.
+
+.. image:: ../_static/img/Selection_015.png
+   :align: center
+   :scale: 50 %
+   :alt: Navbar Search
+
+* Use the "Default" search type to have wildcards automatically appended to the beginning and end of your search term.
+
+* Use the "Exact" search type to automatically append quotes and search for the exact phrase.
+
+* Use the "Advanced" search type to search with the full power of Lucene query string syntax. Nothing will be automatically appended and you will need to escape any reserved characters yourself.

-Reports can be searched from any page, with a few options. You can search Analyses to get the most recent scan per file, or search History to get all scans recorded for each file. Use the "Default" search type to have wildcards automatically appended to the beginning and end of your search term. Use the "Exact" search type to search automatically append quotes and search for the exact phrase. Finally, use the "Advanced" search type to search with the full power of Lucene query string syntax. Nothing will be automatically appended and you will need to escape any reserved characters yourself. When you click on one of the search results, the search term will be highlighted on the Report page and the report will be expanded and automatically scrolled to the first match.
+When you click on a search result, the search term will be highlighted on the Report page and the report will be expanded and automatically scrolled to the first match.

-![Analyses/History Search](imgs/Selection_016.png)
+.. image:: ../_static/img/Selection_016.png
+   :align: center
+   :scale: 50 %
+   :alt: History Search

-Report page
------------
+Viewing Reports
+---------------

-![Report Page](imgs/Selection_017.png)
+Each report page displays the results of a single analysis.

-Each report page displays the results of a single analysis. Some rows in the report can be expanded or collapsed to reveal more data by clicking on the row header or the "Expand" button. Shift-clicking will also expand or collapse all of it's child rows.
+.. image:: ../_static/img/Selection_017.png
+   :align: center
+   :scale: 50 %
+   :alt: Report Page

-![Expand Button](imgs/Selection_024.png)
+Some rows in the report can be expanded or collapsed to reveal more data by clicking on the row header or the "Expand" button. Shift-clicking will also expand or collapse all of its child rows.
+
+.. image:: ../_static/img/Selection_024.png
+   :align: center
+   :scale: 60 %
+   :alt: Expand Button

The "Expand All" button will expand all rows at once. If they are all expanded, this will turn into a "Collapse All" button that will collapse them all again.

-![Expand All Button](imgs/Selection_018.png)
+.. image:: ../_static/img/Selection_018.png
+   :align: center
+   :scale: 50 %
+   :alt: Expand All Button

As reports can contain a great deal of content, you can search the report to find the exact data you are looking for with the search field located under the report title. The search term, if found, will be highlighted, the matching fields will be expanded, and the page automatically scrolled to the first match.

-![In-Page Search](imgs/Selection_019.png)
+.. image:: ../_static/img/Selection_019.png
+   :align: center
+   :scale: 50 %
+   :alt: In-Page Search

Reports can be tagged by entering text in the Tags input box and hitting the enter key. As you type, a dropdown will appear with suggestions from the tags already in the system. It will pull the list of tags from existing reports, but a pre-populated list of tags can also be provided in web_config.ini when the web interface is set up.

-![Tags](imgs/Selection_020.png)
+.. image:: ../_static/img/Selection_020.png
+   :align: center
+   :scale: 50 %
+   :alt: Tags

You can download the report in a number of different formats using the Download button on the right side. You can download a JSON-formatted version of the report containing all the same data shown on the page. You can also download a MAEC-formatted version of the reports from Cuckoo Sandbox. Finally, you can also download the original sample file as a password-protected ZIP file. The password will be "infected".

-![Download](imgs/Selection_021.png)
+.. image:: ../_static/img/Selection_021.png
+   :align: center
+   :scale: 50 %
+   :alt: Download

Click on "Notes" to open a sidebar where analysts may enter notes or comments.

-![Notes](imgs/Selection_022.png)
+.. image:: ../_static/img/Selection_022.png
+   :align: center
+   :scale: 50 %
+   :alt: Notes

These notes and comments can be edited and deleted. Click the "<" button to collapse this sidebar.

-![Close Notes](imgs/Selection_023.png)
+.. image:: ../_static/img/Selection_023.png
+   :align: center
+   :scale: 50 %
+   :alt: Close Notes

-Analytics
----------
+Viewing Analytics
+-----------------
+
+The Analytics page displays various pieces of advanced analysis. For now, this is limited to ssdeep comparisons.
+
+.. image:: ../_static/img/Selection_002.png
+   :align: center
+   :scale: 50 %
+   :alt: Analytics Page
+
+The table lists samples, with those that have very similar ssdeep hashes grouped together. Other analytics will be added in the future. For more information, see the `analytics`_ section.

-![Analytics Page](imgs/Selection_002.png)
-
-The Analytics page displays various pieces of advanced analysis. For now, this is limited to ssdeep comparisons. The table lists samples, with those that have very similar ssdeep hashes grouped together. Other analytics will be added in the future. For more information, see [this page](../docs/analytics.md).
+
+.. _analytics: analytics