Merge pull request #106 from mitre/feature-celery
Update MultiScanner documentation
awest1339 authored Apr 4, 2018
2 parents 747e800 + 4eaedc5 commit 12e4f92
Showing 84 changed files with 1,309 additions and 541 deletions.
131 changes: 68 additions & 63 deletions README.md
@@ -6,83 +6,88 @@ Introduction
------------
MultiScanner is a file analysis framework that assists the user in evaluating a set
of files by automatically running a suite of tools for the user and aggregating the output.
Tools can be custom built python scripts, web APIs, software running on another machine, etc.
Tools can be custom built Python scripts, web APIs, software running on another machine, etc.
Tools are incorporated by creating modules that run in the MultiScanner framework.

Modules are designed to be quickly written and easily incorporated into the framework.
Currently written and maintained modules are related to malware analytics, but the framework is not limited to that
scope. For a list of modules you can look in [modules](modules), descriptions and config
options can be found in [docs/modules.md](docs/modules.md)
scope. For a list of modules you can look in [modules/](modules). Descriptions and config
options can be found on the [Analysis Modules](http://multiscanner.readthedocs.io/en/latest/use/use-analysis-mods.html) page.

Requirements
------------
Python 3.6 is recommended. Compatibility with 2.7+ and
3.4+ is supported but not as thoroughly maintained and tested. Please submit an issue
or a pull request fixing any issues found with other versions of Python.
MultiScanner also supports a distributed workflow for sample storage, analysis, and
report viewing. This functionality includes a web interface, a REST API, a distributed
file system (GlusterFS), distributed report storage / searching (Elasticsearch), and
distributed task management (Celery / RabbitMQ). Please see [Architecture](http://multiscanner.readthedocs.io/en/latest/arch.html) for more details.

Usage
-----

An installer script is included in the project [install.sh](<install.sh>), which
installs the prerequisites on most systems.
MultiScanner can be used as a command-line interface, a Python API, or a
distributed system with a web interface. See the documentation for more detailed
information on [installation](http://multiscanner.readthedocs.io/en/latest/install.html) and [usage](http://multiscanner.readthedocs.io/en/latest/use/index.html).

Installation
------------
### MultiScanner ###
If you're running on a RedHat or Debian based linux distribution you should try and run
[install.sh](<install.sh>). Otherwise the required python packages are defined in
[requirements.txt](<requirements.txt>).

MultiScanner must have a configuration file to run. Generate the MultiScanner default
configuration by running `python multiscanner.py init` after cloning the repository.
This command can be used to rewrite the configuration file to its default state or,
if new modules have been written, to add their configuration to the configuration
file.

### Analytic Machine ###
Default modules have the option to be run locally or via SSH. The development team
runs MultiScanner on a Linux host and hosts the majority of analytical tools on
a separate Windows machine. The SSH server used in this environment is freeSSHd
from <http://www.freesshd.com/>.

A network share accessible to both the MultiScanner and the Analytic Machines is
required for the multi-machine setup. Once configured, the network share path must
be identified in the configuration file, config.ini. To do this, set the `copyfilesto`
option under `[main]` to be the mount point on the system running MultiScanner.
Modules can have a `replacement path` option, which is the network share mount point
on the analytic machine.
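As a rough illustration of the two settings described above, the following sketch reads them with Python's standard `configparser`. Only `copyfilesto` under `[main]` and the per-module `replacement path` option come from the text; the section name `ExampleModule` and both paths are hypothetical placeholders.

```python
import configparser

# Hypothetical config.ini contents. Only the option names `copyfilesto`
# (under [main]) and `replacement path` are taken from the README text;
# the module section name and the paths are made up for illustration.
EXAMPLE_CONFIG = """
[main]
copyfilesto = /mnt/multiscanner_share

[ExampleModule]
replacement path = X:\\samples
"""


def read_share_paths(config_text, module_name):
    """Return (mount point on the MultiScanner host, mount point on the analytic machine)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    local_mount = parser.get("main", "copyfilesto")
    # Modules are not required to define a replacement path, so fall back to None.
    remote_mount = parser.get(module_name, "replacement path", fallback=None)
    return local_mount, remote_mount


local_mount, remote_mount = read_share_paths(EXAMPLE_CONFIG, "ExampleModule")
print(local_mount, remote_mount)
```

The same network share is therefore described twice: once as seen from the host running MultiScanner and once as seen from the machine running the tool.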

Module Writing
--------------
Modules are intended to be quickly written and incorporated into the framework.
A finished module must be placed in the modules folder before it can be used. The
configuration file does not need to be manually updated. See [docs/module\_writing.md](<docs/module_writing.md>)
for more information.

Module Configuration
--------------------
Modules are configured within the configuration file, config.ini. See
[docs/modules.md](<docs/modules.md>) for more information.

Python API
----------
MultiScanner can be incorporated as a module in another projects. Below is a simple
example of how to import MultiScanner into a Python script.
### Command-Line ###

Install Python (2.7 or 3.4+) if you haven't already.

Then run the following:

``` bash
$ git clone https://github.com/mitre/multiscanner.git
$ cd multiscanner
$ sudo -HE ./install.sh
$ python multiscanner.py init
```

This will generate a default configuration for you. Check `config.ini` to see what
modules are enabled. See [Configuration](http://multiscannerdocs.readthedocs.io/en/latest/install.html#configuration) for more information.

Now you can scan a file (substituting the actual file you want to scan for `<file>`):

``` bash
$ python multiscanner.py <file>
```

You can run the following to get a list of all of MultiScanner's command-line options:

``` bash
$ python multiscanner.py --help
```

**Note**: If you are not on a RedHat or Debian based Linux distribution, instead of
running the `install.sh` script, install pip (if you haven't already) and run the
following:

``` bash
$ pip install -r requirements.txt
```

### Python API ###

``` python
import multiscanner
output = multiscanner.multiscan(FileList)
Results = multiscanner.parse_reports(output, python=True)
multiscanner.config_init(filepath)
output = multiscanner.multiscan(file_list)
results = multiscanner.parse_reports(output, python=True)
```

Results is a dictionary object where each key is a filename of a scanned file.
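A minimal sketch of walking such a dictionary is shown below. It does not call MultiScanner; `example_results` is stand-in data, and the shape of each per-file value (a dictionary keyed by module name) is an assumption made for illustration, as are the module names.

```python
def summarize_results(results):
    """Map each scanned filename to the sorted names of its top-level report keys."""
    summary = {}
    for filename, report in results.items():
        # The per-file value is assumed to be a dict; anything else is
        # reported as having no keys rather than raising.
        summary[filename] = sorted(report) if isinstance(report, dict) else []
    return summary


# Stand-in for the dictionary returned by parse_reports(output, python=True).
# The module names and values below are hypothetical.
example_results = {
    "/tmp/sample_a.bin": {"ModuleB": "clean", "ModuleA": {"score": 0}},
    "/tmp/sample_b.bin": {"ModuleA": {"score": 7}},
}

for filename, keys in sorted(summarize_results(example_results).items()):
    print(filename, keys)
```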
### Web Interface ###

Install the latest versions of [Docker](https://docs.docker.com/engine/installation/)
and [Docker Compose](https://docs.docker.com/compose/install/) if you haven't already.

``` bash
$ git clone https://github.com/mitre/multiscanner.git
$ cd multiscanner
$ docker-compose up
```

`multiscanner.config_init(filepath)` will create a default configuration file at
the location defined by filepath.
You may have to wait a while until all the services are up and running, but then you
can use the web interface by going to `http://localhost:8000` in your web browser.

Distributed MultiScanner
------------------------
MultiScanner is also part of a distributed, scalable file analysis framework, complete with distributed task management, web interface, REST API, and report storage. Please set [Distributed Multiscanner](<docs/distributed_multiscanner.md>) for more details. Additionally, we distribute a standalone Docker container with the base set of features (web UI, REST API, ElasticSearch node) as an introduction to the capabilities of this Distributed MultiScanner. See [here](<docs/docker_standalone.md>) for more details. (*Note*: this standalone container should not be used in production, it is simply a primer on what a full installation would look like).
*Note*: this should not be used in production; it is simply an introduction to what a
full installation would look like. See [here](http://multiscanner.readthedocs.io/en/latest/install.html#standalone-docker-installation) for more details.

Other Reading
Documentation
-------------
For more information on module configuration or writing modules check the
[docs](<docs>) folder.
For more information, see the [full documentation](http://multiscanner.readthedocs.io/) on ReadTheDocs.
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = MultiScanner
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary file added docs/_static/img/Selection_001.png
Binary file added docs/_static/img/Selection_002.png
Binary file added docs/_static/img/Selection_003.png
Binary file added docs/_static/img/Selection_004.png
Binary file added docs/_static/img/Selection_005.png
Binary file added docs/_static/img/Selection_006.png
Binary file added docs/_static/img/Selection_007.png
Binary file added docs/_static/img/Selection_008.png
Binary file added docs/_static/img/Selection_009.png
Binary file added docs/_static/img/Selection_010.png
Binary file added docs/_static/img/Selection_011.png
Binary file added docs/_static/img/Selection_012.png
Binary file added docs/_static/img/Selection_013.png
Binary file added docs/_static/img/Selection_014.png
Binary file added docs/_static/img/Selection_015.png
Binary file added docs/_static/img/Selection_016.png
Binary file added docs/_static/img/Selection_017.png
Binary file added docs/_static/img/Selection_018.png
Binary file added docs/_static/img/Selection_019.png
Binary file added docs/_static/img/Selection_020.png
Binary file added docs/_static/img/Selection_021.png
Binary file added docs/_static/img/Selection_022.png
Binary file added docs/_static/img/Selection_023.png
Binary file added docs/_static/img/Selection_024.png
Binary file added docs/_static/img/arch1.png
Binary file added docs/_static/img/arch2.png
Binary file added docs/_static/img/overview.png
13 changes: 13 additions & 0 deletions docs/_static/theme_overrides.css
@@ -0,0 +1,13 @@
/* override table width restrictions */
@media screen and (min-width: 767px) {

   .wy-table-responsive table td {
      /* !important prevents the common CSS stylesheets from overriding
         this as on RTD they are loaded after this stylesheet */
      white-space: normal !important;
   }

   .wy-table-responsive {
      overflow: visible !important;
   }
}
47 changes: 0 additions & 47 deletions docs/analytics.md

This file was deleted.

104 changes: 104 additions & 0 deletions docs/arch.rst
@@ -0,0 +1,104 @@
Architecture
============

High-level Architecture
-----------------------
There are seven primary components of the MultiScanner architecture, as described below and illustrated in the associated diagram.

.. figure:: _static/img/arch1.png
   :align: center
   :scale: 45 %
   :alt: MultiScanner Architecture

   MultiScanner Architecture
..
**Web Frontend**

The web application runs on `Flask <http://flask.pocoo.org/>`_, uses `Bootstrap <https://getbootstrap.com/>`_ and `jQuery <https://jquery.com/>`_, and is served via Apache. It is essentially an aesthetic wrapper around the REST API. All data and services provided are also available by querying the REST API.


**REST API**

The REST API is also powered by Flask and served via Apache. It has an underlying PostgreSQL database to facilitate task tracking. Additionally, it acts as a gateway to the backend Elasticsearch document store. Searches entered into the web UI will be routed through the REST API and passed to the Elasticsearch cluster. This abstracts the complexity of querying Elasticsearch and gives the user a simple web interface to work with.

**Task Queue**

We use Celery as our distributed task queue.

**Task Tracking**

PostgreSQL is our task management database. It is here that we keep track of scan times, samples, and the status of tasks (pending, complete, failed).

**Distributed File System**

GlusterFS is our distributed file system. Each component that needs access to the raw samples mounts the share via FUSE. We selected GlusterFS because it is more performant in our use case -- storing a large number of small samples -- than a technology like HDFS would be.

**Worker Nodes**

The worker nodes are Celery clients running the MultiScanner Python application. Additionally, we implemented some batching within Celery to improve the performance of our worker nodes (which operate better at scale).

A worker node will wait until there are 100 samples in its queue or 60 seconds have passed (whichever happens first) before kicking off its scan (these values are configurable). All worker nodes have the GlusterFS mounted, which gives access to the samples for scanning. In our setup, we co-locate the worker nodes with the GlusterFS nodes in order to reduce the network load of workers pulling samples from GlusterFS.
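The batching rule described above (start a scan at 100 queued samples or after 60 seconds, whichever comes first) can be sketched in plain Python as follows. This is not the project's Celery code: the class name `BatchCollector` and its methods are hypothetical, and measuring the 60 seconds from the arrival of the first sample in a batch is an assumption.

```python
import time


class BatchCollector:
    """Hypothetical helper: collect samples until a size or time limit is hit."""

    def __init__(self, max_size=100, max_wait=60.0, clock=time.monotonic):
        self.max_size = max_size      # flush once this many samples are queued
        self.max_wait = max_wait      # ...or once this many seconds have passed
        self.clock = clock            # injectable so the timing can be tested
        self._items = []
        self._first_arrival = None

    def add(self, item):
        # Assumption: the timer starts when the first sample of a batch arrives.
        if not self._items:
            self._first_arrival = self.clock()
        self._items.append(item)

    def ready(self):
        """True when the batch should be scanned now."""
        if not self._items:
            return False
        if len(self._items) >= self.max_size:
            return True
        return self.clock() - self._first_arrival >= self.max_wait

    def flush(self):
        """Return the collected batch and reset the collector."""
        batch, self._items = self._items, []
        self._first_arrival = None
        return batch
```

A worker loop would call `add()` for each incoming task, check `ready()`, and pass the result of `flush()` to a single scan, which amortizes scan start-up cost across many samples.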

**Report Storage**

We use Elasticsearch to store the results of our file scans. This is where the true power of this system lies. Elasticsearch allows for performant, full text searching across all our reports and modules. This allows fast access to interesting details from your malware analysis tools, pivoting between samples, and powerful analytics on report output.

.. _complete-workflow:

Complete Workflow
-----------------
Each step of the MultiScanner workflow is described below the diagram.

.. figure:: _static/img/arch2.png
   :align: center
   :scale: 50 %
   :alt: MultiScanner Workflow

   MultiScanner Workflow
..
1. The user submits a sample file through the Web UI (or REST API)

2. The Web UI (or REST API):

   a. Stores the file in the distributed file system (GlusterFS)
   b. Places the task on the task queue (Celery)
   c. Adds an entry to the task management database (PostgreSQL)

3. A worker node:

   a. Pulls the task from the Celery task queue
   b. Retrieves the corresponding sample file from the GlusterFS via its SHA256 value
   c. Analyzes the file
   d. Generates a JSON blob and indexes it into Elasticsearch
   e. Updates the task management database with the task status ("complete")

4. The Web UI (or REST API):

   a. Gets report ID associated with the Task ID
   b. Pulls analysis report from the Elasticsearch datastore
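The four workflow steps can be traced end to end with in-memory stand-ins. The sketch below is illustrative only: plain dictionaries and a `queue.Queue` take the place of GlusterFS, Celery, PostgreSQL, and Elasticsearch, and every function name, the task ID scheme, and the report ID format are hypothetical.

```python
import hashlib
import json
import queue

file_store = {}             # stands in for GlusterFS: SHA256 -> sample bytes
task_queue = queue.Queue()  # stands in for the Celery task queue
task_db = {}                # stands in for PostgreSQL: task ID -> task record
report_store = {}           # stands in for Elasticsearch: report ID -> JSON text


def submit_sample(data):
    """Steps 1-2: store the sample, queue a task, and record it as pending."""
    sha256 = hashlib.sha256(data).hexdigest()
    file_store[sha256] = data
    task_id = len(task_db) + 1
    task_queue.put((task_id, sha256))
    task_db[task_id] = {"sha256": sha256, "status": "pending", "report_id": None}
    return task_id


def run_worker_once():
    """Step 3: pull one task, fetch the sample by SHA256, analyze, index, update."""
    task_id, sha256 = task_queue.get()
    data = file_store[sha256]
    # Placeholder "analysis"; real modules would run here.
    analysis = {"sha256": sha256, "size_bytes": len(data)}
    report_id = "report-%d" % task_id
    report_store[report_id] = json.dumps(analysis)
    task_db[task_id].update(status="complete", report_id=report_id)


def fetch_report(task_id):
    """Step 4: map the task ID to its report ID and load the report."""
    record = task_db[task_id]
    if record["status"] != "complete":
        return None
    return json.loads(report_store[record["report_id"]])


task_id = submit_sample(b"hello")
run_worker_once()
print(fetch_report(task_id))
```

Keying the file store by SHA256 means that resubmitting identical content overwrites the same entry rather than storing a second copy.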

Analysis
--------
Analysis tools are integrated into MultiScanner via modules running in the MultiScanner framework. Tools can be custom built Python scripts, web APIs, or software applications running on different machines. Categories of existing modules include AV scanning, sandbox detonation, metadata extraction, and signature scanning. Modules can be enabled/disabled via a configuration file. Details are provided in the :ref:`analysis-modules` section.

Analytics
---------
Enabling analytics and advanced queries is the primary advantage of running several tools against a sample, extracting as much information as possible, and storing the output in a common datastore. For example, the following types of analytics and queries are possible:

* cluster samples
* outlier samples
* samples for deep-dive analysis
* gaps in current toolset
* machine learning analytics on tool outputs

Reporting
---------
Analysis data captured or generated by MultiScanner is accessible in three ways:

* MultiScanner Web User Interface – Content in the Elasticsearch database is viewable through the Web UI. See :ref:`web-ui` section for details.

* MultiScanner Reports – MultiScanner reports reflect the content of the MultiScanner database and are provided in raw JSON and PDF formats. These reports capture all content associated with a sample.

* STIX-based reports *will soon be* available in multiple formats: JSON, PDF, HTML, and text.