pip3 install -r requirements.txt
python3 main.py -s <slips_db_abs_path> -e <eve.json_abs_path> -g <ground_truth_labeled_zeek_dir_abs_path>
for testing, use this command:
Example of using labeled ground truth dir
python3 main.py -gtd
Example of using ground truth file
python3 main.py -gtf
python3 main.py -s
The tool needs the following 3 to run:
- Slips db
- suricata eve.json
- a labeled conn.log file
Slips stores the AID hash for each conn.log flow in the sqlite db
The SQL table with the AID and label in Slips is called 'flows' inside the flows.sqlite
db
This tool reads the flows.sqlite
db, extracts the labels and AIDs,
and stores them in its' own db stored in output/<date-time>/db.sqlite
it calculates the aid of each read flow on the fly using the (zeek's community_id + ts) combination by using the aid_hash lib from pypi https://pypi.org/project/aid-hash
This tool reads Suricata's eve.json file
if the field event_type is set to 'alert', this tool marks this flow as malicious by suricata.
it calculates the aid of each read flow on the fly using the (community_id + ts) combination by using the aid_hash lib from pypi https://pypi.org/project/aid-hash
The output directory of this tool can be specified using -o, for example:
python3 main.py -e suricata/eve.json -s slips/flows.sqlite -gtf zeek/conn.log.labeled -o some_dir
if -o is not given, it creates a new dir inside output/
with the date and time of the run.
The output of this tool consists of:
- a sqlite db with labels per flow, and labels per time window, it also has the performance errors and, total flows count and the files read of each tool.
The sqlite db created by this tool is stored in a subdir in the output/ dir
for example
output/2023-07-10-14:04:16
-
the metrics printed in the CLI and to the output directory in results.txt at the end of the analysis
-
a metadata file with the versions and files used
- This tool consists of 3 parsers, the ground truth parser runs first, once it's done the slips and suricata parsers start in parallel
- The 3 parsers job is to store the label for each tool in the sqlite database
- labels are stored per flow and per timewindow
- the tool then retrieves the actual and predicted value of each of the given tools and passes them to the calculator for calculating the metrics
The below diagram shows how we parse each tool and the ground truth given file.
Parsers result in the following tables
Timewindow labels are detected in the following way:
A timewindow is 1h interval, the given pcap is split into as many 1h intervals as needed and each interval (timewindow) has 1 label, either malicious or benign
the slips database given to this tool using -s contains a table called alerts where slips stores the malcious timewindows with their label, start and end date.
we read 1h worth of flows, once we find one 'malicious' label, we consider their entire timewindow as malicious, if there is no malicious flows in there, we mark that timewindow as benign
Same as the ground truth. we read 1h worth of flows, once we find one 'malicious' label, we consider their entire timewindow as malicious, if there is no malicious flows in there, we mark that timewindow as benign
If a timewindow was detected by one of the tools, and not detected by the ground truth, for example negative timewindows in slips, we consider the gt label of it as "benign"
The slips database given to this tool using -s contains a table called flows where each flow is stored with its label. The flow is considered malicious by slips if it was part of an alert. Slips detects alerts based on a complex ensembling algorithm, check it Slips documentation for more about this.
The eve.json given to this tool using -e contains flows and event_type = 'alert'. Each alerts is marked as malicious and each flow is marked as benign
Ground truth flows are labeled using the netflow labeler. so each flow has a label either benign or malicious
-
the labels in ground truth zeek dir have to be 'Malicious' or 'Benign' only. if any other label is present this tool will completely discard the flow.
-
ground truth dirs can either be json or tab separated zeek dir or conn.log file
-
all paths given as parameters to this tool must be absolute paths.
-
if any flow doesn't have a label by suricata or slips, this tool considers the flow as benign
-
slips now labels conn.log flows only, just like zeek does when zeek's community_id is enabled as a plugin
-
all flows and timewindows read by a tool, that don't have a matching flow/timewindow in the ground truth file, are discarded. The number of discarded flows is written in the cli at the end of the analysis.
-
we only read event_type= "flow" or "alert" in suricata eve.json files
-
the flows read by suricata, slips and the gt don't have to be the same, meaning that, the final flows count don't have to match because each tool reads the pcap differently
-
timewindow numbers may be negative if a flow is found with a flow timestamp < timestamp of the first flow seen
-
if a slips alert exists in parts of 2 timewindows
-
what we're doing here is marking bith timewindows as malicious
1:30 2:30
│ slips alert │
├──────────────────────┤
1:00 2:00 3:00
├───────────────────────────┼────────────────────────────┤
│ tw 1 tw 2 │
- When given a PCAP with 24hs of flows, why are the sum of the ByTimewindow comparison not equal 24? Because each IP seen in the traffic, will have a label in each of the timewindows. Check the labels_per_tw table in the generated db.sqlite and sort by IPs for verification.
- command for generating all zeek files in the dataset/
zeek -C -r <pcap> tcp_inactivity_timeout=60mins tcp_attempt_delay=1min
- command for labeling conn.log files
python3 netflowlabeler.py -c labels.config -f /path/to/generated/conn.log
- (optional) To label the rest of the Zeek files using an already labeled conn.log file (conn.log.labeled)
zeek-files-labeler.py -l conn.log.labeled -f folder-with-zeek-log-files
Checkout the scripts used for the comparison @ https://github.com/stratosphereips/Scripts-Used-for-Comparison