diff --git a/README.md b/README.md index 1ce6908..99fadc6 100644 --- a/README.md +++ b/README.md @@ -1,21 +1,19 @@ # IRC-Zeek-package -Zeek Package that extracts features of IRC communication that is automatically recognized from pcap file. +IRC Feature Extractor Zeek Package extends the functionality of the Zeek network analysis framework. This package automatically recognizes IRC communication in a packet capture (pcap) file and automatically extracts features from it. +The goal of the feature extraction is to describe the individual IRC communications that occur in the pcap file as accurately as possible. ## Installation -To install the package do the following in the package directory: +To install the package using [Zeek Package Manager](https://packages.zeek.org), run the following command: ```bash -$ git clone git@github.com:stratosphereips/IRC-Zeek-package.git -$ cd IRC-Zeek-package -$ zkg install . +$ zkg install IRC-Zeek-package ``` ## Run -To extract the IRC features on selected pcap file that contains IRC, the only thing that you need to do is to run following command in terminal: +To extract the IRC features from a selected pcap file that contains IRC traffic, run the following command in a terminal: ```bash -$ zeek -r file.pcap irc_feature_extractor +$ zeek IRC-Zeek-package -r file.pcap ``` -output will be redirected `irc_features.log` file in zeek log format. +The output will be stored in the `irc_features.log` file in Zeek log format. The log will look like this: -### Example output log ``` #separator \x09 #set_separator , @@ -31,7 +29,12 @@ T!T@null 192.168.100.103 33 #a925d765 185.61.149.22 2407 1530166710.153128 15356 ``` -### Parsing log output in Python +### Parsing log in Python + +Instead of parsing the log manually, you can use the [ZAT](https://github.com/SuperCowPowers/zat) library in Python to parse it directly into a pandas DataFrame or a list of dictionaries.
+ +Here is an example of how to parse the log as a list of dictionaries: + ```python import zat @@ -51,35 +54,41 @@ for log in reader.readrows(): ``` ## Description -To be able to use IRC data, we separated the IRC communication into the connections between source IP, destination IP and destination port (hereinafter IRC connection)(see Figure below). The source port is neglected because we wanted to to merge communication that was splitted to more TCP connections. When the TCP connection is made, the source port is randomly generated from the unregistered port range and thus we needed to neglect the source port to match the IRC connection with the same source IP, destination IP and destination port. -IRC connection consists of informations that are needed. +Once the data was obtained from the network traffic capture, we extracted the features. We separated the whole pcap into the communications of each individual user. To do that, we grouped the communication into connections defined by the source IP, destination IP, and destination port (hereinafter IRC connection). The source port is randomly chosen from the unregistered port range, which is why the source port is not the same when a new TCP connection is established between the same IP addresses. For this reason, we neglected the source port to match the IRC connection with the same source IP, destination IP, and destination port. This is shown in the figure below, where there are two connections from the source IP address (192.168.0.1) to the same destination IP address (192.168.0.2) using different source ports. + -![IRC Connection Scheme](figs/irc-connection.png) +![alt](figs/irc-connection.png) + +Example of an IRC connection - an IRC connection defined by source IP address 192.168.0.1, destination IP address 192.168.0.2, and destination port 440. The source port is neglected, and therefore one IRC connection can have multiple source ports.
The IP addresses and ports are chosen randomly for demonstration purposes. ## Extracted Features The feature selection is made manually to provide a good means of characterizing malicious communication. Features were computed for each IRC connection. Here is a final list of features that we used in our models. ### Total Packet Size -Total data packets' size in bytes. +Total size in bytes of all packets that were sent in the IRC connection. It reflects how many messages were sent and how long they were. ### Session Duration -Duration of IRC connection in milliseconds. +Duration of the IRC connection in milliseconds, i.e., the difference between the times of the last message and the first message in the IRC connection. ### Number of Messages -Total number of messages in IRC connection. +The total number of messages in the IRC connection. ### Number of Source Ports -Since the source port is neglected in unifying communication into sessions, the source address can use different port per TCP connection when the port is randlomly chosen. We suppose that artificial user could have higher number of source ports than the real user since the number of connections of the artificial user could be higher than the number of connections of the real user. +As mentioned before, the source port is neglected when unifying communication into IRC connections because it is randomly chosen when a TCP connection is established. We suppose that artificial users could have a higher number of source ports than real users, since the number of connections of the artificial users was higher than the number of connections of the real users. ### Message Periodicity -To compute message periodicity, we firstly compute time differences between every message. On this computed sequence of numbers, we apply a fast Fourier transform (FFT).
Fast Fourier transform is an effective algorithm for computing discrete Fourier transform, which we are using to express time sequence as a sum of periodic components and for recovering signal from those components. The output of FFT is a sequence of numbers with the same length as the input. The higher the number on a given position of the output is, the bigger the amplitude on the given position is, and thus it has a more significant influence on the periodicity of the data. The position of the largest element in the FFT's output represents the length of the period which occurrence is the most probable from all other periods. +We suppose that artificial users (e.g., bots controlled by a botnet master) use IRC to send commands periodically, so we wanted to quantify that periodicity. To do that, we created a method that returns a number between 0 and 1: one if the message sequence is perfectly periodic, zero if the message sequence is not periodic at all. + +To compute message periodicity, we first compute the time differences between consecutive messages. We then apply a fast Fourier transform (FFT) to this sequence of numbers. The output of the FFT is a sequence of numbers. The higher the number at a given position of the output, the bigger the amplitude at that position, and thus the more significant its influence on the periodicity of the data. +The position of the largest element in the FFT's output determines the length of the period that is the most significant among all periods. -To compute the quality of the period, we split the data by the length of a period. Then we compute the normalised mean squared error (NMSE) that returns us the resulting number in the interval between 0 and 1 where 1 represents the perfectly periodic messages, and 0 represents not periodic messages at all. +To compute the quality of the most significant period, we split the data by the length of that period.
Then we compute the normalised mean squared error (NMSE), which yields a number in the interval between 0 and 1, where 1 represents perfectly periodic messages and 0 represents messages that are not periodic at all. ![](figs/formula_per.gif) ### Message Word Entropy -To take into account whether the user is sending the same message multiple times in a row, or whether the message contains limited number of words, we compute word entropy across all messages in IRC connection. For computation of word entropy we are using formula below, +To consider whether the user sends the same message multiple times in a row, or whether the messages contain a limited number of words, we compute the word entropy across all messages in the IRC connection. By word entropy we mean a measure of the uncertainty of the words in the messages. For the computation of the word entropy, we use the formula below: ![](figs/formula_entropy.gif) -where n represents number of words and p_i represents the probability that word $i$ is used among all other words. +where n represents the number of words, and p_i represents the probability that word i is used among all other words. ### Username Special Characters Mean -Average usage of non-alphabetic characters in username. +We want to determine whether the username of the user in the IRC communication is randomly generated or not. Therefore, in this feature, we compute the average usage of non-alphabetic characters in the username. + ### Message Special Characters Mean -Average usage of non-alphabetic character across all messages in IRC connection. +With this feature, we obtain the average usage of non-alphabetic characters across all messages in the IRC connection. We apply the same procedure of matching special characters for each message as in the previous case: we match non-alphabetic characters by regex, and then we divide the number of matched characters by the total number of message characters.
Finally, we compute the average of the values obtained for the individual messages. diff --git a/__load__.zeek b/__load__.zeek new file mode 100644 index 0000000..097d8d8 --- /dev/null +++ b/__load__.zeek @@ -0,0 +1 @@ +@load ./irc_feature_extractor.zeek diff --git a/figs/formula_per_blackbox.png b/figs/formula_per_blackbox.png new file mode 100644 index 0000000..3da3018 Binary files /dev/null and b/figs/formula_per_blackbox.png differ diff --git a/figs/irc-rgx.png b/figs/irc-rgx.png new file mode 100644 index 0000000..679583c Binary files /dev/null and b/figs/irc-rgx.png differ diff --git a/figs/periodicity_sketch.png b/figs/periodicity_sketch.png new file mode 100644 index 0000000..a84d271 Binary files /dev/null and b/figs/periodicity_sketch.png differ diff --git a/figs/zkg-logo.png b/figs/zkg-logo.png new file mode 100644 index 0000000..ec769e8 Binary files /dev/null and b/figs/zkg-logo.png differ diff --git a/scripts/__load__.zeek b/scripts/__load__.zeek deleted file mode 100644 index a9add5d..0000000 --- a/scripts/__load__.zeek +++ /dev/null @@ -1 +0,0 @@ -event zeek_init() { print "irc feature extractor is loaded"; } diff --git a/zkg.meta b/zkg.meta index b5d073b..1373bc2 100644 --- a/zkg.meta +++ b/zkg.meta @@ -1,7 +1,3 @@ [package] - description = Zeek Package that extracts features of IRC communication - -tags = zeek plugin, zeekctl plugin, irc, features extraction - -script_dir = scripts +tags = zeek plugin, irc, features extraction
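
As a rough illustration of the word entropy and special-character features described in the README changes above, the following Python sketch may help. It is not the package's Zeek implementation. The function names are hypothetical, and it assumes Shannon entropy with a base-2 logarithm (the formula image is not reproduced here), whitespace tokenization, and that "non-alphabetic" means anything outside `A-Z` and `a-z`.

```python
import math
import re
from collections import Counter

# Assumption: "non-alphabetic" means any character outside A-Z and a-z,
# so digits, punctuation, and whitespace all count as special characters.
NON_ALPHA = re.compile(r"[^A-Za-z]")


def word_entropy(messages):
    """Entropy of the word distribution across all messages.

    Assumes Shannon entropy with log base 2 and whitespace-separated words.
    """
    words = [word for message in messages for word in message.split()]
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    # p_i is the relative frequency of word i among all words.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def special_chars_ratio(text):
    """Share of non-alphabetic characters in a single string."""
    if not text:
        return 0.0
    return len(NON_ALPHA.findall(text)) / len(text)


def message_special_chars_mean(messages):
    """Mean of the per-message special-character ratios."""
    if not messages:
        return 0.0
    return sum(special_chars_ratio(m) for m in messages) / len(messages)
```

Under these assumptions, a connection that repeats one word has an entropy of zero, while a larger and more evenly used vocabulary gives higher values. `special_chars_ratio` can be applied to a username as well as to a single message.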