Copyright 2014 Nedim Srndic, University of Tuebingen
For installation instructions, please consult the INSTALL.rst file.
Hidost consists of two parts: one for PDF and another one for SWF files. These two parts are used to extract features from these two different file types and they both provide data files (in LibSVM format) as output. However, here we will describe them separately as their implementations and ways of use have little in common.
The PDF part was written in C++11 and consists of a toolchain of
executables. It requires as input a text file with a list of paths to
benign PDF files, one per line, and another text file with malicious
PDF files. We will refer to these files as bpdfs.txt and mpdfs.txt in
this description. The output of this toolchain is a file (data.libsvm)
in LibSVM input format that contains the feature vectors of both benign
and malicious PDF files listed in bpdfs.txt and mpdfs.txt.
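The two list files can be written by hand or generated. Below is a minimal sketch, assuming the benign and malicious samples live in two separate directory trees. The function name `write_path_list` and the directory names `benign-pdfs` and `malicious-pdfs` are hypothetical and not part of Hidost; writing absolute paths is a choice made here, not a documented requirement.

```python
import os


def write_path_list(root_dir, list_path, extension=".pdf"):
    """Write the path of every file under root_dir whose name ends with
    extension (case-insensitive) to list_path, one path per line.
    Returns the number of paths written."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(extension):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    paths.sort()  # deterministic order makes repeated runs comparable
    with open(list_path, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)


if __name__ == "__main__":
    # Hypothetical directory names; substitute your own dataset layout.
    for root, listing in (("benign-pdfs", "bpdfs.txt"),
                          ("malicious-pdfs", "mpdfs.txt")):
        if os.path.isdir(root):
            print(listing, write_path_list(root, listing))
```

Passing `extension=".swf"` would produce the corresponding lists for SWF files.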
Follow these steps to obtain the data file:
1. Prepare the files bpdfs.txt and mpdfs.txt.

2. Change to the directory where Hidost has been built, as detailed in
   INSTALL.rst:

       cd build/

3. In order to avoid having to extract tree structures from PDF files
   multiple times, we will cache the extracted structures in cache
   directories, for benign and malicious files separately:

       mkdir cache-ben cache-mal
       ./src/cacher -i bpdfs.txt --compact --values -c cache-ben/ \
           -t10 -m256
       ./src/cacher -i mpdfs.txt --compact --values -c cache-mal/ \
           -t10 -m256

4. We will need the absolute paths of all non-empty cached PDF
   structures in the following steps:

       find $PWD/cache-ben -name '*.pdf' -not -empty >cached-bpdfs.txt
       find $PWD/cache-mal -name '*.pdf' -not -empty >cached-mpdfs.txt
       cat cached-bpdfs.txt cached-mpdfs.txt >cached-pdfs.txt

5. Now we will count the number of PDF files in which each PDF
   structural path occurs:

       ./src/pathcount -i cached-pdfs.txt -o pathcounts.bin

6. The next step is feature selection. We will only take into account
   structural paths present in at least 1,000 PDF files in our dataset:

       ./src/feat-select -i pathcounts.bin -o features.nppf -m1000

7. Finally, we will extract the selected features from all files and
   store the result in the output file data.libsvm:

       ./src/feat-extract -b cached-bpdfs.txt -m cached-mpdfs.txt \
           -f features.nppf --values -o data.libsvm
The output file data.libsvm can now be used for learning and
classification.
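The counting and feature selection steps above amount to a document-frequency filter: a structural path is kept only if it occurs in at least a given number of files. The sketch below illustrates that idea on toy data. It is not the implementation of pathcount or feat-select, it does not read or write pathcounts.bin or features.nppf, and the path strings, function names and threshold of 2 are invented for illustration.

```python
from collections import Counter


def count_paths(documents):
    """Count, for each structural path, the number of documents containing it."""
    counts = Counter()
    for paths in documents:
        counts.update(set(paths))  # set(): count each path once per document
    return counts


def select_features(counts, min_count):
    """Return the sorted paths that occur in at least min_count documents."""
    return sorted(path for path, n in counts.items() if n >= min_count)


# Toy data: each inner list stands for the paths found in one PDF file.
docs = [
    ["/Root/Pages", "/Root/Pages/Kids", "/Root/OpenAction/JS"],
    ["/Root/Pages", "/Root/Pages/Kids", "/Root/Pages"],
    ["/Root/Pages", "/Root/Metadata"],
]
counts = count_paths(docs)
selected = select_features(counts, min_count=2)
print(selected)  # ['/Root/Pages', '/Root/Pages/Kids']
```

Note that the repeated path in the second toy document is counted only once, since the filter is based on the number of files rather than the number of occurrences.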
The SWF part was written in Python 2.7 and Java. It requires as input
a text file with a list of paths to benign SWF files, one per line,
and another text file with malicious SWF files. We will refer to these
files as bswfs.txt and mswfs.txt in this description. The output of
this toolchain is a file (data.libsvm) in LibSVM input format that
contains the feature vectors of both benign and malicious SWF files
listed in bswfs.txt and mswfs.txt.
Follow these steps to obtain the data file:
1. Prepare the files bswfs.txt and mswfs.txt.

2. Change to the directory with the Hidost Python source code,
   hidost/hidost/:

       cd hidost/

3. Use the Python script feat_extract.py to extract all features from
   all SWF files. The features will be pickled to a file called
   features.pickle and the feature vectors will be saved in the output
   file data.libsvm:

       python feat_extract.py -b bswfs.txt -m mswfs.txt \
           -s features.pickle -o data.libsvm
The output file data.libsvm can now be used for learning and
classification.
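As a starting point for working with data.libsvm, here is a minimal sketch of a reader for the general LibSVM text format, in which each line holds a label followed by `index:value` pairs. The function names are hypothetical, and the handling of a trailing `#` comment is an assumption that only matters if such comments are present. The sketch makes no claim about which label value denotes benign or malicious files; check that against your own data.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value}).
    Returns None for blank lines."""
    line = line.split("#", 1)[0].strip()  # drop an optional trailing comment
    if not line:
        return None
    fields = line.split()
    label = float(fields[0])
    features = {}
    for field in fields[1:]:
        index, value = field.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def load_libsvm(lines):
    """Parse an iterable of lines (for example an open file) into
    a list of labels and a list of sparse feature dictionaries."""
    labels, rows = [], []
    for line in lines:
        parsed = parse_libsvm_line(line)
        if parsed is not None:
            labels.append(parsed[0])
            rows.append(parsed[1])
    return labels, rows


if __name__ == "__main__":
    import os
    if os.path.exists("data.libsvm"):
        with open("data.libsvm") as handle:
            labels, rows = load_libsvm(handle)
        print(len(labels), "feature vectors loaded")
```

Keeping each feature vector as a dictionary preserves the sparsity of the format; libraries that read LibSVM files directly can be used instead where available.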
Hidost is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Hidost is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with Hidost. If not, see <http://www.gnu.org/licenses/>.