Code for the paper "Cost-Aware Robust Tree Ensembles for Security Applications" (USENIX Security'21), Yizheng Chen, Shiqi Wang, Weifan Jiang, Asaf Cidon, Suman Jana.
Blog Post: https://surrealyz.medium.com/robust-trees-for-security-577061177320
We utilize security domain knowledge to increase the evasion cost against security classifiers, specifically tree ensemble models that are widely used for security tasks. We propose a new cost modeling method that captures the domain knowledge about features as constraints, and we integrate these cost-driven constraints into the node construction process to train robust tree ensembles. During training, we use the constraints to find the data points that are likely to be perturbed given the costs of the features, and we optimize the quality of the trees with a new robust training algorithm. Our cost-aware training method can be applied to different types of tree ensembles, including random forests (scikit-learn) and gradient boosted decision trees (XGBoost).
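To make the cost-driven constraint idea concrete, here is an illustrative Python sketch (the feature names and bounds are made up, and this is not the repository's actual configuration format): each feature is mapped to an asymmetric perturbation box that reflects how expensive it is for an attacker to decrease or increase it, and robust training only treats points inside that box as attacker-reachable.

```python
# Illustration only: hypothetical features and bounds, not the repository's config format.
# Each feature gets a perturbation box (max decrease, max increase) derived from its
# attacker cost: low-cost features get wide boxes, high-cost features stay nearly fixed.
cost_constraints = {
    "num_redirects":   (0.0, 0.3),   # cheap for an attacker to increase
    "domain_age_days": (0.0, 0.0),   # expensive to change, effectively fixed
    "url_length":      (0.1, 0.1),   # moderately cheap in both directions
}

def perturbation_box(x, feature_order):
    """Per-feature [low, high] interval that robust training treats as attacker-reachable."""
    low, high = [], []
    for name, value in zip(feature_order, x):
        dec, inc = cost_constraints.get(name, (0.0, 0.0))
        low.append(value - dec)
        high.append(value + inc)
    return low, high
```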
- Clone our dev version of scikit-learn
- Check out the robust branch
- We recommend using a virtualenv to install this
- After activating your virtualenv, install the required packages
pip install numpy scipy joblib threadpoolctl Cython
- Then install sklearn with our robust training algorithm
python setup.py install
- Run data/download_data.sh under the current repo (source directory) to download the datasets.
- Example usage (a sketch for loading the saved model follows the command):
python train_rf_one.py --train data/binary_mnist0 --test data/binary_mnist0.t -m models/rf/greedy/sklearn_greedy_binary_mnist.pickle -b -z -n 784 -r -s robust -e 0.3 -c gini --nt 1000 -d 6
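To sanity-check the resulting pickle, a minimal sketch along the following lines should work, assuming the pickle holds a standard scikit-learn classifier object and the binary MNIST libsvm test file from the example above is in place (paths are taken from that example):

```python
# Minimal sketch: load the pickled random forest and score it on the libsvm test set.
import pickle
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score

with open("models/rf/greedy/sklearn_greedy_binary_mnist.pickle", "rb") as f:
    clf = pickle.load(f)

X_test, y_test = load_svmlight_file("data/binary_mnist0.t", n_features=784)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```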
- Clone our dev version of XGBoost RobustTrees
- Check out the greedy branch
- Run build.sh
- Run gunzip on all the *.csv.gz files under RobustTrees/data to obtain the csv datasets. Reading libsvm sometimes has issues in that version of XGBoost, so we converted the datasets to csv files.
- Example usage (a sketch for loading the resulting model in Python follows the command):
./xgboost data/breast_cancer.greedy.conf
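To inspect a trained booster outside the CLI, a minimal sketch like the following can be used, assuming an xgboost Python package that can read the .bin files (ideally the one built from the RobustTrees branch) and a csv test file whose first column is the label; the file paths here are only examples:

```python
# Minimal sketch: load a saved booster and score a csv test set.
# Assumes the label is in the first csv column (XGBoost CLI convention); adjust if needed.
import numpy as np
import xgboost as xgb

bst = xgb.Booster(model_file="models/gbdt/greedy_breast_cancer.bin")  # example path

data = np.loadtxt("breast_cancer.test.csv", delimiter=",")  # example csv test file
y, X = data[:, 0], data[:, 1:]
preds = (bst.predict(xgb.DMatrix(X)) > 0.5).astype(int)
print("test accuracy:", (preds == y).mean())
```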
We evaluated our core training algorithm without cost constraints over four benchmark datasets; see the table below.
Dataset | Train set size | Test set size | Majority class in train, test (%) | # of features |
---|---|---|---|---|
breast-cancer | 546 | 137 | 62.64, 74.45 | 10 |
cod-rna | 59,535 | 271,617 | 66.67, 66.67 | 8 |
ijcnn1 | 49,990 | 91,701 | 90.29, 90.50 | 22 |
MNIST 2 vs. 6 | 11,876 | 1,990 | 50.17, 51.86 | 784 |
We have also evaluated our cost-aware training algorithm over a Twitter spam detection dataset used in the paper "A Domain-Agnostic Approach to Spam-URL Detection via Redirects". We re-extracted 25 features (see Table 7 in our paper) to form the Twitter spam detection dataset.
Twitter spam dataset | Training | Testing |
---|---|---|
Malicious | 130,794 | 55,732 |
Benign | 165,076 | 71,070 |
Total | 295,870 | 126,802 |
Both datasets are available in data/, and the files need to be uncompressed. Please also run cd data/; ./download_data.sh to get the libsvm files under the data/ directory, since some of our Python scripts read the libsvm data.
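For a quick look at the downloaded data (e.g., to confirm the class balance against the tables above), a minimal sketch using scikit-learn's libsvm loader:

```python
# Minimal sketch: inspect one of the libsvm files under data/ (example path).
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("data/cod-rna_s.t")
print("samples:", X.shape[0], "features:", X.shape[1], "positive fraction:", (y == 1).mean())
```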
- Regular training, natural model in the paper: models/gbdt/nature_*.bin
- Chen's robust training algorithm, Chen's model in the paper: models/gbdt/robust_*.bin
- Our training algorithm, ours model in the paper: models/gbdt/greedy_*.bin
- Performance: To evaluate model accuracy, false positive rate, AUC, and plot the ROC curves, please run the following commands:
python scripts/xgboost_roc_plots.py breast_cancer
python scripts/xgboost_roc_plots.py ijcnn
python scripts/xgboost_roc_plots.py cod-rna
python scripts/xgboost_roc_plots.py binary_mnist
- The model performance numbers correspond to Table 3, and the generated plots in roc_plots/ correspond to Figure 7 in the paper (a sketch of the reported metrics follows).
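For reference, the metrics those scripts report can be reproduced with standard scikit-learn calls. A minimal sketch, assuming you already have the true labels y and predicted scores s for a model (how you obtain s depends on the model type):

```python
# Minimal sketch of the reported metrics: accuracy, false positive rate, AUC, ROC points.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

def report(y, s, threshold=0.5):
    pred = (s > threshold).astype(int)
    acc = accuracy_score(y, pred)
    fpr = ((pred == 1) & (y == 0)).sum() / max((y == 0).sum(), 1)
    auc = roc_auc_score(y, s)
    print(f"acc={acc:.4f}  fpr={fpr:.4f}  auc={auc:.4f}")
    return roc_curve(y, s)  # (fprs, tprs, thresholds) for plotting the ROC curve
```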
- Robustness: To evaluate the robustness of the models, we use the MILP attack implemented in xgbKantchelianAttack.py. It uses the Gurobi solver, so you need to obtain a license from Gurobi to use it (they provide a free academic license). A sketch for inspecting the saved adversarial examples follows the per-dataset commands below. First create the output directories:
mkdir logs
mkdir -p adv_examples/gbdt
mkdir -p result/gbdt
- breast_cancer:
for mtype in $(echo 'nature' 'robust' 'greedy'); do dt='breast_cancer'; python xgbKantchelianAttack.py --data 'data/breast_cancer_scale0.test' --model_type 'xgboost' --model "models/gbdt/${mtype}_${dt}.bin" --rand --num_classes 2 --nfeat 10 --feature_start 1 --both --maxone -n 100 --out "result/gbdt/${mtype}_${dt}.txt" --adv "adv_examples/gbdt/${mtype}_${dt}_adv.pickle" >! logs/milp_gbdt_${mtype}_${dt}.log 2>&1&; done
- cod-rna:
for md in $(echo 'nature_cod-rna' 'robust_cod-rna' 'greedy_cod-rna_center_eps0.03'); do python xgbKantchelianAttack.py --data 'data/cod-rna_s.t' --model_type 'xgboost' --model "models/gbdt/${md}.bin" --rand --num_classes 2 --nfeat 8 --feature_start 0 --both --maxone -n 5000 --out "result/gbdt/${md}.txt" --adv "adv_examples/gbdt/${md}_adv.pickle" >! logs/milp_gbdt_${md}.log 2>&1&; done
- ijcnn:
for md in $(echo 'nature_ijcnn' 'robust_ijcnn' 'greedy_ijcnn_center_eps0.02_nr60_md8'); do python xgbKantchelianAttack.py --data 'data/ijcnn1s0.t' --model_type 'xgboost' --model "models/gbdt/${md}.bin" --rand --num_classes 2 --nfeat 22 --feature_start 1 --both --maxone -n 100 --out "result/gbdt/${md}.txt" --adv "adv_examples/gbdt/${md}_adv.pickle" >! logs/milp_gbdt_${md}.log 2>&1&; done
- binary_mnist:
for md in $(echo 'nature_binary_mnist' 'robust_binary_mnist' 'greedy_binary_mnist'); do python xgbKantchelianAttack.py -n 100 --data 'data/binary_mnist_round6.test.csv' --model_type 'xgboost' --model "models/gbdt/${md}.bin" --rand --num_classes 2 --nfeat 784 --both --maxone --feature_start 0 --out "result/gbdt/${md}.txt" --adv "adv_examples/gbdt/${md}.pickle" >! logs/milp_gbdt_${md}.log 2>&1&; done
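After an attack run finishes, the --adv pickle can be used to summarize perturbation sizes. The sketch below assumes the pickle stores pairs of original and adversarial feature vectors; that structure is an assumption, so check what xgbKantchelianAttack.py actually saves before relying on it:

```python
# Sketch only: summarize perturbation sizes from an --adv pickle (example path).
# ASSUMPTION: the pickle holds (original, adversarial) vector pairs; adapt the
# unpacking if xgbKantchelianAttack.py saves a different structure.
import pickle
import numpy as np

with open("adv_examples/gbdt/greedy_breast_cancer_adv.pickle", "rb") as f:
    pairs = pickle.load(f)

linf = [np.max(np.abs(np.asarray(adv) - np.asarray(orig))) for orig, adv in pairs]
print("mean l_inf perturbation:", np.mean(linf))
```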
In the cloned greedy branch of the RobustTrees repo, after building the xgboost binary, the following commands train the natural, Chen's, and ours models respectively:
./xgboost data/breast_cancer.unrob.conf
./xgboost data/breast_cancer.conf
./xgboost data/breast_cancer.greedy.conf
For the cod-rna, ijcnn, and binary_mnist datasets, the commands follow the same style: ./xgboost data/${dataset}.unrob.conf, ./xgboost data/${dataset}.conf, and ./xgboost data/${dataset}.greedy.conf.
- Regular training, natural model in the paper: models/rf/*best*.bin
- Chen's robust training algorithm, Chen's model in the paper: models/rf/*heuristic*.bin
- Our training algorithm, ours model in the paper: models/rf/*robust*.bin
- Performance: To evaluate model accuracy, false positive rate, AUC, and plot the ROC curves, please run the following commands:
python scripts/sklearn_roc_scripts.py breast_cancer
python scripts/sklearn_roc_scripts.py ijcnn
python scripts/sklearn_roc_scripts.py cod-rna
python scripts/sklearn_roc_scripts.py binary_mnist
- The model performance numbers correspond to Table 4, and the generated plots in roc_plots/ correspond to Figure 9 in the paper.
- Robustness: To evaluate the robustness of the models, we use the MILP attack implemented in xgbKantchelianAttack.py. It uses the Gurobi solver, so you need to obtain a license from Gurobi to use it (they provide a free academic license). First create the output directories:
mkdir logs
mkdir -p result/sk-rf
- Use attack_sklearn_selected_RF.py to run the MILP attack against the random forest models.
The script train_all_sklearn.py trains all models, where the splitter choice best is natural, heuristic is Chen's, and robust is ours. You can modify the loop or use train_rf_one.py to train an individual model; a sketch that sweeps the three splitter choices is shown below.
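For example, a small driver like this (a sketch, not part of the repo) sweeps the three splitter choices by shelling out to train_rf_one.py with the flags from the binary MNIST example above; the output paths and flag values may need adjusting for other datasets:

```python
# Minimal sketch: train the natural ('best'), Chen's ('heuristic'), and our ('robust')
# random forest for one dataset by invoking train_rf_one.py three times.
# Flag values mirror the binary MNIST example earlier in this README.
import subprocess

for splitter in ["best", "heuristic", "robust"]:
    subprocess.run([
        "python", "train_rf_one.py",
        "--train", "data/binary_mnist0", "--test", "data/binary_mnist0.t",
        "-m", f"models/rf/sklearn_{splitter}_binary_mnist.pickle",  # example output path
        "-b", "-z", "-n", "784", "-r",
        "-s", splitter, "-e", "0.3", "-c", "gini", "--nt", "1000", "-d", "6",
    ], check=True)
```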
The trained GBDT models for the Twitter spam dataset are under models/gbdt/twitter/. For example, to train model M19, run this in RobustTrees:
./xgboost data/twitter_spam.greedy.flex.conf
- Performance: run scripts/model_accuracy.py.
- Robustness: The following are examples to run the six attacks against the twitter_spam_nature model. Change the model name accordingly for other models.
- l_1:
md='twitter_spam_nature'; r='_l1'; python xgbKantchelianAttack.py -n 100 --order 1 --data 'data/500_malicious.libsvm' --model_type 'xgboost' --model "models/twitter/${md}.bin" --num_classes 2 --nfeat 25 --maxone --feature_start 0 --out "result/gbdt/adap_${md}${r}.txt" --adv "adv_examples/gbdt/${md}${r}.pickle" >! logs/milp_gbdt_adap_${md}${r}.log &
- l_2:
md='twitter_spam_nature'; r='_l2'; python xgbKantchelianAttack.py -n 100 --order 2 --data 'data/500_malicious.libsvm' --model_type 'xgboost' --model "models/twitter/${md}.bin" --num_classes 2 --nfeat 25 --maxone --feature_start 0 --out "result/gbdt/adap_${md}${r}.txt" --adv "adv_examples/gbdt/${md}${r}.pickle" >! logs/milp_gbdt_adap_${md}${r}.log &
- cost_1:
o='obj1';b='bound_obj1'; md='twitter_spam_nature'; python flexible_xgbKantchelianAttack_cost.py -n 100 --weight config/weight/${o}.json -b "config/eps/${b}.json" --data 'data/500_malicious.libsvm' --model_type 'xgboost' --model "models/twitter/${md}.bin" --num_classes 2 --feature_start 0 --out "result/gbdt/adap_${md}_${o}.txt" --adv "adv_examples/gbdt/${md}_${o}.pickle" >! logs/milp_gbdt_adap_${md}_${o}.log &
- cost_2:
o='obj2'; md='twitter_spam_nature'; python flexible_xgbKantchelianAttack_cost.py -n 100 --weight "config/weight/${o}.json" --data 'data/500_malicious.libsvm' --model_type 'xgboost' --model "models/twitter/${md}.bin" --num_classes 2 --feature_start 0 --out "result/gbdt/adap_${md}_${o}.txt" --adv "adv_examples/gbdt/${md}_${o}.pickle" >! logs/milp_gbdt_adap_${md}_${o}.log &
- cost_3:
o='obj3';b='bound_obj3'; md='twitter_spam_nature'; python flexible_xgbKantchelianAttack_cost.py -n 100 --weight config/weight/${o}.json -b "config/eps/${b}.json" --data 'data/500_malicious.libsvm' --model_type 'xgboost' --model "models/twitter/${md}.bin" --num_classes 2 --feature_start 0 --out "result/gbdt/adap_${md}_${o}.txt" --adv "adv_examples/gbdt/${md}_${o}.pickle" >! logs/milp_gbdt_adap_${md}_${o}.log &
- cost_4:
o='obj4';b='bound_obj4'; md='twitter_spam_nature'; python flexible_xgbKantchelianAttack_cost.py -n 100 --weight config/weight/${o}.json -b "config/eps/${b}.json" --data 'data/500_malicious.libsvm' --model_type 'xgboost' --model "models/twitter/${md}.bin" --num_classes 2 --feature_start 0 --out "result/gbdt/adap_${md}_${o}.txt" --adv "adv_examples/gbdt/${md}_${o}.pickle" >! logs/milp_gbdt_adap_${md}_${o}.log &