This folder contains the scripts to preprocess the raw datasets and, once they have been preprocessed, to generate the input data for each reproduced algorithm.
The two scripts `preprocess_ml1m` and `preprocess_lastfm1K` that preprocess the raw datasets are similar, and the proposed
commands are the ones to exactly re-create the datasets that we used in our experiments. In particular, each dataset
is split into train, validation, and test sets with a 7:1:2 ratio. If a reproduced algorithm does not use a validation
set, then the train set becomes the concatenation of train and validation, but this operation will be performed by the
other script in this folder.
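As a rough sketch of the per-user 7:1:2 split described above (assuming a chronological per-user split, as the `per_user_timestamp` option suggests; the actual logic lives in the preprocessing scripts, so treat this as illustrative only):

```python
from collections import defaultdict

def split_per_user(interactions, ratios=(0.7, 0.1, 0.2)):
    """Split each user's (user, item, timestamp) rows chronologically
    into train/validation/test according to the given ratios."""
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row[0]].append(row)
    train, val, test = [], [], []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[2])          # oldest interactions first
        n = len(rows)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        train += rows[:n_train]
        val += rows[n_train:n_train + n_val]
        test += rows[n_train + n_val:]         # most recent go to test
    return train, val, test
```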
The preprocessing of MovieLens 1M is performed using the file `preprocess_ml1m`, and the following command should be executed from the root of the repository:
python -m Preprocessing.preprocess_ml1m -dataset movielens --dataset_size 1m
--subdatasets ratings --dataset_split train --users_field user_id --items_field movie_id
--dataset_columns user_id movie_id user_rating user_gender bucketized_user_age --rating_field user_rating
--sensitive_field user_gender --train_val_test_split_type per_user_timestamp --train_val_test_split 70% 10% 20%
--attribute_to_binary bucketized_user_age --binary_le_delimiter 34 -model Pointwise --n_reps 2
--overwrite_preprocessed_dataset
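Our reading of the `--attribute_to_binary`/`--binary_le_delimiter` pair above is that the sensitive attribute is binarized into two groups at the given threshold, with "le" meaning less-than-or-equal (this is an assumption based on the flag names, not the script's actual code):

```python
def binarize(values, delimiter):
    """Map attribute values to two groups: True for values <= delimiter
    (e.g. age <= 34 for MovieLens 1M), False otherwise."""
    return [v <= delimiter for v in values]
```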
The preprocessing of Last.FM 1K is performed using the file `preprocess_lastfm1K`, and the following command should be executed from the root of the repository:
python -m Preprocessing.preprocess_lastfm1K -dataset lastfm --dataset_size 1K
--subdatasets plays users --dataset_split train train --users_field user_id --items_field artist_id
--dataset_columns user_id artist_id plays - user_id user_gender user_age --rating_field user_rating
--sensitive_field user_gender --train_val_test_split_type per_user_random --train_val_test_split 70% 10% 20%
--min_interactions 20 --attribute_to_binary user_age --binary_le_delimiter 24 -model Pointwise --n_reps 2
--overwrite_preprocessed_dataset
Last.FM 1K is not part of the TensorFlow Datasets catalog, so if the TensorFlow Datasets library does not do it automatically, it may be necessary to execute the following command inside the directory data/datasets/lastfm:
tfds build
which will download the dataset from the original website and prepare it for preprocessing.
The commands are split across multiple lines for readability; before executing one, delete the newline at the end of each line and run the command in its one-line form.
Once the datasets have been preprocessed, you can create the input data necessary for each reproduced algorithm. The input
data will be saved in a folder with the same name as the folder inside reproducibility_study that contains the related
algorithm of the reproduced paper. Before executing this script, some variables may need to be changed because
of different filenames. This is the case for the variable `kwargs` at line 62 for MovieLens 1M and line 76 for Last.FM 1K:
- `model_paths`: may need to be modified depending on the filename of the predictions on the test set of each model. For instance, you may need to change the path of the test-set predictions of PMF at the key `'PMF'` of this internal dictionary to the path of your own predictions for the same model, for MovieLens 1M (`kwargs` at line 62) or Last.FM 1K (`kwargs` at line 76).
- `test_path`: the path of the test set used to evaluate the models of the NLR codebase of Li et al. (this should not change unless you rename the file of the generated test set).
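For orientation, the relevant part of `kwargs` might look roughly like the following. Every path here is a placeholder, not the actual filename used by the script; check generate_input_data.py for the real keys and values:

```python
# Hypothetical shape of the `kwargs` dictionary (illustrative only):
kwargs = {
    "model_paths": {
        # one entry per model; point each key at YOUR test-set predictions
        "PMF": "path/to/your/PMF_test_predictions.csv",  # placeholder path
    },
    # test set used by the NLR codebase of Li et al.
    "test_path": "path/to/the/generated/test_set.csv",   # placeholder path
}
```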
If all the paths are correct you can generate the input data by executing the following commands from the root of the repository:
MovieLens 1M Gender
python -m Preprocessing.generate_input_data -dataset movielens
--dataset_size 1m -model Pointwise --n_reps 2 --train_val_test_split_type per_user_timestamp
--train_val_test_split 70% 10% 20% --users_field user_id --items_field movie_id --rating_field user_rating
--sensitive_field user_gender
MovieLens 1M Age
python -m Preprocessing.generate_input_data -dataset movielens
--dataset_size 1m -model Pointwise --n_reps 2 --train_val_test_split_type per_user_timestamp
--train_val_test_split 70% 10% 20% --users_field user_id --items_field movie_id --rating_field user_rating
--sensitive_field bucketized_user_age
Last.FM 1K Gender
python -m Preprocessing.generate_input_data -dataset lastfm --dataset_size 1K
-model Pointwise --n_reps 2 --min_interactions 20 --train_val_test_split_type per_user_random
--train_val_test_split 70% 10% 20% --users_field user_id --items_field artist_id --rating_field user_rating
--sensitive_field user_gender
Last.FM 1K Age
python -m Preprocessing.generate_input_data -dataset lastfm --dataset_size 1K
-model Pointwise --n_reps 2 --min_interactions 20 --train_val_test_split_type per_user_random
--train_val_test_split 70% 10% 20% --users_field user_id --items_field artist_id --rating_field user_rating
--sensitive_field user_age
Here we explain some of the parameters passed to this script to better understand their behaviour:
- `-dataset`: the name of the dataset. It can be one of [`movielens`, `lastfm`].
- `--dataset_size`: the size of the dataset. For MovieLens it can be `100k` or `1m`; for Last.FM it can be `1K` or `360K`.
- `--sensitive_field`: the sensitive field of the dataset, used to create the files that contain sensitive information and that are necessary for some of the reproduced algorithms.
- `-model`: it is `Pointwise` because the NLR codebase of Li et al. uses pointwise data to train its models.
- `--n_reps`: specifies the number of times the train data should be repeated with negative labels. Setting it to 2 will generate 1 negative sample for each interaction in the train set (this is also done for the validation set for the NLR codebase of Li et al.).
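The `--n_reps` behaviour described above could be sketched as follows, assuming negatives are drawn uniformly from items the user has not interacted with (a simplified illustration, not the project's actual sampling code):

```python
import random

def add_negative_samples(train, all_items, n_reps=2, seed=0):
    """Label each observed (user, item) pair as positive, then repeat the
    train data (n_reps - 1) times with sampled unseen items as negatives."""
    rng = random.Random(seed)
    seen = {}
    for user, item in train:
        seen.setdefault(user, set()).add(item)
    samples = [(user, item, 1) for user, item in train]   # positives
    for _ in range(n_reps - 1):
        for user, _item in train:
            neg = rng.choice(all_items)
            while neg in seen[user]:                      # resample until unseen
                neg = rng.choice(all_items)
            samples.append((user, neg, 0))                # one negative per positive
    return samples
```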
However, these commands are not yet complete, because we still need to specify which reproduced paper's data we want
to generate. This is done by adding one or more of the following parameters, which can also be found inside the function
`_add_other_features_args` of the Argument Parser in helpers/recsys_arg_parser.py:
- `--create_user_oriented_fairness_files_input_data`: creates the input data for the fairness-aware post-processing method of Li et al., which will be saved in the folder reproducibility_study/Preprocessing/input_data/DATASET/Li et al
- `--create_nlr_input_data`: creates the input data for the NLR codebase of Li et al., which will be saved in the folder reproducibility_study/Preprocessing/input_data/DATASET/Li et al/NLR
- `--create_co_clustering_for_fair_input_data`: creates the input data for the reproduced algorithm of Frisch et al., which will be saved in the folder reproducibility_study/Preprocessing/input_data/DATASET/Frisch et al
- `--create_all_the_cool_kids_input_data`: creates the input data for the reproduced algorithm of Ekstrand et al., which will be saved in the folder reproducibility_study/Preprocessing/input_data/DATASET/Ekstrand et al
- `--create_librec_auto_input_data`: creates the input data for the reproduced algorithm of Burke et al., which will be saved in the folder reproducibility_study/Preprocessing/input_data/DATASET/Burke et al
The term DATASET inside the paths will be one of [`movielens_1m`, `filtered(20)_lastfm_1K`].
As a demonstration, to create the input data for Ekstrand et al. and Burke et al. for Last.FM 1K with gender groups, the following command must be executed from the root of the repository:
python -m Preprocessing.generate_input_data -dataset lastfm --dataset_size 1K
-model Pointwise --n_reps 2 --min_interactions 20 --train_val_test_split_type per_user_random
--train_val_test_split 70% 10% 20% --users_field user_id --items_field artist_id --rating_field user_rating
--sensitive_field user_gender --create_all_the_cool_kids_input_data --create_librec_auto_input_data