Kdd2020 tutorial updated #1208

Merged
merged 33 commits into from
Sep 25, 2020
ffbce15
add kdd2020 tutorials for knowledge-aware recommendations
Leavingseason Jul 25, 2020
141eb91
v0: ready for running
Leavingseason Jul 25, 2020
184d289
add environment config files
Leavingseason Jul 25, 2020
8f37eb8
text changes
Leavingseason Jul 25, 2020
70f0c47
update notebook step1
Leavingseason Jul 25, 2020
eacac58
update notebook step2
Leavingseason Jul 25, 2020
9db5623
update notebook step3
Leavingseason Jul 25, 2020
a38528d
update notebook steps
Leavingseason Jul 27, 2020
aa6d9d9
add README
yueguoguo Jul 27, 2020
1949734
update readme
yueguoguo Jul 27, 2020
6238d41
Merge pull request #1164 from microsoft/le/kdd_tutorial
Leavingseason Jul 27, 2020
171d244
update notebooks; move functions to utils
Leavingseason Jul 27, 2020
681239e
update notebook step 3
Leavingseason Jul 27, 2020
c101ad7
update step1 and step5
Leavingseason Jul 31, 2020
5918168
fix LightGCN bug and update step2 step5
Leavingseason Jul 31, 2020
d840596
add reco_gpu_kdd.yaml
Leavingseason Jul 31, 2020
d7c0c0e
delete unused folder; add cpu yaml
Leavingseason Aug 24, 2020
1b40882
update reco_cpu_kdd.yaml
Leavingseason Aug 24, 2020
a2679a6
update yaml config: remove pytorch and fastai
Leavingseason Aug 24, 2020
950dfd8
Update README.md
Leavingseason Aug 25, 2020
a9aa7ed
add scripts for subgraph analysis
Leavingseason Aug 25, 2020
cc9c645
Update reco_gpu_kdd.yaml
miguelgfierro Aug 25, 2020
03d3b19
Merge branch 'staging' into kdd2020_tutorial
Leavingseason Sep 19, 2020
283a3bd
Merge branch 'staging' into kdd2020_tutorial
Leavingseason Sep 24, 2020
e884a69
update yaml
Leavingseason Sep 24, 2020
d854c39
Adjust structure; update comments
Leavingseason Sep 25, 2020
df9d996
add test cases
Leavingseason Sep 25, 2020
9394ede
add gensim to yaml env config
Leavingseason Sep 25, 2020
464f5fb
add liscense info
Leavingseason Sep 25, 2020
b55f3d3
move the tutorial to examples/07_tutorials
Leavingseason Sep 25, 2020
7058113
add yaml and sh files
Leavingseason Sep 25, 2020
e13cf67
update step4
Leavingseason Sep 25, 2020
2d7249d
update README
Leavingseason Sep 25, 2020
8 changes: 8 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -156,6 +156,14 @@ ml-20m/
*.model
*.mml
nohup.out

##### kdd 2020 tutorial data folder
scenarios/KDD2020-tutorial/data_folder/
scenarios/academic/KDD2020-tutorial/data_folder/
examples/07_tutorials/KDD2020-tutorial/data_folder/

*.vec
*.tsv
*.sh

tests/resources/
Leavingseason marked this conversation as resolved.
11 changes: 9 additions & 2 deletions examples/00_quick_start/dkn_MIND.ipynb
@@ -390,12 +390,19 @@
"\\[3\\] Wu, Fangzhao, et al. \"MIND: A Large-scale Dataset for News Recommendation\" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>\n",
"\\[4\\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python (reco_gpu)",
"display_name": "reco_gpu",
"language": "python",
"name": "reco_gpu"
},
@@ -409,7 +416,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
"version": "3.6.8"
},
"pycharm": {
"stem_cell": {
@@ -798,12 +798,19 @@
"\n",
"2. LightGCN implementation [TensorFlow]: https://github.com/kuandeng/lightgcn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.5",
"language": "python",
"name": "python3"
},
@@ -817,7 +824,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.5.6"
}
},
"nbformat": 4,
46 changes: 46 additions & 0 deletions examples/07_tutorials/KDD2020-tutorial/README.md
@@ -0,0 +1,46 @@
# Environment setup
The following setup instructions assume a Linux system. Testing was performed on Ubuntu Linux.
We use Conda to install packages and manage the virtual environment. Type ``` conda list ``` to check whether conda is installed on your machine. If it is not, follow the instructions at https://conda.io/projects/conda/en/latest/user-guide/install/linux.html to install either Miniconda or Anaconda (preferred) before proceeding.

1. Clone the repository
```bash
git clone https://github.com/microsoft/recommenders
```

1. Navigate to the tutorial folder. The tutorial materials are located in `recommenders/examples/07_tutorials/KDD2020-tutorial`.
```bash
cd recommenders/examples/07_tutorials/KDD2020-tutorial
```
1. Download the dataset
1. Download the dataset for the hands-on experiments and unzip it to `data_folder`:
```bash
wget https://recodatasets.blob.core.windows.net/kdd2020/data_folder.zip
unzip data_folder.zip -d data_folder
```
After you unzip the file, `data_folder` contains two folders: `raw` and `my_cached`. The `raw` folder contains the original txt files from the COVID MAG dataset. The `my_cached` folder contains processed data files; if you miss a step during the hands-on tutorial, you can catch up by copying the corresponding files into the experiment folders.
1. Install the dependencies
1. Model pre-training uses a tool that converts the original data into embeddings. The tool requires `g++`. The following command installs `g++` on an apt-based Linux system such as Ubuntu.
```bash
sudo apt-get install g++
```
1. The Python scripts run in a conda environment with the dependencies installed. Create and activate this environment from the `reco_gpu_kdd.yaml` file provided with the tutorial by using the following commands.
```bash
conda env create -n kdd_tutorial_2020 -f reco_gpu_kdd.yaml
conda activate kdd_tutorial_2020
```
1. The tutorial will be conducted using Jupyter notebooks. Register the newly created conda environment as a kernel with the Jupyter notebook server:
```bash
python -m ipykernel install --user --name kdd_tutorial_2020 --display-name "Python (kdd tutorial)"
```
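
Before launching the notebooks, it can help to confirm that the steps above completed. The following is a minimal sketch, not part of the tutorial code: the function name `missing_prerequisites` is hypothetical, and the only facts it relies on are the two folder names (`raw`, `my_cached`) and the `g++` and `conda` tools mentioned in the steps above.

```python
import shutil
from pathlib import Path

# Subfolders that the setup steps say appear after unzipping data_folder.zip.
EXPECTED_SUBFOLDERS = ("raw", "my_cached")


def missing_prerequisites(data_folder, tools=("g++", "conda")):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration only.
    """
    problems = []
    root = Path(data_folder)
    for name in EXPECTED_SUBFOLDERS:
        if not (root / name).is_dir():
            problems.append(f"missing folder: {root / name}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"missing tool on PATH: {tool}")
    return problems


if __name__ == "__main__":
    # Run from the tutorial folder, where data_folder was unzipped.
    for problem in missing_prerequisites("data_folder"):
        print(problem)
```

If the script prints nothing, the data layout and tools described above were found.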

# Tutorial notebooks/scripts
After the setup, launch the notebooks locally with the command
```bash
jupyter notebook --port=8080
```
The notebooks can then be opened in a browser at `localhost:8080`.
Alternatively, if the Jupyter notebook server runs on a remote machine, launch it with the following command, replacing `10.214.70.89` with the IP address of that machine.
```bash
jupyter notebook --no-browser --ip=10.214.70.89 --port=8080
```
From the local browser, the notebooks can then be opened at `10.214.70.89:8080`.
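
The two launch modes differ only in the `--no-browser` and `--ip` flags. A small sketch of that difference follows; the helper name `notebook_command` is hypothetical, and it only assembles the same flags shown in the commands above.

```python
def notebook_command(port=8080, ip=None):
    """Build the `jupyter notebook` argument list and the address to open.

    Hypothetical helper for illustration: ip=None means a local launch.
    """
    args = ["jupyter", "notebook"]
    if ip is not None:
        # Remote server: skip opening a browser there and bind to the given IP.
        args += ["--no-browser", f"--ip={ip}"]
    args.append(f"--port={port}")
    host = ip if ip is not None else "localhost"
    return args, f"{host}:{port}"


if __name__ == "__main__":
    args, address = notebook_command(ip="10.214.70.89")
    print(" ".join(args))
    print("open in browser:", address)
```

The returned list can be passed to `subprocess.run(args, check=True)` to start the server.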
61 changes: 61 additions & 0 deletions examples/07_tutorials/KDD2020-tutorial/dkn.yaml
@@ -0,0 +1,61 @@
data:
  doc_size: 15 # Fixed feature length per document: documents with more than doc_size words should be truncated to doc_size words, and shorter documents should be padded with 0.
  his_size: 20 # Maximum length of the user click history: only the last his_size clicks are kept automatically, and shorter histories are padded with 0 automatically.
  word_size: 194755 # word vocabulary size
  entity_size: 57267 # entity vocabulary size
  data_format: dkn

info:
  metrics:
  - auc
  pairwise_metrics:
  - group_auc
  - mean_mrr
  - ndcg@2;4;6
  show_step: 10000 # print loss every show_step batches

model:
  method : classification
  activation:
  - sigmoid
  attention_activation: relu
  attention_dropout: 0.0
  attention_layer_sizes: 32
  dim: 32 # word embedding dim
  use_entity: true # use entity embedding
  use_context: true # use context embedding

  entity_dim: 32 # entity embedding dim
  entity_embedding_method: TransE
  transform: true # add a transform layer for entity and context embeddings

  dropout:
  - 0.0
  filter_sizes: # window sizes of the kcnn filters
  - 1
  - 2
  - 3
  layer_sizes: # layer sizes for the final prediction score layer
  - 300
  # model_type: DKN_without_context
  model_type: dkn
  num_filters: 50 # number of filters for each filter size in the kcnn part
  infer_model_name : epoch_2

train:
  batch_size: 100
  embed_l1: 0.000
  embed_l2: 0.000001
  epochs: 50
  init_method: uniform
  init_value: 0.01
  layer_l1: 0.000
  layer_l2: 0.000001
  learning_rate: 0.00005
  loss: log_loss
  optimizer: adam
  save_model: True
  save_epoch : 1 # save model every save_epoch epochs
  enable_BN : False
  is_clip_norm: False
  max_grad_norm: 0.5
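
The `doc_size` and `his_size` comments describe a fixed-length rule: truncate when too long, pad with 0 when too short. A minimal sketch of that rule is shown below. The function `fix_length` is hypothetical and is not the DKN data iterator; keeping the leading words of a document and padding on the right are assumptions of this sketch, since the config comments only specify the target length, the pad value 0, and that the last `his_size` clicks are kept.

```python
def fix_length(ids, size, keep="first", pad_id=0):
    """Truncate or pad a list of integer ids to exactly `size` entries.

    keep="first" keeps the leading ids (assumed here for document words);
    keep="last" keeps the trailing ids (the most recent clicks).
    Padding is appended on the right, which is an assumption of this sketch.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    kept = ids[:size] if keep == "first" else ids[-size:]
    return list(kept) + [pad_id] * (size - len(kept))


if __name__ == "__main__":
    doc_size, his_size = 15, 20  # values from the config above
    short_doc = [5, 9, 2]
    long_history = list(range(1, 26))  # 25 clicked items, oldest first
    print(fix_length(short_doc, doc_size))
    print(fix_length(long_history, his_size, keep="last"))
```

With these inputs, the short document is padded to 15 ids and the history keeps only its 20 most recent items.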
22 changes: 22 additions & 0 deletions examples/07_tutorials/KDD2020-tutorial/lightgcn.yaml
@@ -0,0 +1,22 @@
#model
model:
  model_type : "lightgcn"
  embed_size : 64 # the embedding dimension of users and items
  n_layers : 3 # number of layers of the model

#train
train:
  batch_size : 1024
  decay : 0.0001 # l2 regularization for embedding parameters
  epochs : 1000 # number of epochs for training
  learning_rate : 0.001
  eval_epoch : -1 # if it is not -1, evaluate the model every eval_epoch; -1 means that evaluation will not be performed during training
  top_k : 20 # number of items to recommend when calculating evaluation metrics

#show info
#metric : "recall", "ndcg", "precision", "map"
info:
  save_model : True # whether to save model
  save_epoch : 1 # if save_model is set to True, save the model every save_epoch
  metrics : ["recall", "ndcg", "precision", "map"] # metrics for evaluation
  MODEL_DIR : ./tests/resources/deeprec/lightgcn/model/lightgcn_model/ # directory of saved models
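
The `eval_epoch`, `save_model`, and `save_epoch` comments together define when evaluation and checkpointing happen during training. A minimal sketch of that schedule follows. The function `epoch_actions` is hypothetical, and it assumes 1-based epoch numbers and that "every N epochs" means epochs divisible by N; neither detail is stated in the config itself.

```python
def epoch_actions(epoch, eval_epoch=-1, save_model=True, save_epoch=1):
    """Return (evaluate, save) flags for a 1-based epoch number.

    Hypothetical helper: eval_epoch=-1 disables evaluation during training,
    as described in the config comment.
    """
    evaluate = eval_epoch > 0 and epoch % eval_epoch == 0
    save = save_model and epoch % save_epoch == 0
    return evaluate, save


if __name__ == "__main__":
    # Defaults mirror the config above: no evaluation, save after every epoch.
    for epoch in range(1, 4):
        print(epoch, epoch_actions(epoch))
```

Under the values in this config, no epoch triggers evaluation and every epoch triggers a save.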