Kdd2020 tutorial updated #1208

Merged Sep 25, 2020 · 33 commits

Changes from 1 commit

Commits (33)
ffbce15
add kdd2020 tutorials for knowledge-aware recommendations
Leavingseason Jul 25, 2020
141eb91
v0: ready for running
Leavingseason Jul 25, 2020
184d289
add environment config files
Leavingseason Jul 25, 2020
8f37eb8
text changes
Leavingseason Jul 25, 2020
70f0c47
update notebook step1
Leavingseason Jul 25, 2020
eacac58
update notebook step2
Leavingseason Jul 25, 2020
9db5623
update notebook step3
Leavingseason Jul 25, 2020
a38528d
update notebook steps
Leavingseason Jul 27, 2020
aa6d9d9
add README
yueguoguo Jul 27, 2020
1949734
update readme
yueguoguo Jul 27, 2020
6238d41
Merge pull request #1164 from microsoft/le/kdd_tutorial
Leavingseason Jul 27, 2020
171d244
update notebooks; move functions to utils
Leavingseason Jul 27, 2020
681239e
update notebook step 3
Leavingseason Jul 27, 2020
c101ad7
update step1 and step5
Leavingseason Jul 31, 2020
5918168
fix LightGCN bug and update step2 step5
Leavingseason Jul 31, 2020
d840596
add reco_gpu_kdd.yaml
Leavingseason Jul 31, 2020
d7c0c0e
delete unused folder; add cpu yaml
Leavingseason Aug 24, 2020
1b40882
update reco_cpu_kdd.yaml
Leavingseason Aug 24, 2020
a2679a6
update yaml config: remove pytorch and fastai
Leavingseason Aug 24, 2020
950dfd8
Update README.md
Leavingseason Aug 25, 2020
a9aa7ed
add scripts for subgraph analysis
Leavingseason Aug 25, 2020
cc9c645
Update reco_gpu_kdd.yaml
miguelgfierro Aug 25, 2020
03d3b19
Merge branch 'staging' into kdd2020_tutorial
Leavingseason Sep 19, 2020
283a3bd
Merge branch 'staging' into kdd2020_tutorial
Leavingseason Sep 24, 2020
e884a69
update yaml
Leavingseason Sep 24, 2020
d854c39
Adjust structure; update comments
Leavingseason Sep 25, 2020
df9d996
add test cases
Leavingseason Sep 25, 2020
9394ede
add gensim to yaml env config
Leavingseason Sep 25, 2020
464f5fb
add license info
Leavingseason Sep 25, 2020
b55f3d3
move the tutorial to examples/07_tutorials
Leavingseason Sep 25, 2020
7058113
add yaml and sh files
Leavingseason Sep 25, 2020
e13cf67
update step4
Leavingseason Sep 25, 2020
2d7249d
update README
Leavingseason Sep 25, 2020
Commit eacac58f50641b6ea3aed348148d45ab6054d60a: update notebook step2
Leavingseason committed Jul 25, 2020
@@ -803,7 +803,7 @@
 "metadata": {
 "celltoolbar": "Tags",
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3.5",
 "language": "python",
 "name": "python3"
 },
@@ -817,7 +817,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.3"
+"version": "3.5.6"
 }
 },
 "nbformat": 4,
3 changes: 1 addition & 2 deletions scenarios/KDD2020-tutorial/step1_data_preparation.ipynb
@@ -14,7 +14,7 @@
 "metadata": {},
 "source": [
 "# Data manipulation\n",
-"This notebook provides all necessary steps to generate DKN's input dataset from the MAG COVID-19 raw dataset "
+"This notebook provides necessary steps to generate DKN's input dataset from the MAG COVID-19 raw dataset "
 ]
 },
 {
@@ -356,7 +356,6 @@
 }
 ],
 "source": [
-"\n",
 "split_train_valid_file(\n",
 " [Path_paper_pair_cocitation, Path_FirstAuthorPaperPair, Path_paper_pair_coreference],\n",
 " OutFile_dir_DKN\n",
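Note: the cell above calls split_train_valid_file to merge the co-citation, first-author, and co-reference pair files and split them into DKN training and validation sets. The tutorial's actual implementation lives in the repo's utils; the sketch below is only a plausible reading of that helper, and its valid_ratio and seed parameters are assumptions for illustration.

# A minimal sketch of what split_train_valid_file could look like; not the
# tutorial's actual code, and valid_ratio/seed are hypothetical parameters.
import os
import random

def split_train_valid_file(infile_list, out_dir, valid_ratio=0.1, seed=42):
    """Merge several pair files, shuffle, and write train/valid splits."""
    rng = random.Random(seed)
    lines = []
    for path in infile_list:
        with open(path, "r", encoding="utf-8") as f:
            lines.extend(f.readlines())
    rng.shuffle(lines)
    n_valid = int(len(lines) * valid_ratio)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "train.txt"), "w", encoding="utf-8") as f:
        f.writelines(lines[n_valid:])
    with open(os.path.join(out_dir, "valid.txt"), "w", encoding="utf-8") as f:
        f.writelines(lines[:n_valid])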
80 changes: 28 additions & 52 deletions scenarios/KDD2020-tutorial/step2_pretraining-embeddings.ipynb
@@ -9,6 +9,14 @@
 "<i>Licensed under the MIT License.</i>\n"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# Pretraining word and entity embeddings\n",
+"This notebook trains word embeddings and entity embeddings for DKN initializations."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -60,6 +68,13 @@
 "OutFile_dir_DKN = 'data_folder/my/DKN-training-folder'"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We use the word2vec algorithm implemented in Gensim (https://radimrehurek.com/gensim/models/word2vec.html) to generate word embeddings."
+]
+},
 {
 "cell_type": "code",
 "execution_count": 19,
@@ -82,21 +97,29 @@
 "\n",
 " print('start to train word embedding...', end=' ')\n",
 " my_sentences = MySentenceCollection(Path_sentences)\n",
-" model = Word2Vec(my_sentences, size=32, window=5, min_count=1, workers=4, iter=50)\n",
+" model = Word2Vec(my_sentences, size=32, window=5, min_count=1, workers=8, iter=30)\n",
 "\n",
 " model.save(OutFile_word2vec)\n",
 " model.wv.save_word2vec_format(OutFile_word2vec_txt, binary=False)\n",
 " print('\\tdone . ')\n",
 "\n",
 "Path_sentences = os.path.join(InFile_dir, 'sentence.txt')\n",
-"# train_word2vec(Path_sentences, OutFile_dir)\n",
 "\n",
 "t0 = time.time()\n",
 "train_word2vec(Path_sentences, OutFile_dir)\n",
 "t1 = time.time()\n",
 "print('time elapses: {0:.1f}s'.format(t1 - t0))"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We leverage a graph embedding model to encode entities into embedding vectors.\n",
+"<img src=\"https://recodatasets.blob.core.windows.net/kdd2020/images%2Fkg-embedding.JPG\" width=\"600\">\n",
+"We use an open-source implementation of TransE (https://github.com/thunlp/Fast-TransX) for generating knowledge graph embeddings:"
+]
+},
 {
 "cell_type": "code",
 "execution_count": 9,
@@ -122,10 +145,6 @@
 }
 ],
 "source": [
-"## some step in transE training\n",
-"\n",
-"## https://github.com/thunlp/Fast-TransX\n",
-"\n",
 "!bash ./run_transE.sh"
 ]
 },
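Note: the notebook delegates knowledge graph embedding training to Fast-TransX through run_transE.sh rather than implementing TransE in Python. As background only, here is a minimal numpy sketch of the scoring idea TransE optimizes (embed head h, relation r, and tail t so that h + r ≈ t); it is not taken from the Fast-TransX code.

# Illustrative TransE scoring function only; Fast-TransX itself is a separate
# C++ implementation, and this sketch is not its code.
import numpy as np

def transe_score(h, r, t, norm=1):
    """Lower is better: distance between the translated head (h + r) and tail t."""
    return np.linalg.norm(h + r - t, ord=norm)

h = np.array([0.1, 0.3]); r = np.array([0.2, -0.1]); t = np.array([0.3, 0.2])
print(transe_score(h, r, t))  # a small distance means the triple looks plausible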
@@ -137,54 +156,11 @@
 "source": []
 },
 {
-"cell_type": "code",
-"execution_count": 10,
+"cell_type": "markdown",
 "metadata": {},
-"outputs": [],
 "source": [
-"def gen_context_embedding(entity_file, context_file, kg_file):\n",
-" #load embedding_vec\n",
-" entity_index = 0\n",
-" entity_dict = {}\n",
-" fp_entity = open(entity_file, 'r')\n",
-" for line in fp_entity:\n",
-" linesplit = line.strip().split('\\t')[:EMBEDDING_LENGTH]\n",
-" linesplit = list(map(float, linesplit))\n",
-" entity_dict[str(entity_index)] = linesplit\n",
-" entity_index += 1\n",
-" fp_entity.close()\n",
-"\n",
-" #build neighbor for entity in entity_dict\n",
-" fp_kg = open(kg_file, 'r', encoding='utf-8')\n",
-" triple_num = fp_kg.readline()\n",
-" triples = fp_kg.readlines()\n",
-" kg_neighbor_dict = {}\n",
-" for triple in triples:\n",
-" linesplit = triple.strip().split(' ')\n",
-" head = linesplit[0]\n",
-" tail = linesplit[1]\n",
-" if head not in kg_neighbor_dict:\n",
-" kg_neighbor_dict[head] = set()\n",
-" kg_neighbor_dict[head].add(tail)\n",
-"\n",
-" if tail not in kg_neighbor_dict:\n",
-" kg_neighbor_dict[tail] = set()\n",
-" kg_neighbor_dict[tail].add(head)\n",
-" fp_kg.close()\n",
-"\n",
-" context_embeddings = np.zeros([entity_index , EMBEDDING_LENGTH])\n",
-"\n",
-" for entity in entity_dict:\n",
-" if entity in kg_neighbor_dict:\n",
-" context_entity = kg_neighbor_dict[entity]\n",
-" context_vecs = []\n",
-" for c_entity in context_entity:\n",
-" context_vecs.append(entity_dict[c_entity])\n",
-"\n",
-" context_vec = np.mean(np.asarray(context_vecs), axis=0)\n",
-" context_embeddings[int(entity)] = context_vec\n",
-"\n",
-" np.savetxt(context_file, context_embeddings, delimiter='\\t')"
+"DKN takes into consideration both the entity embeddings and their context embeddings.\n",
+"<img src=\"https://recodatasets.blob.core.windows.net/kdd2020/images/context-embedding.JPG\" width=\"600\">"
 ]
 },
 {
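Note: the Word2Vec call tuned in the diff above can be reproduced standalone. The sketch below uses the Gensim 3.x keyword names that appear in the notebook (size and iter; Gensim 4 renamed them to vector_size and epochs), with a toy corpus standing in for the tutorial's sentence.txt.

# Standalone Gensim 3.x sketch mirroring the notebook's updated hyperparameters;
# the toy sentences and output file name are placeholders.
from gensim.models import Word2Vec

sentences = [["knowledge", "graph", "embedding"], ["covid", "paper", "citation"]]
model = Word2Vec(sentences, size=32, window=5, min_count=1, workers=8, iter=30)
model.wv.save_word2vec_format("word2vec.txt", binary=False)  # plain-text vectors, as in the notebook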
45 changes: 44 additions & 1 deletion scenarios/KDD2020-tutorial/utils/task_helper.py
@@ -509,7 +509,50 @@ def format_word_embeddings(word_vecfile, word2id_file, np_file):
     with open(np_file, 'wb') as f:
         np.save(f, word_embeddings)
 
-
+def gen_context_embedding(entity_file, context_file, kg_file):
+    # load embedding_vec
+    entity_index = 0
+    entity_dict = {}
+    fp_entity = open(entity_file, 'r')
+    for line in fp_entity:
+        linesplit = line.strip().split('\t')[:EMBEDDING_LENGTH]
+        linesplit = list(map(float, linesplit))
+        entity_dict[str(entity_index)] = linesplit
+        entity_index += 1
+    fp_entity.close()
+
+    # build neighbor for entity in entity_dict
+    fp_kg = open(kg_file, 'r', encoding='utf-8')
+    triple_num = fp_kg.readline()
+    triples = fp_kg.readlines()
+    kg_neighbor_dict = {}
+    for triple in triples:
+        linesplit = triple.strip().split(' ')
+        head = linesplit[0]
+        tail = linesplit[1]
+        if head not in kg_neighbor_dict:
+            kg_neighbor_dict[head] = set()
+        kg_neighbor_dict[head].add(tail)
+
+        if tail not in kg_neighbor_dict:
+            kg_neighbor_dict[tail] = set()
+        kg_neighbor_dict[tail].add(head)
+    fp_kg.close()
+
+    context_embeddings = np.zeros([entity_index , EMBEDDING_LENGTH])
+
+    for entity in entity_dict:
+        if entity in kg_neighbor_dict:
+            context_entity = kg_neighbor_dict[entity]
+            context_vecs = []
+            for c_entity in context_entity:
+                context_vecs.append(entity_dict[c_entity])
+
+            context_vec = np.mean(np.asarray(context_vecs), axis=0)
+            context_embeddings[int(entity)] = context_vec
+
+    np.savetxt(context_file, context_embeddings, delimiter='\t')
+
 
 ######## data preparation for lightGCN
 def load_instance_file(
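Note: the relocated gen_context_embedding reads the TransE entity vectors, collects each entity's one-hop neighbors from the triple file, and writes the neighbor-averaged vectors as tab-separated context embeddings. A hypothetical call might look as follows; the file names are placeholders rather than paths fixed by the tutorial, and EMBEDDING_LENGTH is assumed to be defined at module level in task_helper.py (32 would match the word2vec dimensionality above).

# Hypothetical usage of the helper after its move into utils/task_helper.py;
# the file names below are illustrative, not mandated by the tutorial.
from utils.task_helper import gen_context_embedding

gen_context_embedding(
    entity_file="entity2vec.vec",    # one tab-separated TransE embedding per line
    context_file="context2vec.vec",  # output: neighbor-averaged context embeddings
    kg_file="train2id.txt",          # first line is the triple count, then "head tail relation"
)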