shyamupa/xling-el: PyTorch model for cross-lingual entity linking.

Code for running the entity linking model. This is part of the code for the xelms project.

Requirements

1. pytorch (0.2.0+21f8ad4): installed from source and patched for sparse tensor operations (instructions below).
2. python3.
3. cogcomp-nlpy.
4. Download the resources and trained models here and place them in the folder xling-el/data. Right now, pre-trained models are available for German, Spanish, French, Italian, and Chinese.

Resources for Candidate Generation

First, set up candidate generation and other resources as described in the wikidump_preprocessing and wiki_candgen projects.
A MongoDB daemon (mongod) must be running. It stores the resources generated by wiki_candgen for fast (and parallel) access.

Note: These resources are included in the resources directory downloaded in step 4 of the Requirements above, so you should not need to regenerate them unless you plan to use a newer Wikipedia dump or a larger knowledge base.
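Before running inference, it can help to confirm that the mongo daemon is actually reachable. The sketch below only checks that a TCP connection can be opened. The host and port are assumptions (MongoDB's default port is 27017; this README does not state which address the code expects), and the function name is hypothetical.

```python
import socket


def mongo_is_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Assumption: mongod listens on localhost:27017 (MongoDB's default port).
    Adjust both values if your daemon is configured differently.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Calling `mongo_is_reachable()` returns `False` when nothing is listening on that port. A successful connection shows only that some process accepts connections there, not that the wiki_candgen resources have been loaded.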

Patching Pytorch for Sparse Tensor Operations

This is best done in a new conda environment.

First, check out the sparse_patch branch of the shyamupa/pytorch repository:

git clone https://github.com/shyamupa/pytorch
cd pytorch
git checkout sparse_patch

Install the patched code from source using the following commands:

export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

# Install basic dependencies
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c mingfeima mkldnn
# Change to the root of the patched checkout (the directory is named "pytorch" if cloned as above)
cd pytorch_patched
python setup.py install

Ensure that the patched pytorch was installed successfully:

>>> import torch
>>> torch.__version__
'0.2.0+43662e7'
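The same check can be scripted. The sketch below splits a version string into its release part and its commit-hash suffix. It deliberately does not compare the hash, because the suffix depends on the commit that was built (the hash listed under Requirements differs from the one printed above). Both function names are hypothetical.

```python
def split_build_version(version):
    """Split a version such as '0.2.0+43662e7' into ('0.2.0', '43662e7')."""
    release, _, local = version.partition("+")
    return release, local


def looks_like_source_build(version, expected_release="0.2.0"):
    """True if the release matches and a '+<hash>' suffix is present."""
    release, local = split_build_version(version)
    return release == expected_release and local != ""
```

In an environment where the patched build is installed, `looks_like_source_build(torch.__version__)` would be the call to make. A matching version string does not prove that the sparse patch itself is present.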

Mention Detection using NER

For German, Spanish, French, and Italian, download the relevant spaCy NER models:

pip install spacy
python -m spacy download de_core_news_sm
python -m spacy download es_core_news_md
python -m spacy download fr_core_news_md
python -m spacy download it_core_news_sm
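To illustrate what these models provide, here is a minimal sketch that maps a language code to the model downloaded above and lists the named entities spaCy finds in a text. It is not the repository's own mention detection code. The language codes other than `de` and both function names are assumptions made for the example.

```python
# Model names are taken from the download commands above.
# Assumption: languages are identified by two-letter codes like "de".
SPACY_MODELS = {
    "de": "de_core_news_sm",
    "es": "es_core_news_md",
    "fr": "fr_core_news_md",
    "it": "it_core_news_sm",
}


def spacy_model_for(lang):
    """Return the spaCy model name for a language code."""
    try:
        return SPACY_MODELS[lang]
    except KeyError:
        raise ValueError(f"no spaCy model listed for language {lang!r}") from None


def detect_mentions(text, lang):
    """Return (text, start_token, end_token, ner_label) for each entity."""
    import spacy  # imported lazily so the mapping works without spaCy

    nlp = spacy.load(spacy_model_for(lang))
    doc = nlp(text)
    # ent.start / ent.end are token offsets; ent.end is exclusive.
    return [(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents]
```

Chinese is left out of the mapping on purpose, since it is handled through Stanford CoreNLP rather than spaCy.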

For Chinese, download the Stanford CoreNLP jar and the Chinese model jar, and place them in a stanford_jars directory.

$ ls stanford_jars/
stanford-corenlp-full-2018-10-05
$ ls stanford_jars/stanford-corenlp-full-2018-10-05
...
...
stanford-chinese-corenlp-2018-10-05-models.jar
...

Then set the bash environment variable CORENLP_HOME to path/to/stanford_jars/stanford-corenlp-full-2018-10-05.

export CORENLP_HOME=path/to/stanford_jars/stanford-corenlp-full-2018-10-05
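A quick way to catch a misconfigured setup is to check that CORENLP_HOME points at a directory containing the Chinese model jar. The sketch below does this. The glob pattern is inferred from the directory listing above, and the function name is hypothetical.

```python
import os
from pathlib import Path


def find_chinese_models(corenlp_home=None):
    """Return the Chinese model jars found under CORENLP_HOME (may be empty).

    Assumption: the model jar is named like
    stanford-chinese-corenlp-<date>-models.jar, as in the listing above.
    """
    home = corenlp_home or os.environ.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    root = Path(home)
    if not root.is_dir():
        raise RuntimeError(f"CORENLP_HOME is not a directory: {root}")
    return sorted(root.glob("stanford-chinese-corenlp-*-models.jar"))
```

An empty result means the directory exists but the Chinese model jar was not placed inside it.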

Running the Model

To run the model, use the command:

./run_inference_on_doc.sh <lang> <infile> <outfile>

For instance, to run on the German document test_docs/de_doc.txt:

./run_inference_on_doc.sh de test_docs/de_doc.txt test_docs/de_doc_output.txt

The JSON output will be written to test_docs/de_doc_output.txt.
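When several documents need to be processed, the script can be driven from Python. This is a minimal sketch: it assumes the working directory is the repository root, so that `./run_inference_on_doc.sh` resolves, and the function names are hypothetical.

```python
import subprocess


def inference_command(lang, infile, outfile, script="./run_inference_on_doc.sh"):
    """Build the argument list: <script> <lang> <infile> <outfile>."""
    return [script, lang, str(infile), str(outfile)]


def run_inference(lang, infile, outfile):
    """Run the script once; raises CalledProcessError on a non-zero exit."""
    subprocess.run(inference_command(lang, infile, outfile), check=True)
```

For example, `run_inference("de", "test_docs/de_doc.txt", "test_docs/de_doc_output.txt")` is equivalent to the command shown above. Whether the script reports failures through its exit code has not been verified here.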

Output

The output file is a JSON-serialized text annotation with a view named NEURAL_XEL_<lang>. The view consists of a list of the constituents that have been linked to a Wikipedia title. Below is the output for the German test document provided in the repo:

...
"viewName": "NEURAL_XEL_de",
...
...
"constituents": [
      {
       "end": 2,
       "label": "en.wikipedia.org/wiki/Angela_Merkel",
       "score": 0.5128146075318596,
       "start": 0,
       "tokens": "Angela Merkel"
      },
      {
       "end": 5,
       "label": "NULLTITLE",
       "score": 0.05000000074505806,
       "start": 4,
       "tokens": "Elim-Krankenhaus"
      },
      ...

The label field of each constituent is the predicted Wikipedia entity for the span identified by the start and end token indices (the end index is exclusive, as the example shows). A label of NULLTITLE means that the named entity found by the mention detection system could not be linked to any entity.
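The linked mentions can be pulled out of the output file with a few lines of Python. Because the snippet above elides most of the JSON, this sketch does not assume an exact nesting. It searches the document for an object whose viewName matches and gathers every constituents list beneath it. The function names are hypothetical. The field names (viewName, constituents, tokens, label, score) are the ones shown above.

```python
import json

NULL_LABEL = "NULLTITLE"  # label used for mentions that could not be linked


def _collect_lists(obj, key):
    """Gather the items of every list stored under `key` anywhere in obj."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key and isinstance(v, list):
                found.extend(v)
            else:
                found.extend(_collect_lists(v, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(_collect_lists(item, key))
    return found


def constituents_of_view(obj, view_name):
    """Return the constituents below the first-level match of view_name."""
    if isinstance(obj, dict):
        if obj.get("viewName") == view_name:
            # Stop descending here so nested copies are not counted twice.
            return _collect_lists(obj, "constituents")
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return []
    found = []
    for child in children:
        found.extend(constituents_of_view(child, view_name))
    return found


def linked_mentions(doc, lang):
    """Return (tokens, label, score) for every mention that was linked."""
    view = f"NEURAL_XEL_{lang}"
    return [(c["tokens"], c["label"], c["score"])
            for c in constituents_of_view(doc, view)
            if c["label"] != NULL_LABEL]


def load_linked_mentions(path, lang):
    """Read an output file and return its linked mentions."""
    with open(path, encoding="utf-8") as fh:
        return linked_mentions(json.load(fh), lang)
```

For the German example, `load_linked_mentions("test_docs/de_doc_output.txt", "de")` would keep the Angela Merkel constituent and drop the Elim-Krankenhaus one, whose label is NULLTITLE.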
