Pre-process the Language and Knowledge Base data for loading into entity-fishing.
The sub-module nerd-data pre-processes the Wikidata JSON and Wikipedia XML dumps to create the compiled data used by entity-fishing, a machine learning tool for extracting and disambiguating Wikidata entities in text and PDF at scale.
The pre-processing is an adaptation of WikipediaMiner 2.0 for the XML dump processing, which relies on Hadoop. The main modifications include the use of the Sweble MediaWiki document parser for Wikipedia pages (the most comprehensive, reliable and fast MediaWiki parser according to our tests, apart from MediaWiki itself), a complete review of the compiled statistics, the processing of the Wikidata dump, the use of LMDB to avoid distributed data, additional extractions related to multilinguality, and various speed optimizations.
The Wikipedia pre-processing supports the current Wikipedia dumps (2022) and was successfully tested with the English, French, German, Italian, Spanish, Arabic, Mandarin, Russian, Japanese, Portuguese and Farsi XML dumps. The Wikipedia XML dumps and additional required files are available at the Wikimedia Downloads page, as is the Wikidata JSON dump.
entity-fishing first needs to be installed and built on the system, without the knowledge-base and language data:
git clone https://github.com/kermitt2/entity-fishing
cd entity-fishing
./gradlew clean build -x test
The -x test flag is important when building: it skips the tests, because no knowledge-base and language resource data are available for the tests yet.
Then install and build GRISP:
git clone https://github.com/kermitt2/grisp
cd grisp
mvn clean install
Note: the current latest versions of GRISP and entity-fishing are 0.0.6.
A script is available to:
- download the different resources needed from Wikidata and Wikipedia for a set of specified languages
- create csv translation files between languages
- generate Wikidata property labels for each language
- create the Wikidata knowledge base backbone and the language-specific mappings with Wikidata entities
The script has been tested on a Linux setup, but it is likely to work also on MacOS. To run the script:
cd grisp/scripts/
./wikipedia-resources.sh [install path of GRISP] [storage path of the data resources]
For example:
./wikipedia-resources.sh /home/lopez/grisp/ /media/lopez/data/wikipedia/latest/
The above-mentioned steps are carried out successively by the script. By default all the languages will be covered, but you can restrict the processing to a subset of languages by modifying the script at the following line:
declare -a languages=("en" "de" "fr" "it" "es" "ar" "zh" "ja" "ru" "pt" "fa" "uk" "sv" "hi" "bn")
Note that at least English ("en") is mandatory to further run entity-fishing.
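For instance, a reduced language set would keep the mandatory "en" plus any additional languages you actually need (the choice of "fr" here is purely illustrative):

```shell
# Illustrative subset: "en" is mandatory for entity-fishing,
# "fr" stands in for any additional language you want to cover.
declare -a languages=("en" "fr")
echo "Processing ${#languages[@]} languages: ${languages[@]}"
# prints: Processing 2 languages: en fr
```

Only the directories for the selected languages will then be created under the data resources path.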
Be aware that the data path must have enough storage: as of April 2022, 74GB are needed for the Wikidata dump and 70GB for all the language resources. To accommodate all resources, including the next Hadoop processing step, plan for 200GB for all the languages.
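As a quick pre-flight sketch, you can check the free space on the storage path before launching the downloads (the path below is the example storage path used earlier; adjust it to your setup):

```shell
# Check free space on the storage path before downloading the resources
# (plan for ~200GB for all languages, figure as of April 2022).
TARGET="/media/lopez/data/wikipedia/latest/"   # example path, adjust
avail_kb=$(df -Pk "$TARGET" | awk 'NR==2 {print $4}')
echo "Available: $((avail_kb / 1024 / 1024)) GB on $TARGET"
```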
Once all the required resources have been downloaded via the provided script (see above), we can run the pre-processing of the Wikipedia dumps.
The parsing and processing of the Wikipedia XML article dump files is computationally expensive; it has to be parallelized, and we use a Hadoop process for this purpose. A pseudo-distributed mode (running the process on one machine with several CPU cores) is enough for a reasonable processing time. A "real" distributed mode has not been tested so far and is thus currently not supported.
Create the Hadoop job jar:
cd grisp/nerd-data
mvn clean package
Then see the instructions under nerd-data/doc/hadoop.md for running the Hadoop job and obtaining the csv result files.
This processing is an adaptation and optimization of the WikipediaMiner 2.0 XML dump processing. It enables the support of the latest Wikipedia dump files. The processing is considerably faster than with WikipediaMiner, and a single server is enough for processing the latest XML dumps in a reasonable time. For the December 2016 English Wikipedia XML dump: around 7 hours 30 minutes. For the December 2016 French and German Wikipedia XML dumps: around 2 hours 30 minutes (in pseudo-distributed mode, on one server with an Intel Core i7-4790K CPU 4.00GHz Haswell, 16GB memory, 4 cores, 8 threads, SSD).
We think it is still possible to significantly improve the processing time, lower the memory consumption, and avoid Hadoop completely, simply by optimizing the processing for a common single multi-threaded machine. But given that the current state of the library gives satisfactory performance, we leave these improvements for the future if necessary.
Here is how the final data tree should look from the root directory (for 3 languages; additional languages follow the same pattern), ready to be loaded and further optimized into embedded databases by entity-fishing:
.
├── de
│ ├── articleParents.csv
│ ├── categoryParents.csv
│ ├── childArticles.csv
│ ├── childCategories.csv
│ ├── dewiki-latest-langlinks.sql.gz
│ ├── dewiki-latest-page_props.sql.gz
│ ├── dewiki-latest-pages-articles-multistream.xml.bz2
│ ├── label.csv
│ ├── page.csv
│ ├── pageLabel.csv
│ ├── pageLinkIn.csv
│ ├── pageLinkOut.csv
│ ├── redirectSourcesByTarget.csv
│ ├── redirectTargetsBySource.csv
│ ├── stats.csv
│ ├── translations.csv
│ ├── wikidata-properties.json
│ └── wikidata.txt
├── en
│ ├── articleParents.csv
│ ├── categoryParents.csv
│ ├── childArticles.csv
│ ├── childCategories.csv
│ ├── enwiki-latest-langlinks.sql.gz
│ ├── enwiki-latest-page_props.sql.gz
│ ├── enwiki-latest-pages-articles-multistream.xml.bz2
│ ├── label.csv
│ ├── page.csv
│ ├── pageLabel.csv
│ ├── pageLinkIn.csv
│ ├── pageLinkOut.csv
│ ├── redirectSourcesByTarget.csv
│ ├── redirectTargetsBySource.csv
│ ├── stats.csv
│ ├── translations.csv
│ ├── wikidata-properties.json
│ └── wikidata.txt
├── fr
│ ├── articleParents.csv
│ ├── categoryParents.csv
│ ├── childArticles.csv
│ ├── childCategories.csv
│ ├── frwiki-latest-langlinks.sql.gz
│ ├── frwiki-latest-page_props.sql.gz
│ ├── frwiki-latest-pages-articles-multistream.xml.bz2
│ ├── label.csv
│ ├── page.csv
│ ├── pageLabel.csv
│ ├── pageLinkIn.csv
│ ├── pageLinkOut.csv
│ ├── redirectSourcesByTarget.csv
│ ├── redirectTargetsBySource.csv
│ ├── stats.csv
│ ├── translations.csv
│ ├── wikidata-properties.json
│ └── wikidata.txt
└── wikidata
    ├── wikidataIds.csv
    └── latest-all.json.bz2
Note:
- it is expected to have 15 files in each language-specific directory, plus 3 Wikipedia dump files (the .bz2 and .gz files),
- the full Wikipedia article dump for each language must be present in the language-specific directory (e.g. enwiki-latest-pages-articles-multistream.xml.bz2, enwiki-latest-pages-articles-multistream.xml.gz or enwiki-latest-pages-articles-multistream.xml); these dumps are required to generate definitions for entities, create training data and compute additional entity embeddings; the dump file can be bz2-compressed, gzip-compressed or uncompressed, and all these variants should be loaded appropriately by entity-fishing,
- the Wikidata identifiers csv file wikidataIds.csv and the full Wikidata JSON dump file latest-all.json.bz2 are under a wikidata sub-directory, while the language-specific Wikidata mapping files wikidata.txt and wikidata-properties.json are installed in each language-specific sub-directory,
- in entity-fishing, the loading of these files is automatic when building the project or starting the service (if not already loaded); be sure to indicate the path to the above generated files in the entity-fishing config files.
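The expectations above can be checked with a small sketch before launching the entity-fishing loading step (the data path and the language list are examples to adapt to your setup):

```shell
#!/bin/bash
# Sanity check: each language directory should contain the 15 generated
# files listed in the tree above. DATA_DIR and the language list below
# are examples; adjust them to your setup.
DATA_DIR="${1:-/media/lopez/data/wikipedia/latest}"
expected=(articleParents.csv categoryParents.csv childArticles.csv
  childCategories.csv label.csv page.csv pageLabel.csv pageLinkIn.csv
  pageLinkOut.csv redirectSourcesByTarget.csv redirectTargetsBySource.csv
  stats.csv translations.csv wikidata-properties.json wikidata.txt)
for lang in en de fr; do
  for f in "${expected[@]}"; do
    [ -e "$DATA_DIR/$lang/$f" ] || echo "missing: $lang/$f"
  done
done
```

The script only reports missing files; the Wikipedia dump files (.bz2/.gz) are not checked since their exact names depend on the dump date and compression.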
We are considering generating more KB data to be mapped (GeoNames, GeoSpecies, etc.) and better exploiting the Wikidata labels and statements.
Many thanks to David Milne for the Wikipedia XML dump processing. The present pre-processing of the Wikipedia data is originally a fork of a part of his project.
GRISP is distributed under GPL 3.0 license.
Contact: Patrice Lopez ([email protected])