A super-fast lookup service for canonical names based on redis and configurable fallback upstream sources (currently Aleph and Wikipedia).
juditha wants to solve the noise/garbage problem that occurs when working with Named Entity Recognition (NER). Given the availability of huge lists of known names, such as company registries or lists of persons of interest, one can canonicalize NER results against this service to check whether they are known.
The implementation uses a pre-populated redis cache that can fall back to other sources.
pip install juditha
docker run -p 6379:6379 redis
printf "Jane Doe\nAlice\n" | juditha load
juditha lookup "jane doe"
"Jane Doe"
To match more fuzzily, lower the threshold (default 0.97):
juditha lookup "doe, jane" --threshold 0.5
"Jane Doe"
cat entities.ftm.json | juditha load --from-entities
juditha load -i s3://my_bucket/names.txt
juditha load -i https://data.ftm.store/eu_authorities/entities.ftm.json --from-entities
Following the nomenklatura specification, a dataset JSON config needs names.txt or entities.ftm.json in its resources.
juditha load-dataset https://data.ftm.store/eu_authorities/index.json
juditha load-catalog https://data.ftm.store/investigraph/catalog.json
from juditha import lookup
assert lookup("jane doe") == "Jane Doe"
assert lookup("doe, jane") is None
assert lookup("doe, jane", threshold=0.5) == "Jane Doe"
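The effect of the threshold in the assertions above can be illustrated with a small, self-contained sketch. This is not juditha's actual scoring: it assumes a plain character-based similarity from Python's standard `difflib`, and `similarity` and `best_match` are hypothetical helper names. The scores it produces will differ from juditha's.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(
    query: str, names: Iterable[str], threshold: float = 0.97
) -> Optional[str]:
    """Return the most similar known name, or None if it scores below threshold."""
    best_name, best_score = None, 0.0
    for name in names:
        score = similarity(query, name)
        if score > best_score:
            best_name, best_score = name, score
    # The threshold acts as a gate: a weak best candidate is rejected entirely.
    return best_name if best_score >= threshold else None


known_names = ["Jane Doe", "Alice"]
strict_result = best_match("jane do", known_names)  # default threshold
relaxed_result = best_match("jane do", known_names, threshold=0.5)
```

With the default threshold a near miss is rejected, while a lower threshold accepts the closest known name at the cost of more false positives.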
uvicorn --port 8000 juditha.api:app --workers 8
Send HEAD requests to check whether a name is known:
curl -I "http://localhost:8000/jane%20doe"
HTTP/1.1 200 OK
curl -I "http://localhost:8000/John"
HTTP/1.1 404 Not Found
A GET request returns the canonical name:
curl "http://localhost:8000/doe,%20jane?threshold=0.5"
Jane Doe
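The same requests can be made from Python with the standard library alone. The sketch below assumes an API instance running at `http://localhost:8000` as in the curl examples; `build_lookup_url` and `is_known` are hypothetical helper names and not part of juditha. The comma is percent-encoded as `%2C`, which decodes to the same path as the unencoded form used with curl.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_lookup_url(
    base_url: str, name: str, threshold: Optional[float] = None
) -> str:
    """Build the lookup URL: the name goes into the path, the threshold into the query."""
    url = f"{base_url.rstrip('/')}/{quote(name, safe='')}"
    if threshold is not None:
        url += "?" + urlencode({"threshold": threshold})
    return url


def is_known(base_url: str, name: str) -> bool:
    """Send a HEAD request; 200 means the name is known, 404 means it is not."""
    request = Request(build_lookup_url(base_url, name), method="HEAD")
    try:
        with urlopen(request) as response:
            return response.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other status is unexpected and left to the caller
```

Calling `is_known("http://localhost:8000", "jane doe")` requires the server started with uvicorn above to be running.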
Set the redis endpoint via an environment variable:
REDIS_URL=redis://localhost:6379
Create a YAML config:
sources:
  - klass: aleph
    config:
      host: https://aleph.investigativedata.org
      # api_key: ...
  - klass: aleph
    config:
      host: https://aleph.occrp.org
      # api_key: ...
  - klass: wikipedia
    config:
      url: https://de.wikipedia.org
Store this as a file (e.g. config.yml) and use it via env vars:
JUDITHA_CONFIG=config.yml juditha lookup "Juditha Dommer"
If a name is not found in redis, juditha queries the fallback sources in the given order. The results are stored in redis for the next call.
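The cache-then-fallback order described above can be sketched as follows. This is a minimal illustration, not juditha's implementation: a plain dict stands in for redis, each source is modelled as a hypothetical callable that returns a canonical name or None, and the lowercase cache key is an assumption.

```python
from typing import Callable, Iterable, MutableMapping, Optional

# Hypothetical type: a source maps a queried name to a canonical name or None.
Source = Callable[[str], Optional[str]]


def lookup_with_fallback(
    name: str,
    cache: MutableMapping[str, str],
    sources: Iterable[Source],
) -> Optional[str]:
    """Check the cache first, then try each source in the given order."""
    key = name.strip().lower()  # assumed normalization, for illustration only
    if key in cache:
        return cache[key]
    for source in sources:
        result = source(name)
        if result is not None:
            cache[key] = result  # store the hit so the next call skips the sources
            return result
    return None


calls = []


def empty_source(name: str) -> Optional[str]:
    calls.append("empty")
    return None


def known_source(name: str) -> Optional[str]:
    calls.append("known")
    return "Juditha Dommer" if name.lower() == "juditha dommer" else None


cache = {}
first = lookup_with_fallback("juditha dommer", cache, [empty_source, known_source])
second = lookup_with_fallback("juditha dommer", cache, [empty_source, known_source])
```

The second call is answered from the cache, so neither source is queried again.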
The juditha client can use the remote API endpoint of a deployed juditha instance:
JUDITHA=https://juditha.ftm.store juditha lookup "HIMATIC EXPLOTACIONES SL"
from juditha import Juditha
j = Juditha("https://juditha.ftm.store")
assert j.lookup("HIMATIC EXPLOTACIONES SL") is not None
Juditha Dommer was the daughter of a coppersmith and raised seven children, while her husband Johann Pachelbel wrote a canon.