Merge v3.0.0 into master #6

Merged 84 commits into master from v3.0.0 on Sep 10, 2021
84 commits
e116857
Treat topic matches of two subwords within a word as word-level rathe…
richardpaulhudson Feb 3, 2021
9be5069
Frequency factor
richardpaulhudson Feb 5, 2021
1a3c970
different_match_cutoff_score
richardpaulhudson Feb 5, 2021
4f81369
Sorted out German participles
richardpaulhudson Feb 8, 2021
cb2ff9c
Reverse dependency matching
richardpaulhudson Feb 9, 2021
be83572
Fixed bug; new method to get word match by search phrase word index
richardpaulhudson Feb 10, 2021
8868731
Fixed German self-reference bug
richardpaulhudson Feb 11, 2021
f4c63d1
Improved error output
richardpaulhudson Feb 11, 2021
84b5fa1
Fixed bug with wrong subword being considered within structural matches
richardpaulhudson Feb 19, 2021
f73d21a
Upgrade to Spacy 3, test conversion not yet complete
richardpaulhudson May 5, 2021
f849567
Most but not all tests passing again
richardpaulhudson May 6, 2021
7e87078
All pre-existing tests now passing with Spacy 3
richardpaulhudson May 7, 2021
bd86194
Intermediate state during refactoring, all tests passing except Multipr.
richardpaulhudson Jul 7, 2021
d69e7ab
Intermediate state during refactoring, all tests passing except Multipr.
richardpaulhudson Jul 7, 2021
1f529d5
Intermediate state during refactoring, all tests passing except Multipr.
richardpaulhudson Jul 8, 2021
5a370a6
Intermediate state during refactoring, all tests passing except Multipr.
richardpaulhudson Jul 8, 2021
d65db7e
Redesign using workers, seems to work, regression tests do not yet run
richardpaulhudson Jul 9, 2021
6d3fcc7
Minor changes
richardpaulhudson Jul 9, 2021
33a8fe5
Removed ThreadsafeContainer
richardpaulhudson Jul 9, 2021
b6ec111
en and de tests passing again
richardpaulhudson Jul 12, 2021
846c731
manager, serialization and word level matching tests passing
richardpaulhudson Jul 12, 2021
44960fe
Pre-existing tests passing again
richardpaulhudson Jul 13, 2021
16f1374
spaCy 3.1; de tests not passing yet
richardpaulhudson Jul 15, 2021
6eaff2d
spaCy 3.1: all tests now passing
richardpaulhudson Jul 15, 2021
4039b47
Corrected execution order in multithreading test
richardpaulhudson Jul 15, 2021
7f73caa
German coreference tests; improved subject/object recognition in German
richardpaulhudson Jul 16, 2021
d3bdf0d
Coreference combined with subwords
richardpaulhudson Jul 16, 2021
8ab5873
New semantics for extracted_word
richardpaulhudson Jul 19, 2021
5bfff45
Suppress relation matches where subwords in search text and document
richardpaulhudson Jul 19, 2021
dc92833
Frequency factors driving relation- and embedding-based topic matching
richardpaulhudson Jul 20, 2021
369fc7a
Corpus-wide indexing
richardpaulhudson Jul 20, 2021
0b5ff9e
First part of question word processing
richardpaulhudson Aug 2, 2021
2120c45
Question word processing
richardpaulhudson Aug 2, 2021
c01a864
Question word processing
richardpaulhudson Aug 3, 2021
fab989e
Question word processing, preexisting tests passing, no new tests yet
richardpaulhudson Aug 3, 2021
ace7f87
Corrected multithreading test
richardpaulhudson Aug 3, 2021
be8c2c3
Question word processing, preexisting tests running, no new tests yet
richardpaulhudson Aug 3, 2021
61caafa
Corrections
richardpaulhudson Aug 3, 2021
c2de7f4
Correction to embedding-based matching reporting with subwords
richardpaulhudson Aug 4, 2021
22dd8c7
Question word processing, preexisting tests running, no new tests yet
richardpaulhudson Aug 4, 2021
97f1095
entity_embedding matching; Question word processing - first tests
richardpaulhudson Aug 4, 2021
48a0d09
Improvements and more tests
richardpaulhudson Aug 4, 2021
a466176
Updated license date
richardpaulhudson Aug 4, 2021
3b3c8e7
More tests and corrections
richardpaulhudson Aug 5, 2021
51da728
Further corrections
richardpaulhudson Aug 5, 2021
00071c1
English question word tests complete
richardpaulhudson Aug 5, 2021
aa0fec7
Correction
richardpaulhudson Aug 5, 2021
28c9098
More German tests
richardpaulhudson Aug 5, 2021
d5932d0
Correction
richardpaulhudson Aug 5, 2021
f5a40ea
Added English question test
richardpaulhudson Aug 5, 2021
285ac52
Improvements
richardpaulhudson Aug 6, 2021
bb35aa0
All question tests
richardpaulhudson Aug 6, 2021
267537a
Additional question rules and tests
richardpaulhudson Aug 24, 2021
d66ae27
Redid consoles
richardpaulhudson Aug 27, 2021
86b9a34
Redid examples
richardpaulhudson Aug 27, 2021
5c706d4
Updated subword rules to reflect more mature spaCy model
richardpaulhudson Aug 30, 2021
8c3fe6d
Improvements to question answering
richardpaulhudson Aug 30, 2021
c0a7819
Another improvement
richardpaulhudson Aug 30, 2021
a18a85c
Further improvements to question handling
richardpaulhudson Aug 30, 2021
2c9cd5f
Further improvements to English question handling
richardpaulhudson Aug 30, 2021
25849e2
English question examples finished
richardpaulhudson Aug 31, 2021
604ac63
More question answering improvements
richardpaulhudson Aug 31, 2021
99e8faa
Accept hyphens in question answers
richardpaulhudson Aug 31, 2021
44d97c8
German question examples finished
richardpaulhudson Aug 31, 2021
1b08270
Final touches to example scripts
richardpaulhudson Aug 31, 2021
54ef7db
Code corrections
richardpaulhudson Sep 6, 2021
6b2ee1e
Correction
richardpaulhudson Sep 6, 2021
8daafb7
Further correction
richardpaulhudson Sep 6, 2021
da7fcc2
Naming improvement
richardpaulhudson Sep 7, 2021
93c2945
saved (partially complete)
richardpaulhudson Sep 8, 2021
be1b36b
First draft of new README.md
richardpaulhudson Sep 8, 2021
07df2f8
Corrections
richardpaulhudson Sep 8, 2021
8b91c14
Corrections
richardpaulhudson Sep 8, 2021
b06dba5
Corrections
richardpaulhudson Sep 8, 2021
7b84175
Corrections
richardpaulhudson Sep 8, 2021
fd02ae9
Corrections
richardpaulhudson Sep 9, 2021
42b1d82
Increased TIMEOUT_SECONDS
richardpaulhudson Sep 9, 2021
7cc728e
Increased TIMEOUT_SECONDS
richardpaulhudson Sep 9, 2021
8ee3490
Comment about model resource requirements
richardpaulhudson Sep 10, 2021
7a8dc15
Added MANIFEST.in
richardpaulhudson Sep 10, 2021
ff9e836
Correction to MANIFEST.in
richardpaulhudson Sep 10, 2021
046d775
Updated SHORTREADME.md
richardpaulhudson Sep 10, 2021
3c06b78
Correction
richardpaulhudson Sep 10, 2021
3024df1
Note on installation
richardpaulhudson Sep 10, 2021
129 changes: 129 additions & 0 deletions .gitignore
@@ -0,0 +1,129 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
-Copyright 2019-2020 msg systems ag
+Copyright 2019-2021 msg systems ag

The Holmes library is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,4 @@
include SHORTREADME.md
global-include *.cfg
global-include *.csv
global-include LICENSE
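(For readers unfamiliar with Python packaging: MANIFEST.in tells setuptools which extra files to bundle into the source distribution, so the lines above presumably exist to ship SHORTREADME.md, likely used as the PyPI description, plus the language-specific .cfg and .csv resources and the LICENSE alongside the code.)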
1,085 changes: 434 additions & 651 deletions README.md

Large diffs are not rendered by default.

32 changes: 17 additions & 15 deletions SHORTREADME.md
@@ -1,49 +1,51 @@
-**Holmes** is a Python 3 library (tested with version 3.7.7) that supports a number of
-use cases involving information extraction from English and German texts. In all use cases, the information extraction
-is based on analysing the semantic relationships expressed by the component parts of each sentence:
+**Holmes** is a Python 3 library (tested with version 3.9.5) running on top of
+[spaCy](https://spacy.io/) (tested with version 3.1.2) that supports a number of use cases
+involving information extraction from English and German texts. In all use cases, the information
+extraction is based on analysing the semantic relationships expressed by the component parts of
+each sentence:

-- In the [chatbot](https://github.com/msg-systems/holmes-extractor/#getting-started) use case, the system is configured using one or more **search phrases**.
+- In the [chatbot](https://github.com/msg-systems/holmes-extractor#getting-started) use case, the system is configured using one or more **search phrases**.
Holmes then looks for structures whose meanings correspond to those of these search phrases within
a searched **document**, which in this case corresponds to an individual snippet of text or speech
entered by the end user. Within a match, each word with its own meaning (i.e. that does not merely fulfil a grammatical function) in the search phrase
corresponds to one or more such words in the document. Both the fact that a search phrase was matched and any structured information the search phrase extracts can be used to drive the chatbot.

-- The [structural extraction](https://github.com/msg-systems/holmes-extractor/#structural-extraction) use case uses exactly the same
-[structural matching](https://github.com/msg-systems/holmes-extractor/#how-it-works-structural-matching) technology as the chatbot use
+- The [structural extraction](https://github.com/msg-systems/holmes-extractor#structural-extraction) use case uses exactly the same
+[structural matching](https://github.com/msg-systems/holmes-extractor#how-it-works-structural-matching) technology as the chatbot use
case, but searching takes place with respect to a pre-existing document or documents that are typically much
-longer than the snippets analysed in the chatbot use case, and the aim to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to
+longer than the snippets analysed in the chatbot use case, and the aim is to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to
take over a second company. The identities of the companies concerned could then be stored in a database.

-- The [topic matching](https://github.com/msg-systems/holmes-extractor/#topic-matching) use case aims to find passages in a document or documents whose meaning
+- The [topic matching](https://github.com/msg-systems/holmes-extractor#topic-matching) use case aims to find passages in a document or documents whose meaning
is close to that of another document, which takes on the role of the **query document**, or to that of a **query phrase** entered ad-hoc by the user. Holmes extracts a number of small **phraselets** from the query phrase or
-query document, matches the documents being searched against each phraselet, and conflates the results to find the
-most relevant passages within the documents. Because there is no strict requirement that every word with its own
-meaning in the query document match a specific word or words in the searched documents, more matches are found
+query document, matches the documents being searched against each phraselet, and conflates the results to find
+the most relevant passages within the documents. Because there is no strict requirement that every
+word with its own meaning in the query document match a specific word or words in the searched documents, more matches are found
than in the structural extraction use case, but the matches do not contain structured information that can be
used in subsequent processing. The topic matching use case is demonstrated by [a website allowing searches within
the Harry Potter corpus (for English) and around 350 traditional stories (for German)](http://holmes-demo.xt.msg.team/).

-- The [supervised document classification](https://github.com/msg-systems/holmes-extractor/#supervised-document-classification) use case uses training data to
+- The [supervised document classification](https://github.com/msg-systems/holmes-extractor#supervised-document-classification) use case uses training data to
learn a classifier that assigns one or more **classification labels** to new documents based on what they are about.
It classifies a new document by matching it against phraselets that were extracted from the training documents in the
same way that phraselets are extracted from the query document in the topic matching use case. The technique is
inspired by bag-of-words-based classification algorithms that use n-grams, but aims to derive n-grams whose component
words are related semantically rather than that just happen to be neighbours in the surface representation of a language.

-In all four use cases, the **individual words** are matched using a [number of strategies](https://github.com/msg-systems/holmes-extractor/#word-level-matching-strategies).
+In all four use cases, the **individual words** are matched using a [number of strategies](https://github.com/msg-systems/holmes-extractor#word-level-matching-strategies).
To work out whether two grammatical structures that contain individually matching words correspond logically and
constitute a match, Holmes transforms the syntactic parse information provided by the [spaCy](https://spacy.io/) library
into semantic structures that allow texts to be compared using predicate logic. As a user of Holmes, you do not need to
understand the intricacies of how this works, although there are some
-[important tips](https://github.com/msg-systems/holmes-extractor/#writing-effective-search-phrases) around writing effective search phrases for the chatbot and
+[important tips](https://github.com/msg-systems/holmes-extractor#writing-effective-search-phrases) around writing effective search phrases for the chatbot and
structural extraction use cases that you should try and take on board.

Holmes aims to offer generalist solutions that can be used more or less out of the box with
relatively little tuning, tweaking or training and that are rapidly applicable to a wide range of use cases.
At its core lies a logical, programmed, rule-based system that describes how syntactic representations in each
language express semantic relationships. Although the supervised document classification use case does incorporate a
neural network and although the spaCy library upon which Holmes builds has itself been pre-trained using machine
-learning, the essentially rule-based nature of Holmes means that the chatbot, structural matching and topic matching use
+learning, the essentially rule-based nature of Holmes means that the chatbot, structural extraction and topic matching use
cases can be put to use out of the box without any training and that the supervised document classification use case
typically requires relatively little training data, which is a great advantage because pre-labelled training data is
not available for many real-world problems.
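To make the chatbot use case concrete before the example scripts below, here is a minimal sketch distilled from the files added in this pull request; the search phrase is illustrative, and it assumes the en_core_web_lg spaCy model is installed:

```python
import holmes_extractor as holmes

# Create a manager; number_of_workers limits the worker processes Holmes starts.
holmes_manager = holmes.Manager(model='en_core_web_lg', number_of_workers=2)

# Register the meaning the chatbot should listen for.
holmes_manager.register_search_phrase('Somebody requires insurance')

# Open an interactive console that matches each entered snippet
# against the registered search phrases.
holmes_manager.start_chatbot_mode_console()
```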
17 changes: 17 additions & 0 deletions examples/example_chatbot_DE_insurance.py
@@ -0,0 +1,17 @@
import os
import holmes_extractor as holmes

if __name__ in ('__main__', 'example_chatbot_DE_insurance'):
    script_directory = os.path.dirname(os.path.realpath(__file__))
    ontology = holmes.Ontology(os.sep.join((
        script_directory, 'example_chatbot_DE_insurance_ontology.owl')))
    holmes_manager = holmes.Manager(model='de_core_news_lg', ontology=ontology, number_of_workers=2)
    holmes_manager.register_search_phrase('Jemand benötigt eine Versicherung')
    holmes_manager.register_search_phrase('Ein ENTITYPER schließt eine Versicherung ab')
    holmes_manager.register_search_phrase('ENTITYPER benötigt eine Versicherung')
    holmes_manager.register_search_phrase('Eine Versicherung für einen Zeitraum')
    holmes_manager.register_search_phrase('Eine Versicherung fängt an')
    holmes_manager.register_search_phrase('Jemand zahlt voraus')

    holmes_manager.start_chatbot_mode_console()
    # e.g. 'Richard Hudson und Max Mustermann brauchen eine Krankenversicherung für die nächsten fünf Jahre'
20 changes: 20 additions & 0 deletions examples/example_chatbot_EN_insurance.py
@@ -0,0 +1,20 @@
import os
import holmes_extractor as holmes

if __name__ in ('__main__', 'example_chatbot_EN_insurance'):
    script_directory = os.path.dirname(os.path.realpath(__file__))
    ontology = holmes.Ontology(os.sep.join((
        script_directory, 'example_chatbot_EN_insurance_ontology.owl')))
    holmes_manager = holmes.Manager(
        model='en_core_web_lg', ontology=ontology, number_of_workers=2)
    holmes_manager.register_search_phrase('Somebody requires insurance')
    holmes_manager.register_search_phrase('An ENTITYPERSON takes out insurance')
    holmes_manager.register_search_phrase('A company buys payment insurance')
    holmes_manager.register_search_phrase('An ENTITYPERSON needs insurance')
    holmes_manager.register_search_phrase('Insurance for a period')
    holmes_manager.register_search_phrase('An insurance begins')
    holmes_manager.register_search_phrase('Somebody prepays')
    holmes_manager.register_search_phrase('Somebody makes an insurance payment')

    holmes_manager.start_chatbot_mode_console()
    # e.g. 'Richard Hudson and John Doe require health insurance for the next five years'
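A note on the `if __name__ in ('__main__', ...)` guard that both chatbot examples now use: Holmes 3 starts worker processes (see the "Redesign using workers" commit above), and on platforms where Python spawns rather than forks subprocesses the example module is re-imported in each worker, so the guard plausibly prevents the script body from being executed again inside the workers.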
examples/example_search_DE_law.py
@@ -13,10 +13,11 @@ def download_and_register(url, label):
     holmes_manager.parse_and_register_document(soup.get_text(), label)

 # Start the Holmes Manager with the German model
-holmes_manager = holmes.Manager(model='de_core_news_md')
-download_and_register('https://www.gesetze-im-internet.de/vvg_2008/BJNR263110007.html', 'VVG_2008')
-download_and_register('https://www.gesetze-im-internet.de/vag_2016/BJNR043410015.html', 'VAG')
-holmes_manager.start_topic_matching_search_mode_console()
+if __name__ in ('__main__', 'example_search_DE_law'):
+    holmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=2)
+    download_and_register('https://www.gesetze-im-internet.de/vvg_2008/BJNR263110007.html', 'VVG_2008')
+    download_and_register('https://www.gesetze-im-internet.de/vag_2016/BJNR043410015.html', 'VAG')
+    holmes_manager.start_topic_matching_search_mode_console(initial_question_word_embedding_match_threshold=0.7)

 # Example queries:
 #
examples/example_search_DE_literature.py
@@ -11,19 +11,18 @@
 HOLMES_EXTENSION = 'hdc'
 flag_filename = os.sep.join((working_directory, 'STORY_PARSING_COMPLETE'))

-print('Initializing Holmes...')
+print('Initializing Holmes (this may take some time) ...')
 # Start the Holmes manager with the German model
-holmes_manager = holmes.MultiprocessingManager(
-    model='de_core_news_md', overall_similarity_threshold=0.85, number_of_workers=4)
+# set number_of_workers to prevent memory exhaustion / swapping; it should never be more
+# than the number of cores on the machine
+holmes_manager = holmes.Manager(
+    model='de_core_news_lg')

-def process_documents_from_front_page(
-        manager, front_page_uri, front_page_label):
+def process_documents_from_front_page(front_page_uri, front_page_label):
     """ Download and save all the stories from a front page."""

     front_page = urllib.request.urlopen(front_page_uri)
     front_page_soup = BeautifulSoup(front_page, 'html.parser')
+    document_texts = []
+    labels = []
     # For each story ...
     for anchor in front_page_soup.find_all('a'):
         if not anchor['href'].startswith('/') and not anchor['href'].startswith('https'):
@@ -44,15 +43,16 @@ def process_documents_from_front_page(
         this_document_text = ' '.join(this_document_text.split())
         # Create a document label from the front page label and the story name
         this_document_label = ' - '.join((front_page_label, anchor.contents[0]))
-        # Parse the document
-        print('Parsing', this_document_label)
-        manager.parse_and_register_document(this_document_text, this_document_label)
-        # Save the document
-        print('Saving', this_document_label)
-        output_filename = os.sep.join((working_directory, this_document_label))
-        output_filename = '.'.join((output_filename, HOLMES_EXTENSION))
-        with open(output_filename, "w") as file:
-            file.write(manager.serialize_document(this_document_label))
+        document_texts.append(this_document_text)
+        labels.append(this_document_label)
+    parsed_documents = holmes_manager.nlp.pipe(document_texts)
+    for index, parsed_document in enumerate(parsed_documents):
+        label = labels[index]
+        print('Saving', label)
+        output_filename = os.sep.join((working_directory, label))
+        output_filename = '.'.join((output_filename, HOLMES_EXTENSION))
+        with open(output_filename, "wb") as file:
+            file.write(parsed_document.to_bytes())

 def load_documents_from_working_directory():
     serialized_documents = {}
@@ -61,31 +61,31 @@ def load_documents_from_working_directory():
             print('Loading', file)
             label = file[:-4]
             long_filename = os.sep.join((working_directory, file))
-            with open(long_filename, "r") as file:
+            with open(long_filename, "rb") as file:
                 contents = file.read()
             serialized_documents[label] = contents
-    holmes_manager.deserialize_and_register_documents(serialized_documents)
+    print('Indexing documents (this may take some time) ...')
+    holmes_manager.register_serialized_documents(serialized_documents)

 if os.path.exists(working_directory):
     if not os.path.isdir(working_directory):
-        raise RuntimeError(' '.join((working_directory), 'must be a directory'))
+        raise RuntimeError(' '.join((working_directory, 'must be a directory')))
 else:
     os.mkdir(working_directory)

 if os.path.isfile(flag_filename):
     load_documents_from_working_directory()
 else:
-    normal_holmes_manager = holmes.Manager(model='de_core_news_md')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/grimm/", 'Gebrüder Grimm')
+        "https://maerchen.com/grimm/", 'Gebrüder Grimm')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/grimm2/", 'Gebrüder Grimm')
+        "https://maerchen.com/grimm2/", 'Gebrüder Grimm')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/andersen/", 'Hans Christian Andersen')
+        "https://maerchen.com/andersen/", 'Hans Christian Andersen')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/bechstein/", 'Ludwig Bechstein')
+        "https://maerchen.com/bechstein/", 'Ludwig Bechstein')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/wolf/", 'Johann Wilhelm Wolf')
+        "https://maerchen.com/wolf/", 'Johann Wilhelm Wolf')
     # Generate flag file to indicate files can be reloaded on next run
     open(flag_filename, 'a').close()
     load_documents_from_working_directory()
@@ -101,8 +101,8 @@ def load_documents_from_working_directory():

 class RestHandler():
     def on_get(self, req, resp):
-        resp.body = \
-            json.dumps(holmes_manager.topic_match_documents_returning_dictionaries_against(
+        resp.text = \
+            json.dumps(holmes_manager.topic_match_documents_against(
                 req.params['entry'][0:200], only_one_result_per_document=True))
         resp.cache_control = ["s-maxage=31536000"]
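Pulling together the API changes visible in this last diff (parsing via the manager's `nlp` pipeline, spaCy 3 byte serialization with `to_bytes()`, `register_serialized_documents`, and the renamed `topic_match_documents_against`), here is a minimal end-to-end sketch; the text, label and query are illustrative, and it assumes the de_core_news_lg model is installed:

```python
import holmes_extractor as holmes

holmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=2)

# Parse with the manager's spaCy pipeline and serialize to bytes,
# as the reworked literature example now does.
doc = holmes_manager.nlp('Der Wolf lief in den Wald.')
serialized = doc.to_bytes()

# Register a dictionary mapping document labels to serialized documents.
holmes_manager.register_serialized_documents({'Beispiel': serialized})

# Topic matching returns JSON-serializable dictionaries, which is what
# the REST handler above relies on.
results = holmes_manager.topic_match_documents_against(
    'Wohin lief der Wolf?', only_one_result_per_document=True)
print(results)
```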