Merge v3.0.0 into master #6

Merged 84 commits into master from v3.0.0 on Sep 10, 2021
84 commits
e116857
Treat topic matches of two subwords within a word as word-level rathe…
richardpaulhudson Feb 3, 2021
9be5069
Frequency factor
richardpaulhudson Feb 5, 2021
1a3c970
different_match_cutoff_score
richardpaulhudson Feb 5, 2021
4f81369
Sorted out German participles
richardpaulhudson Feb 8, 2021
cb2ff9c
Reverse dependency matching
richardpaulhudson Feb 9, 2021
be83572
Fixed bug; new method to get word match by search phrase word index
richardpaulhudson Feb 10, 2021
8868731
Fixed German self-reference bug
richardpaulhudson Feb 11, 2021
f4c63d1
Improved error output
richardpaulhudson Feb 11, 2021
84b5fa1
Fixed bug with wrong subword being considered within structural matches
richardpaulhudson Feb 19, 2021
f73d21a
Upgrade to Spacy 3, test conversion not yet complete
richardpaulhudson May 5, 2021
f849567
Most but not all tests passing again
richardpaulhudson May 6, 2021
7e87078
All pre-existing tests now passing with Spacy 3
richardpaulhudson May 7, 2021
bd86194
Intermediate state during refactoring, all tests passing except Multipr.
richardpaulhudson Jul 7, 2021
d69e7ab
Intermediate state during refactoring, all tests passing except Multipr.
richardpaulhudson Jul 7, 2021
1f529d5
Intermediate state during refactoring, all tests passing except Multipr.
richardpaulhudson Jul 8, 2021
5a370a6
Intermediate state during refactoring, all tests passing except Multipr.
richardpaulhudson Jul 8, 2021
d65db7e
Redesign using workers, seems to work, regression tests do not yet run
richardpaulhudson Jul 9, 2021
6d3fcc7
Minor changes
richardpaulhudson Jul 9, 2021
33a8fe5
Removed ThreadsafeContainer
richardpaulhudson Jul 9, 2021
b6ec111
en and de tests passing again
richardpaulhudson Jul 12, 2021
846c731
manager, serialization and word level matching tests passing
richardpaulhudson Jul 12, 2021
44960fe
Pre-existing tests passing again
richardpaulhudson Jul 13, 2021
16f1374
spaCy 3.1; de tests not passing yet
richardpaulhudson Jul 15, 2021
6eaff2d
spaCy 3.1: all tests now passing
richardpaulhudson Jul 15, 2021
4039b47
Corrected execution order in multithreading test
richardpaulhudson Jul 15, 2021
7f73caa
German coreference tests; improved subject/object recognition in German
richardpaulhudson Jul 16, 2021
d3bdf0d
Coreference combined with subwords
richardpaulhudson Jul 16, 2021
8ab5873
New semantics for extracted_word
richardpaulhudson Jul 19, 2021
5bfff45
Suppress relation matches where subwords in search text and document
richardpaulhudson Jul 19, 2021
dc92833
Frequency factors driving relation- and embedding-based topic matching
richardpaulhudson Jul 20, 2021
369fc7a
Corpus-wide indexing
richardpaulhudson Jul 20, 2021
0b5ff9e
First part of question word processing
richardpaulhudson Aug 2, 2021
2120c45
Question word processing
richardpaulhudson Aug 2, 2021
c01a864
Question word processing
richardpaulhudson Aug 3, 2021
fab989e
Question word processing, preexisting tests passing, no new tests yet
richardpaulhudson Aug 3, 2021
ace7f87
Corrected multithreading test
richardpaulhudson Aug 3, 2021
be8c2c3
Question word processing, preexisting tests running, no new tests yet
richardpaulhudson Aug 3, 2021
61caafa
Corrections
richardpaulhudson Aug 3, 2021
c2de7f4
Correction to embedding-based matching reporting with subwords
richardpaulhudson Aug 4, 2021
22dd8c7
Question word processing, preexisting tests running, no new tests yet
richardpaulhudson Aug 4, 2021
97f1095
entity_embedding matching; Question word processing - first tests
richardpaulhudson Aug 4, 2021
48a0d09
Improvements and more tests
richardpaulhudson Aug 4, 2021
a466176
Updated license date
richardpaulhudson Aug 4, 2021
3b3c8e7
More tests and corrections
richardpaulhudson Aug 5, 2021
51da728
Further corrections
richardpaulhudson Aug 5, 2021
00071c1
English question word tests complete
richardpaulhudson Aug 5, 2021
aa0fec7
Correction
richardpaulhudson Aug 5, 2021
28c9098
More German tests
richardpaulhudson Aug 5, 2021
d5932d0
Correction
richardpaulhudson Aug 5, 2021
f5a40ea
Added English question test
richardpaulhudson Aug 5, 2021
285ac52
Improvements
richardpaulhudson Aug 6, 2021
bb35aa0
All question tests
richardpaulhudson Aug 6, 2021
267537a
Additional question rules and tests
richardpaulhudson Aug 24, 2021
d66ae27
Redid consoles
richardpaulhudson Aug 27, 2021
86b9a34
Redid examples
richardpaulhudson Aug 27, 2021
5c706d4
Updated subword rules to reflect more mature spaCy model
richardpaulhudson Aug 30, 2021
8c3fe6d
Improvements to question answering
richardpaulhudson Aug 30, 2021
c0a7819
Another improvement
richardpaulhudson Aug 30, 2021
a18a85c
Further improvements to question handling
richardpaulhudson Aug 30, 2021
2c9cd5f
Further improvements to English question handling
richardpaulhudson Aug 30, 2021
25849e2
English question examples finished
richardpaulhudson Aug 31, 2021
604ac63
More question answering improvements
richardpaulhudson Aug 31, 2021
99e8faa
Accept hyphens in question answers
richardpaulhudson Aug 31, 2021
44d97c8
German question examples finished
richardpaulhudson Aug 31, 2021
1b08270
Final touches to example scripts
richardpaulhudson Aug 31, 2021
54ef7db
Code corrections
richardpaulhudson Sep 6, 2021
6b2ee1e
Correction
richardpaulhudson Sep 6, 2021
8daafb7
Further correction
richardpaulhudson Sep 6, 2021
da7fcc2
Naming improvement
richardpaulhudson Sep 7, 2021
93c2945
saved (partially complete)
richardpaulhudson Sep 8, 2021
be1b36b
First draft of new README.md
richardpaulhudson Sep 8, 2021
07df2f8
Corrections
richardpaulhudson Sep 8, 2021
8b91c14
Corrections
richardpaulhudson Sep 8, 2021
b06dba5
Corrections
richardpaulhudson Sep 8, 2021
7b84175
Corrections
richardpaulhudson Sep 8, 2021
fd02ae9
Corrections
richardpaulhudson Sep 9, 2021
42b1d82
Increased TIMEOUT_SECONDS
richardpaulhudson Sep 9, 2021
7cc728e
Increased TIMEOUT_SECONDS
richardpaulhudson Sep 9, 2021
8ee3490
Comment about model resource requirements
richardpaulhudson Sep 10, 2021
7a8dc15
Added MANIFEST.in
richardpaulhudson Sep 10, 2021
ff9e836
Correction to MANIFEST.in
richardpaulhudson Sep 10, 2021
046d775
Updated SHORTREADME.md
richardpaulhudson Sep 10, 2021
3c06b78
Correction
richardpaulhudson Sep 10, 2021
3024df1
Note on installation
richardpaulhudson Sep 10, 2021
129 changes: 129 additions & 0 deletions .gitignore
@@ -0,0 +1,129 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
-Copyright 2019-2020 msg systems ag
+Copyright 2019-2021 msg systems ag

The Holmes library is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,4 @@
include SHORTREADME.md
global-include *.cfg
global-include *.csv
global-include LICENSE
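(For readers unfamiliar with Python packaging: MANIFEST.in tells setuptools which extra files to bundle into the source distribution, so the lines above presumably exist to ship SHORTREADME.md, likely used as the PyPI description, plus the language-specific .cfg and .csv resources and the LICENSE alongside the code.)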
1,085 changes: 434 additions & 651 deletions README.md

Large diffs are not rendered by default.

32 changes: 17 additions & 15 deletions SHORTREADME.md
@@ -1,49 +1,51 @@
-**Holmes** is a Python 3 library (tested with version 3.7.7) that supports a number of
-use cases involving information extraction from English and German texts. In all use cases, the information extraction
-is based on analysing the semantic relationships expressed by the component parts of each sentence:
+**Holmes** is a Python 3 library (tested with version 3.9.5) running on top of
+[spaCy](https://spacy.io/) (tested with version 3.1.2) that supports a number of use cases
+involving information extraction from English and German texts. In all use cases, the information
+extraction is based on analysing the semantic relationships expressed by the component parts of
+each sentence:

-- In the [chatbot](https://github.com/msg-systems/holmes-extractor/#getting-started) use case, the system is configured using one or more **search phrases**.
+- In the [chatbot](https://github.com/msg-systems/holmes-extractor#getting-started) use case, the system is configured using one or more **search phrases**.
Holmes then looks for structures whose meanings correspond to those of these search phrases within
a searched **document**, which in this case corresponds to an individual snippet of text or speech
entered by the end user. Within a match, each word with its own meaning (i.e. that does not merely fulfil a grammatical function) in the search phrase
corresponds to one or more such words in the document. Both the fact that a search phrase was matched and any structured information the search phrase extracts can be used to drive the chatbot.

-- The [structural extraction](https://github.com/msg-systems/holmes-extractor/#structural-extraction) use case uses exactly the same
-[structural matching](https://github.com/msg-systems/holmes-extractor/#how-it-works-structural-matching) technology as the chatbot use
+- The [structural extraction](https://github.com/msg-systems/holmes-extractor#structural-extraction) use case uses exactly the same
+[structural matching](https://github.com/msg-systems/holmes-extractor#how-it-works-structural-matching) technology as the chatbot use
case, but searching takes place with respect to a pre-existing document or documents that are typically much
-longer than the snippets analysed in the chatbot use case, and the aim to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to
+longer than the snippets analysed in the chatbot use case, and the aim is to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to
take over a second company. The identities of the companies concerned could then be stored in a database.

-- The [topic matching](https://github.com/msg-systems/holmes-extractor/#topic-matching) use case aims to find passages in a document or documents whose meaning
+- The [topic matching](https://github.com/msg-systems/holmes-extractor#topic-matching) use case aims to find passages in a document or documents whose meaning
is close to that of another document, which takes on the role of the **query document**, or to that of a **query phrase** entered ad-hoc by the user. Holmes extracts a number of small **phraselets** from the query phrase or
-query document, matches the documents being searched against each phraselet, and conflates the results to find the
-most relevant passages within the documents. Because there is no strict requirement that every word with its own
-meaning in the query document match a specific word or words in the searched documents, more matches are found
+query document, matches the documents being searched against each phraselet, and conflates the results to find
+the most relevant passages within the documents. Because there is no strict requirement that every
+word with its own meaning in the query document match a specific word or words in the searched documents, more matches are found
than in the structural extraction use case, but the matches do not contain structured information that can be
used in subsequent processing. The topic matching use case is demonstrated by [a website allowing searches within
the Harry Potter corpus (for English) and around 350 traditional stories (for German)](http://holmes-demo.xt.msg.team/).

-- The [supervised document classification](https://github.com/msg-systems/holmes-extractor/#supervised-document-classification) use case uses training data to
+- The [supervised document classification](https://github.com/msg-systems/holmes-extractor#supervised-document-classification) use case uses training data to
learn a classifier that assigns one or more **classification labels** to new documents based on what they are about.
It classifies a new document by matching it against phraselets that were extracted from the training documents in the
same way that phraselets are extracted from the query document in the topic matching use case. The technique is
inspired by bag-of-words-based classification algorithms that use n-grams, but aims to derive n-grams whose component
words are related semantically rather than that just happen to be neighbours in the surface representation of a language.

-In all four use cases, the **individual words** are matched using a [number of strategies](https://github.com/msg-systems/holmes-extractor/#word-level-matching-strategies).
+In all four use cases, the **individual words** are matched using a [number of strategies](https://github.com/msg-systems/holmes-extractor#word-level-matching-strategies).
To work out whether two grammatical structures that contain individually matching words correspond logically and
constitute a match, Holmes transforms the syntactic parse information provided by the [spaCy](https://spacy.io/) library
into semantic structures that allow texts to be compared using predicate logic. As a user of Holmes, you do not need to
understand the intricacies of how this works, although there are some
-[important tips](https://github.com/msg-systems/holmes-extractor/#writing-effective-search-phrases) around writing effective search phrases for the chatbot and
+[important tips](https://github.com/msg-systems/holmes-extractor#writing-effective-search-phrases) around writing effective search phrases for the chatbot and
structural extraction use cases that you should try and take on board.

Holmes aims to offer generalist solutions that can be used more or less out of the box with
relatively little tuning, tweaking or training and that are rapidly applicable to a wide range of use cases.
At its core lies a logical, programmed, rule-based system that describes how syntactic representations in each
language express semantic relationships. Although the supervised document classification use case does incorporate a
neural network and although the spaCy library upon which Holmes builds has itself been pre-trained using machine
-learning, the essentially rule-based nature of Holmes means that the chatbot, structural matching and topic matching use
+learning, the essentially rule-based nature of Holmes means that the chatbot, structural extraction and topic matching use
cases can be put to use out of the box without any training and that the supervised document classification use case
typically requires relatively little training data, which is a great advantage because pre-labelled training data is
not available for many real-world problems.
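To make the chatbot use case concrete before the example scripts below, here is a minimal sketch distilled from the files added in this pull request; the search phrase is illustrative, and it assumes the en_core_web_lg spaCy model is installed:

```python
import holmes_extractor as holmes

# Create a manager; number_of_workers limits the worker processes Holmes starts.
holmes_manager = holmes.Manager(model='en_core_web_lg', number_of_workers=2)

# Register the meaning the chatbot should listen for.
holmes_manager.register_search_phrase('Somebody requires insurance')

# Open an interactive console that matches each entered snippet
# against the registered search phrases.
holmes_manager.start_chatbot_mode_console()
```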
17 changes: 17 additions & 0 deletions examples/example_chatbot_DE_insurance.py
@@ -0,0 +1,17 @@
import os
import holmes_extractor as holmes

if __name__ in ('__main__', 'example_chatbot_DE_insurance'):
    script_directory = os.path.dirname(os.path.realpath(__file__))
    ontology = holmes.Ontology(os.sep.join((
        script_directory, 'example_chatbot_DE_insurance_ontology.owl')))
    holmes_manager = holmes.Manager(model='de_core_news_lg', ontology=ontology, number_of_workers=2)
    holmes_manager.register_search_phrase('Jemand benötigt eine Versicherung')
    holmes_manager.register_search_phrase('Ein ENTITYPER schließt eine Versicherung ab')
    holmes_manager.register_search_phrase('ENTITYPER benötigt eine Versicherung')
    holmes_manager.register_search_phrase('Eine Versicherung für einen Zeitraum')
    holmes_manager.register_search_phrase('Eine Versicherung fängt an')
    holmes_manager.register_search_phrase('Jemand zahlt voraus')

    holmes_manager.start_chatbot_mode_console()
    # e.g. 'Richard Hudson und Max Mustermann brauchen eine Krankenversicherung für die nächsten fünf Jahre'
20 changes: 20 additions & 0 deletions examples/example_chatbot_EN_insurance.py
@@ -0,0 +1,20 @@
import os
import holmes_extractor as holmes

if __name__ in ('__main__', 'example_chatbot_EN_insurance'):
    script_directory = os.path.dirname(os.path.realpath(__file__))
    ontology = holmes.Ontology(os.sep.join((
        script_directory, 'example_chatbot_EN_insurance_ontology.owl')))
    holmes_manager = holmes.Manager(
        model='en_core_web_lg', ontology=ontology, number_of_workers=2)
    holmes_manager.register_search_phrase('Somebody requires insurance')
    holmes_manager.register_search_phrase('An ENTITYPERSON takes out insurance')
    holmes_manager.register_search_phrase('A company buys payment insurance')
    holmes_manager.register_search_phrase('An ENTITYPERSON needs insurance')
    holmes_manager.register_search_phrase('Insurance for a period')
    holmes_manager.register_search_phrase('An insurance begins')
    holmes_manager.register_search_phrase('Somebody prepays')
    holmes_manager.register_search_phrase('Somebody makes an insurance payment')

    holmes_manager.start_chatbot_mode_console()
    # e.g. 'Richard Hudson and John Doe require health insurance for the next five years'
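A note on the `if __name__ in ('__main__', ...)` guard that both chatbot examples now use: Holmes 3 starts worker processes (see the "Redesign using workers" commit above), and on platforms where Python spawns rather than forks subprocesses the example module is re-imported in each worker, so the guard plausibly prevents the script body from being executed again inside the workers.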
examples/example_search_DE_law.py
@@ -13,10 +13,11 @@ def download_and_register(url, label):
     holmes_manager.parse_and_register_document(soup.get_text(), label)

 # Start the Holmes Manager with the German model
-holmes_manager = holmes.Manager(model='de_core_news_md')
-download_and_register('https://www.gesetze-im-internet.de/vvg_2008/BJNR263110007.html', 'VVG_2008')
-download_and_register('https://www.gesetze-im-internet.de/vag_2016/BJNR043410015.html', 'VAG')
-holmes_manager.start_topic_matching_search_mode_console()
+if __name__ in ('__main__', 'example_search_DE_law'):
+    holmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=2)
+    download_and_register('https://www.gesetze-im-internet.de/vvg_2008/BJNR263110007.html', 'VVG_2008')
+    download_and_register('https://www.gesetze-im-internet.de/vag_2016/BJNR043410015.html', 'VAG')
+    holmes_manager.start_topic_matching_search_mode_console(initial_question_word_embedding_match_threshold=0.7)

 # Example queries:
 #
examples/example_search_DE_literature.py
@@ -11,19 +11,18 @@
 HOLMES_EXTENSION = 'hdc'
 flag_filename = os.sep.join((working_directory, 'STORY_PARSING_COMPLETE'))

-print('Initializing Holmes...')
+print('Initializing Holmes (this may take some time) ...')
 # Start the Holmes manager with the German model
-holmes_manager = holmes.MultiprocessingManager(
-    model='de_core_news_md', overall_similarity_threshold=0.85, number_of_workers=4)
+# set number_of_workers to prevent memory exhaustion / swapping; it should never be more
+# than the number of cores on the machine
+holmes_manager = holmes.Manager(
+    model='de_core_news_lg')

-def process_documents_from_front_page(
-        manager, front_page_uri, front_page_label):
+def process_documents_from_front_page(front_page_uri, front_page_label):
     """ Download and save all the stories from a front page."""

     front_page = urllib.request.urlopen(front_page_uri)
     front_page_soup = BeautifulSoup(front_page, 'html.parser')
+    document_texts = []
+    labels = []
     # For each story ...
     for anchor in front_page_soup.find_all('a'):
         if not anchor['href'].startswith('/') and not anchor['href'].startswith('https'):
@@ -44,15 +43,16 @@ def process_documents_from_front_page(
         this_document_text = ' '.join(this_document_text.split())
         # Create a document label from the front page label and the story name
         this_document_label = ' - '.join((front_page_label, anchor.contents[0]))
-        # Parse the document
-        print('Parsing', this_document_label)
-        manager.parse_and_register_document(this_document_text, this_document_label)
-        # Save the document
-        print('Saving', this_document_label)
-        output_filename = os.sep.join((working_directory, this_document_label))
-        output_filename = '.'.join((output_filename, HOLMES_EXTENSION))
-        with open(output_filename, "w") as file:
-            file.write(manager.serialize_document(this_document_label))
+        document_texts.append(this_document_text)
+        labels.append(this_document_label)
+    parsed_documents = holmes_manager.nlp.pipe(document_texts)
+    for index, parsed_document in enumerate(parsed_documents):
+        label = labels[index]
+        print('Saving', label)
+        output_filename = os.sep.join((working_directory, label))
+        output_filename = '.'.join((output_filename, HOLMES_EXTENSION))
+        with open(output_filename, "wb") as file:
+            file.write(parsed_document.to_bytes())

 def load_documents_from_working_directory():
     serialized_documents = {}
@@ -61,31 +61,31 @@ def load_documents_from_working_directory():
             print('Loading', file)
             label = file[:-4]
             long_filename = os.sep.join((working_directory, file))
-            with open(long_filename, "r") as file:
+            with open(long_filename, "rb") as file:
                 contents = file.read()
             serialized_documents[label] = contents
-    holmes_manager.deserialize_and_register_documents(serialized_documents)
+    print('Indexing documents (this may take some time) ...')
+    holmes_manager.register_serialized_documents(serialized_documents)

 if os.path.exists(working_directory):
     if not os.path.isdir(working_directory):
-        raise RuntimeError(' '.join((working_directory), 'must be a directory'))
+        raise RuntimeError(' '.join((working_directory, 'must be a directory')))
 else:
     os.mkdir(working_directory)

 if os.path.isfile(flag_filename):
     load_documents_from_working_directory()
 else:
-    normal_holmes_manager = holmes.Manager(model='de_core_news_md')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/grimm/", 'Gebrüder Grimm')
+        "https://maerchen.com/grimm/", 'Gebrüder Grimm')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/grimm2/", 'Gebrüder Grimm')
+        "https://maerchen.com/grimm2/", 'Gebrüder Grimm')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/andersen/", 'Hans Christian Andersen')
+        "https://maerchen.com/andersen/", 'Hans Christian Andersen')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/bechstein/", 'Ludwig Bechstein')
+        "https://maerchen.com/bechstein/", 'Ludwig Bechstein')
     process_documents_from_front_page(
-        normal_holmes_manager, "https://maerchen.com/wolf/", 'Johann Wilhelm Wolf')
+        "https://maerchen.com/wolf/", 'Johann Wilhelm Wolf')
     # Generate flag file to indicate files can be reloaded on next run
     open(flag_filename, 'a').close()
     load_documents_from_working_directory()
@@ -101,8 +101,8 @@ def load_documents_from_working_directory():

 class RestHandler():
     def on_get(self, req, resp):
-        resp.body = \
-            json.dumps(holmes_manager.topic_match_documents_returning_dictionaries_against(
+        resp.text = \
+            json.dumps(holmes_manager.topic_match_documents_against(
                 req.params['entry'][0:200], only_one_result_per_document=True))
         resp.cache_control = ["s-maxage=31536000"]
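Pulling together the API changes visible in this last diff (parsing via the manager's `nlp` pipeline, spaCy 3 byte serialization with `to_bytes()`, `register_serialized_documents`, and the renamed `topic_match_documents_against`), here is a minimal end-to-end sketch; the text, label and query are illustrative, and it assumes the de_core_news_lg model is installed:

```python
import holmes_extractor as holmes

holmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=2)

# Parse with the manager's spaCy pipeline and serialize to bytes,
# as the reworked literature example now does.
doc = holmes_manager.nlp('Der Wolf lief in den Wald.')
serialized = doc.to_bytes()

# Register a dictionary mapping document labels to serialized documents.
holmes_manager.register_serialized_documents({'Beispiel': serialized})

# Topic matching returns JSON-serializable dictionaries, which is what
# the REST handler above relies on.
results = holmes_manager.topic_match_documents_against(
    'Wohin lief der Wolf?', only_one_result_per_document=True)
print(results)
```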