Dev #2

Merged
merged 130 commits into from
Mar 7, 2024
1bac178
Set up general workflow for new NER function in bio_output_creation.py
Oct 26, 2020
4ee0fce
Tokens tagged with 'U' or 'X' in Nestor are 'O'
Oct 28, 2020
442a91f
Fix string matching when searching vocab file for token
Oct 28, 2020
a1c540e
Move code into function
Oct 28, 2020
3d572d4
Move new (draft) function to keyword.py
Oct 28, 2020
8a47670
add new keyword tests (token_to_alias example)
Nov 3, 2020
cd23fb6
fixed iob_extractor, added test (failed)
Nov 3, 2020
8bbc771
Access 'tokens' column instead of index for string matching
Nov 4, 2020
f85b2ea
Switch to nestor-native utility functions (load_excavators, generate_…
rtbs-dev Nov 5, 2020
743f5aa
Reformat iob_extractor output as DataFrame and move notebook to examples
Nov 6, 2020
63ba43c
Simplify tag logic
Nov 6, 2020
5e12ff6
iob_extractor handles combined tokens (i.e. 'grease_line') and passes…
Nov 6, 2020
ff1ae7c
use nestorParams for NE name groups
Nov 9, 2020
2a84237
fix gitignore
Nov 9, 2020
43a5f60
Add unit tests for combined tag labeling, begin to label tokens that …
Nov 9, 2020
f1e2e39
Pass test_iob_extractor_2g_tokens unit test
Nov 18, 2020
98879e9
Change loop behavior to pass test_iob_extractor_extended_tokens (all …
Nov 24, 2020
5a6672f
Minor clean up/commenting in IOB function
Nov 24, 2020
e2a5910
Go through MWO with while loop for better control of indexing multi-w…
Nov 25, 2020
73a0d68
Delete bio_output_creation.ipynb
conteam Nov 25, 2020
1ac9bb4
style edits to readme
rtbs-dev Dec 14, 2020
804acd8
fix dynamic-versioning issue (v0.10 poetry-core)
rtbs-dev Dec 16, 2020
bb16d81
After 'poetry run task format'
Dec 16, 2020
03db04c
Merge branch 'bio' of https://gitlab.nist.gov/gitlab/kea/nestor into bio
Dec 16, 2020
1c0ed4b
use nist header/footer
rtbs-dev Jan 28, 2021
6629109
add docs branch to gitlab ci
rtbs-dev Jan 28, 2021
c2a6b09
ensure updated poetry-dynamic-versioning
rtbs-dev Jan 28, 2021
bfb3d38
rm tables dep
rtbs-dev Jun 17, 2021
76231ff
working mkdocstrings and pdf
rtbs-dev Jun 21, 2021
e7a5fb7
optimize imports
rtbs-dev Jun 21, 2021
ef0c8df
subtitle pdf
rtbs-dev Aug 11, 2021
ec2f085
nestor cache folder and remote excavators
rtbs-dev Aug 11, 2021
1d9cd96
get and test excavators from remote (uwa)
rtbs-dev Aug 12, 2021
3892b71
rm deprecated excavator csv's
rtbs-dev Aug 12, 2021
3f9f886
rm csv from pyproject.toml
rtbs-dev Aug 12, 2021
83f96d2
use mamba runner and modify ci/cd
rtbs-dev Aug 12, 2021
411b821
don't test docs... unnecessary
rtbs-dev Aug 12, 2021
0b03a62
cache pip and use base (poetry)
rtbs-dev Aug 12, 2021
d03ef85
cache...nvm
rtbs-dev Aug 12, 2021
1a8645b
switch to conda if headless-chrome is a prob?
rtbs-dev Aug 12, 2021
2affdb2
no conda init?
rtbs-dev Aug 12, 2021
ef13eb7
headless-chrome was needed by render_js...
rtbs-dev Aug 12, 2021
4075b98
try to add roboto
rtbs-dev Aug 13, 2021
6b47328
set gitlab repo_url in ci yaml
rtbs-dev Aug 13, 2021
810f94c
mkdocs env var and apt-get fonts-roboto
rtbs-dev Aug 13, 2021
7235d5a
spacessss
rtbs-dev Aug 13, 2021
97a2530
better stages
rtbs-dev Aug 13, 2021
63f240d
Merge branch 'dev' of gitlab.nist.gov:kea/nestor into dev
rtbs-dev Aug 13, 2021
cb86304
Update .gitlab-ci.yml
rtbs-dev Aug 13, 2021
f9b8471
test report and apt-get -y
rtbs-dev Aug 13, 2021
cdac001
cache conda
rtbs-dev Aug 13, 2021
3342283
fix mkdocs url and before_script
rtbs-dev Aug 13, 2021
0da2a89
rules are array of str
rtbs-dev Aug 13, 2021
a26207f
texlive fonts?
rtbs-dev Aug 13, 2021
8834a7e
apt -y
rtbs-dev Aug 13, 2021
a3d7a5b
try new kea image
rtbs-dev Aug 13, 2021
077b8f3
make poetry venv globally cached
rtbs-dev Aug 13, 2021
5fd9ee0
merge conflict fix
Aug 18, 2021
2c3b285
Create script to demonstrate file format required for using spacy and…
Aug 25, 2021
86572eb
Modify tag formatting to match IOB conventions, still not 100% functi…
Aug 25, 2021
b3958b4
Update testing for correct IOB format (updated tests dont pass yet)
Aug 31, 2021
f910114
Cleanup unit tests
Aug 31, 2021
1edfe4b
Update IOB function logic to handle inner tags, unit tests all pass
Sep 2, 2021
3993e86
Update IOB to handle multi-token X/U tags
Sep 2, 2021
5146125
more cleanup of spacy example
Sep 8, 2021
dc49298
Remove print statement from debugging
Sep 8, 2021
214d4d2
Clean up keyword unit tests
Sep 8, 2021
4792506
remove cache artefacts
rtbs-dev Sep 8, 2021
b4e36bb
modernize .gitignore
rtbs-dev Sep 8, 2021
6e36229
update to use `isinstance` rather than `type() is`
rtbs-dev Sep 8, 2021
faf2c01
refactor iob test to group w/ fixtures
rtbs-dev Sep 8, 2021
a4de848
Add unit test for tokens not in vocab list
Sep 9, 2021
ac245c2
working speed-up, needs non-vocab tokens to pass
rtbs-dev Sep 9, 2021
2ba9dd4
working iob refactor
rtbs-dev Sep 10, 2021
2354e8a
move notebooks to /notebooks
rtbs-dev Sep 10, 2021
61abaa7
Updated the readme with new links, text about TLP
MichaelPBrundage Sep 10, 2021
2306047
Merge remote-tracking branch 'nist-origin/bio' into bio-refactor
rtbs-dev Sep 10, 2021
f271a64
Merge branch 'bio-refactor' into 'bio'
rtbs-dev Sep 10, 2021
9c01883
Add headings/text to spacy example
Sep 10, 2021
b9dac72
new vocab interface (needs remote url's)
rtbs-dev Sep 10, 2021
4b1b261
new way to get vocabs
rtbs-dev Sep 10, 2021
39f8a6a
Merge branch 'bio-vocab' into bio
rtbs-dev Sep 10, 2021
acd0ef3
move notebooks to docs/examples
rtbs-dev Sep 10, 2021
e924afc
Create notebook for NLTK example
Sep 10, 2021
e24bf83
Create notebook for NLTK example
Sep 10, 2021
1eb18ae
Update spacy example
Sep 10, 2021
6a6bc1c
fix spaces in vocab2g, add ner-example subfolder
rtbs-dev Sep 10, 2021
650bd53
fix nans in iob-extract vocab dict
rtbs-dev Sep 10, 2021
8e4010b
Merge branch 'bio' of gitlab.nist.gov:kea/nestor into bio
rtbs-dev Sep 10, 2021
6f8159a
rename to order mkdocs pages
rtbs-dev Sep 10, 2021
88c6766
move example files
Sep 13, 2021
51050ed
Merge branch 'bio' of https://gitlab.nist.gov/gitlab/kea/nestor-suite…
Sep 13, 2021
0a31d3e
Update nestor-docs conda env dependencies
Sep 13, 2021
e931655
Update NLTK example and include md with output
Sep 13, 2021
8d7a523
Update Spacy example and add associated .md file with output
Sep 13, 2021
5b6dc6a
new TagExtractor to group utils. also, properties
rtbs-dev Sep 13, 2021
072c92d
working tagextractor and goodies
rtbs-dev Sep 14, 2021
1e842db
convenience funcs
rtbs-dev Sep 14, 2021
9384416
Merge branch 'tagextractor' into bio
rtbs-dev Sep 14, 2021
bd92b9e
add jupytext survival analysis
rtbs-dev Sep 14, 2021
b0a974f
update survival analysis w/ TagExtractor
rtbs-dev Sep 14, 2021
463c37b
Merge branch 'bio' into 'dev'
rtbs-dev Sep 14, 2021
5e43920
add codemeta.yaml
rtbs-dev Sep 15, 2021
76718ba
add notebook export to mkdocs
rtbs-dev Sep 15, 2021
f5f3866
Merge branch 'docs' of gitlab.nist.gov:kea/nestor into docs
rtbs-dev Sep 15, 2021
71a5ba2
jupytext auto-build for mkdocs
rtbs-dev Sep 15, 2021
357fbf6
Merge branch 'docs-mknotebooks' into 'docs'
rtbs-dev Sep 15, 2021
7e9c569
needs
rtbs-dev Sep 15, 2021
4ec642f
nb build env stage
rtbs-dev Sep 15, 2021
ab37ec3
always test?
rtbs-dev Sep 15, 2021
73ee65b
mamba env update
rtbs-dev Sep 15, 2021
93a6381
mamba env create
rtbs-dev Sep 15, 2021
0649d1a
no quotations
rtbs-dev Sep 15, 2021
7a7e2f0
try wrapping in quotes
rtbs-dev Sep 15, 2021
9527181
rm extraneous stage
rtbs-dev Sep 15, 2021
32dcd3d
oh, micromamba
rtbs-dev Sep 15, 2021
b858b00
use conda again
rtbs-dev Sep 15, 2021
f529b76
correct container tag
rtbs-dev Sep 15, 2021
576425d
job-specific vars
rtbs-dev Sep 15, 2021
acc7a98
maybe auto github deploy from gitlab
rtbs-dev Sep 15, 2021
c2e1572
oh right, it's bash not posix sh
rtbs-dev Sep 15, 2021
db946af
nist-origin
rtbs-dev Sep 15, 2021
3fe5cb6
Merge branch 'docs' of gitlab.nist.gov:kea/nestor into docs
rtbs-dev Sep 15, 2021
506a215
try mv to nist-pages?
rtbs-dev Sep 15, 2021
2b9ab85
hopefully stage for gh-actions
rtbs-dev Oct 4, 2021
addc7f2
add workflow for nist-pages
rtbs-dev Oct 5, 2021
921c9c6
fix pattern type hint
rtbs-dev Oct 5, 2021
c51734d
cleaning up docs and api (mkdocstrings)
rtbs-dev Oct 6, 2021
8a13164
bugfix
rtbs-dev Nov 24, 2021
7cc7386
fix excavators(raw) url
rtbs-dev Nov 29, 2021
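
Several commits above describe IOB tagging rules: tokens tagged 'U' or 'X' become 'O', combined tokens such as 'grease_line' are handled as one entity, and tokens missing from the vocab list are covered by a unit test. The following is a minimal sketch of those rules, not the `iob_extractor` code from this PR. The function name `iob_tags`, the two vocabulary dictionaries, and the tag letters other than 'U' and 'X' are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical vocabularies for illustration only; not taken from the PR.
VOCAB_1G: Dict[str, str] = {
    "replace": "S",
    "leaking": "P",
    "grease": "I",
    "line": "I",
    "the": "X",
}
VOCAB_2G: Dict[str, str] = {"grease line": "I"}

# Per the commit messages, 'U' and 'X' tokens are labelled 'O'.
NON_ENTITY = {"U", "X"}


def iob_tags(
    tokens: List[str], vocab_1g: Dict[str, str], vocab_2g: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Label each token with an IOB tag (B-<type>, I-<type>, or O)."""
    tagged: List[Tuple[str, str]] = []
    i = 0
    # A while loop lets the index jump by two when a two-word entity matches.
    while i < len(tokens):
        pair = " ".join(tokens[i : i + 2])
        pair_tag = vocab_2g.get(pair) if i + 1 < len(tokens) else None
        if pair_tag is not None and pair_tag not in NON_ENTITY:
            tagged.append((tokens[i], f"B-{pair_tag}"))
            tagged.append((tokens[i + 1], f"I-{pair_tag}"))
            i += 2
            continue
        tag = vocab_1g.get(tokens[i])
        if tag is None or tag in NON_ENTITY:
            # Unknown, stop-word, and out-of-vocabulary tokens are outside any entity.
            tagged.append((tokens[i], "O"))
        else:
            tagged.append((tokens[i], f"B-{tag}"))
        i += 1
    return tagged


print(iob_tags("replace the leaking grease line".split(), VOCAB_1G, VOCAB_2G))
# [('replace', 'B-S'), ('the', 'O'), ('leaking', 'B-P'), ('grease', 'B-I'), ('line', 'I-I')]
```

The two-word lookup runs before the single-token lookup, so 'grease line' is labelled as one entity rather than two separate ones.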
21 changes: 21 additions & 0 deletions .github/workflows/nist-pages.yml
@@ -0,0 +1,21 @@
name: NIST Pages
on:
  push:
    branches:
      - master
jobs:
  deploy-pages:
    container: ghcr.io/usnistgov/continuous-scientific-python:latest
    env:
      # POETRY_VIRTUALENVS_PATH: "${CI_PROJECT_DIR}/.cache/venv"
      MKDOCS_SITE_DIR: "public"
      MKDOCS_SITE_URL: "https://pages.nist.gov/nestor/"
      MKDOCS_REPO_URL: "https://github.com/usnistgov/nestor"


    steps:
      - name: "install dev dependencies"
        run: poetry install
      - name: "subtree script for nist-pages branch"
        run: |
          ./nist-pages-deploy.sh
194 changes: 165 additions & 29 deletions .gitignore
@@ -1,29 +1,165 @@
# special folder cache or others
**/__pycache__/
.idea/
**/.ipynb_checkpoints
**.egg-info


build
dist
public
_build
_static
_templates
_pdf
.doctrees

# special file type
**.pyc
**.csv
**/reveal.js
**.h5

# exception
!nestor/settings.yaml
!nestor/datasets/**

/poetry.lock
**/.mypy_cache/
**/.auctec-auto
# use jupytext
*.ipynb


### GITIGNORE.io TEMPLATE ###

# Created by https://www.toptal.com/developers/gitignore/api/python,jupyternotebooks
# Edit at https://www.toptal.com/developers/gitignore?templates=python,jupyternotebooks

### JupyterNotebooks ###
# gitignore template for Jupyter Notebooks
# website: http://jupyter.org/

.ipynb_checkpoints
*/.ipynb_checkpoints/*

# IPython
profile_default/
ipython_config.py

# Remove previous ipynb_checkpoints
# git rm -r .ipynb_checkpoints/

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook

# IPython

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# End of https://www.toptal.com/developers/gitignore/api/python,jupyternotebooks
/public/
/.idea/
/docs/.auctex-auto/
*.iob
63 changes: 44 additions & 19 deletions .gitlab-ci.yml
@@ -1,31 +1,56 @@
image: python:3.7
default:
image: $CI_REGISTRY/kea/templates/docker-conda:master
before_script:
- poetry install
cache:
key: ${CI_COMMIT_REF_SLUG}
paths:
- .cache/venv

stages:
# - test
# - lint
- test
- deploy

before_script:
- conda env update --file gldeployenv.yaml
- conda activate nestor-dev
- poetry install

# Unit Tests:
# stage: test
# script:
# - poetry run pytest

# Python Code Lint:
# stage: lint
# script:
# - poetry run task format
variables:
POETRY_VIRTUALENVS_PATH: "${CI_PROJECT_DIR}/.cache/venv"
MKDOCS_SITE_DIR: "public"

pytest:
stage: test
script:
- poetry run pytest --junitxml=report.xml
artifacts:
when: always
reports:
junit: report.xml

pages:
stage: deploy
variables:
MKDOCS_SITE_URL: "https://kea.ipages.nist.gov/nestor-suite/nestor/"
MKDOCS_REPO_URL: "https://gitlab.nist.gov/gitlab/kea/nestor-suite/nestor/"

script:
- poetry run task deploy-docs
# - conda env update --file docs/examples/nb-env.yml
# - jupytext --to notebook --execute docs/examples/**/*.py
- poetry run task deploy
artifacts:
paths:
- public
only:
- dev
- docs
- master

# nist-pages:
# stage: deploy
# variables:
# MKDOCS_SITE_URL: "https://pages.nist.gov/nestor/"
# MKDOCS_REPO_URL: "https://github.com/usnistgov/nestor"

# script:
# - conda env update --file docs/examples/nb-env.yml
# - jupytext --to notebook --execute docs/examples/**/*.py
# - poetry run task deploy-docs
# - git push nist-origin `git subtree split --prefix public master`:nist-pages --force
# only:
# - master
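
The `pytest` job in this file writes a JUnit report with `--junitxml=report.xml` and publishes it as an artifact. As a rough illustration of what that report contains, the sketch below totals the counters on each `<testsuite>` element using only the standard library. The function name `summarize_junit` and the sample XML are assumptions for illustration; a real `report.xml` from this pipeline carries more attributes and one `<testcase>` per test.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hand-written sample in the general shape of a pytest JUnit report (illustrative only).
SAMPLE_REPORT = """
<testsuites>
  <testsuite name="pytest" errors="0" failures="1" skipped="0" tests="3">
    <testcase classname="tests.test_keyword" name="test_a"/>
    <testcase classname="tests.test_keyword" name="test_b"/>
    <testcase classname="tests.test_keyword" name="test_c">
      <failure message="assert False"/>
    </testcase>
  </testsuite>
</testsuites>
"""


def summarize_junit(xml_text: str) -> Dict[str, int]:
    """Sum the tests/failures/errors/skipped counters over all test suites."""
    root = ET.fromstring(xml_text.strip())
    # The root may be a single <testsuite> or a <testsuites> wrapper.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    return totals


print(summarize_junit(SAMPLE_REPORT))
# {'tests': 3, 'failures': 1, 'errors': 0, 'skipped': 0}
```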
17 changes: 17 additions & 0 deletions CODEMETA.yaml
@@ -0,0 +1,17 @@
# NIST Opensource Portal repositories categories and themes.
# Source: https://www.nist.gov/topics
# The following NIST topics and categories are based on the NIST
# Taxonomy. It can be found here:
# https://data.nist.gov/od/id/691DDF3315711C14E0532457068146BE1907

categories:
  - scientific-software:
    - python
    - annotation
  - ai-ml:
    - NLP
themes:
  - Artificial intelligence
  - Information Technology
  - Manufacturing
  - Mathematics & Statistics
42 changes: 25 additions & 17 deletions docs/README.md
@@ -9,15 +9,26 @@

## Purpose

**Nestor** is a toolkit for using Natural Language Processing (NLP) with efficient user-interaction to perform structured data extraction with minimal annotation time-cost.

### The Problem
NLP in technical domains requires context sensitivity.
Whether for medical notes, engineering work-orders, or social/behavioral coding, experts often use specialized vocabulary with over-loaded meanings and jargon.
This is incredibly difficult for off-the-shelf NLP systems to parse through.

The common solution is to contextualize NLP models.
For instance, medical NLP has been greatly advanced with the advent of labeled, bio-specific datasets, which have domain-relevant named-entity tags and vocabulary sets.
The common solution is to contextualize and adapt NLP models to technical text -- Technical Language Processing (TLP)[@brundage2020technical].
For instance, medical research has been greatly advanced with the advent of labeled, bio-specific datasets, which have domain-relevant named-entity tags and vocabulary sets.
Unfortunately for analysts of these types of data, creating resources like this is incredibly time consuming.
This is where `nestor` comes in.

### Why Maintenance and Manufacturing?

A reader may notice a heavy focus on maintenance and manufacturing in the Nestor documentation and design.
While this is a common problem in technical domains, generally, Nestor got its start in manufacturing data analysis.
A large amount of maintenance data is *already* available for use in advanced manufacturing systems, but in a currently-unusable form: service tickets and maintenance work orders (MWOs).

For further reading, see [@sexton2017hybrid] [@sharp2017toward] [@brundage2020technical].

## Quick Links

- [Get started](getting-started.md)
@@ -34,13 +45,7 @@ This application was originally designed to help manufacturers "tag" their maint
The goal is to help build context-rich labels in data sets that previously were too unstructured or filled with jargon to analyze.
The current build is in very early alpha, so please be patient in using this application. If you have any questions, please do not hesitate to contact us (see [Who are we?](#who-are-we). )

### Why?

There is often a large amount of maintenance data *already* available for use in Smart Manufacturing systems, but in a currently-unusable form: service tickets and maintenance work orders (MWOs).
**Nestor** is a toolkit for using Natural Language Processing (NLP) with efficient user-interaction to perform structured data extraction with minimal annotation time-cost.
For further reading, see [@sexton2017hybrid] [@sharp2017toward].

### Features

- Rank keywords found in your data by importance, saving you time
- Suggest term unification by similarity (e.g. spelling), for quick review
@@ -51,26 +56,26 @@ For further reading, see [@sexton2017hybrid] [@sharp2017toward].

Planned:

- Customizable entity types and rules
- export to NER training formats
- command-line app and REST API
- Customizable entity types and rules,
- Export to NER training formats,
- Command-line app and REST API.


## Who are we?

This toolkit is a part of the Knowledge Extraction and Application for Smart Manufacturing (KEA) project, within the Systems Integration Division at NIST.
This toolkit is a part of the [Knowledge Extraction and Application for Smart Manufacturing (KEA)](https://www.nist.gov/programs-projects/knowledge-extraction-and-application-manufacturing-operations) project, within the [Systems Integration Division](https://www.nist.gov/el/systems-integration-division-73400) at NIST.


### Projects that use Nestor

- Various [Nestor GUIs](gui-links.md)
- [nestor exploratory data analysis](https://github.com/usnistgov/nestor-eda) (dashboard, viz, etc.)
- Various [Nestor GUIs](gui-links.md): ways to use the full human-centered Nestor workflow in a user-interface.
- [`nestor-eda`](https://github.com/usnistgov/nestor-eda): (exploratory data analysis): things to do with Nestor-annotated data (dashboard, viz, etc.)


### Points of Contact
- Email the dev team at <[email protected]>
- [Thurston Sexton](https://www.nist.gov/people/thurston-sexton) [@tbsexton](https://github.com/tbsexton) Nestor Technical Lead
- [Michael Brundage](https://www.nist.gov/people/michael-p-brundage) Principal Investigator
- Email the development team at <[email protected]>
- [Thurston Sexton](https://www.nist.gov/people/thurston-sexton) [@tbsexton](https://github.com/tbsexton) Nestor Technical Lead, Associate Project Leader
- [Michael Brundage](https://www.nist.gov/people/michael-p-brundage) Project Leader


### Why "KEA"?
@@ -85,4 +90,7 @@ Plugins are installed as development dependencies through poetry (e.g. `taskipy`

Notebooks should be kept nicely git-friendly with [Jupytext](https://github.com/mwouts/jupytext)

## Other Tools/Resources
Know of other tools? Or want to find resources similar to Nestor? A community-driven [TLP Community of Interest (COI)](https://www.nist.gov/el/technical-language-processing-community-interest) has been created to provide publicly available resources to the community. Check out our [awesomelist](https://github.com/TLP-COI/awesome-tlp).

\bibliography