- Prevent section titles from capturing surrounding tokens, causing overlaps (#113)
- New `SimstringMatcher` matcher to perform fuzzy term matching, and `algorithm` parameter in terminology components and `eds.matcher` component
- Makefile to install, test the application and see the documentation
- Add consultation date pattern "CS", and False Positive patterns for dates (namely phone numbers and pagination).
- Update the pipeline score `eds.TNM`. Now it is possible to return a dictionary where the results are either `str` or `int` values
- Add new patterns to the negation qualifier
- Numpy header issues with binary distributed packages
- Simstring dependency on Windows
- Now possible to provide regex flags when using the RegexMatcher
- New `ContextualMatcher` pipe, aiming at replacing the `AdvancedRegex` pipe.
- New `as_ents` parameter for `eds.dates`, to save detected dates as entities
- Faster `eds.sentences` pipeline component with Cython
- Bump version of Pydantic in `requirements.txt` to 1.8.2 to handle an incompatibility with the ContextualMatcher
- Optimise space requirements by using `.csv.gz` compression for verbs
- `eds.sentences` behaviour with dot-delimited dates (eg `02.07.2022`, which counted as three sentences)
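
To illustrate the dot-delimited date issue described above, here is a minimal sketch that is independent of the library. The functions `naive_split` and `guarded_split` are hypothetical names, and the lookahead rule is an assumption for illustration only, not the logic used by `eds.sentences`.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split on every period, which breaks dates such as 02.07.2022."""
    return [part.strip() for part in text.split(".") if part.strip()]


def guarded_split(text: str) -> list[str]:
    """Split on a period only when it is not immediately followed by a digit."""
    return [part.strip() for part in re.split(r"\.(?!\d)", text) if part.strip()]


text = "Vu le 02.07.2022. RAS."
naive = naive_split(text)      # the date is cut into three fragments
guarded = guarded_split(text)  # the date stays in one piece
```

The guarded version also protects decimal numbers, at the cost of missing a sentence boundary when a sentence starts with a digit right after the period.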
- Complete revamp of the measurements detection pipeline, with better parsing and more exhaustive matching
- Add new functionality to the method `Span._.date.to_datetime()` to return a result inferred from context for those cases with missing information.
- Force a batch size of 2000 when distributing a pipeline with Spark
- New patterns to pipeline `eds.dates` to identify cases where only the month is mentioned
- New `eds.terminology` component for generic terminology matching, using the `kb_id_` attribute to store fine-grained entity label
- New `eds.cim10` terminology matching pipeline
- New `eds.drugs` terminology pipeline that maps brand names and active ingredients to a unique ATC code
- Support for strings in the example utility
- TNM detection and normalisation with the `eds.TNM` pipeline
- Support for arbitrary callback for Pandas multiprocessing, with the `callback` argument
- Support for chained attributes in the `processing` pipelines
- Colour utility with the category20 colour palette
- Correct a REGEX on the date detector (both `nov` and `nov.` are now detected, as all other months)
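
As a rough illustration of the fix above, the sketch below matches an abbreviated month with or without its trailing dot. The pattern `MONTH_ABBREVIATION` and the helper `find_month_abbreviations` are hypothetical and only cover two months; the patterns actually shipped with the date detector may differ.

```python
import re

# Hypothetical pattern: an abbreviated month, optionally followed by a dot.
# The word boundary after the abbreviation prevents matching inside "novembre".
MONTH_ABBREVIATION = re.compile(r"\b(?:nov|déc)\b\.?", re.IGNORECASE)


def find_month_abbreviations(text: str) -> list[str]:
    """Return every abbreviated month found in the text, dot included if present."""
    return MONTH_ABBREVIATION.findall(text)


without_dot = find_month_abbreviations("Consultation le 3 nov 2021")
with_dot = find_month_abbreviations("Consultation le 3 nov. 2021")
full_name = find_month_abbreviations("Consultation en novembre")
```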
- Updated Numpy requirements to be compatible with the `EDSPhraseMatcher`
- New `eds` language to better fit French clinical documents and improve speed
- Testing for markdown codeblocks to make sure the documentation is actually executable
- Complete revamp of the date detection pipeline, with better parsing and more exhaustive matching
- Reimplementation of the EDSPhraseMatcher in Cython, leading to a x15 speed increase
- Add `measures` pipeline
- Cap Jinja2 version to fix mkdocs
- Adding the possibility to add context in the processing module
- Improve the speed of char replacement pipelines (accents and quotes)
- Improve the speed of the regex matcher
- Fix regex matching on spans.
- Add fast_parse in date pipeline.
- Add relative_date information parsing
- Fix issue with `dateparser` library (see scrapinghub/dateparser#1045)
- Fix `attr` issue in the `advanced-regex` pipeline
- Add documentation for `eds.covid`
- Update the demo with an explanation for the regex
- Added support to Koalas DataFrames in the `edsnlp.processing` pipe.
- Added `eds.covid` NER pipeline for detecting COVID19 mentions.
- Profound re-write of the normalisation:
  - The custom attribute `CUSTOM_NORM` is completely abandoned in favour of a more spacyfic alternative
  - The `normalizer` pipeline modifies the `NORM` attribute in place
  - Other pipelines can modify the `Token._.excluded` custom attribute
- EDS regex and term matchers can ignore excluded tokens during matching, effectively adding a second dimension to normalisation (choice of the attribute and possibility to skip pollution tokens regardless of the attribute)
- Matching can be performed on custom attributes more easily
- Qualifiers are regrouped together within the `edsnlp.qualifiers` submodule, the inheritance from the `GenericMatcher` is dropped.
- `edsnlp.utils.filter.filter_spans` now accepts a `label_to_remove` parameter. If set, only corresponding spans are removed, along with overlapping spans. Primary use-case: removing pseudo cues for qualifiers.
- Generalise the naming convention for extensions, which keep the same name as the pipeline that created them (eg `Span._.negation` for the `eds.negation` pipeline). The previous convention is kept for now, but calling it issues a warning.
- The `dates` pipeline underwent some light formatting to increase robustness and fix a few issues
- A new `consultation_dates` pipeline was added, which looks for dates preceded by expressions specific to consultation dates
- In rule-based processing, the `terms.py` submodule is replaced by `patterns.py` to reflect the possible presence of regular expressions
- Refactoring of the architecture:
  - pipelines are now regrouped by type (`core`, `ner`, `misc`, `qualifiers`)
  - `matchers` submodule contains `RegexMatcher` and `PhraseMatcher` classes, which interact with the normalisation
  - `multiprocessing` submodule contains `spark` and `local` multiprocessing tools
  - `connectors` contains `Brat`, `OMOP` and `LabelTool` connectors
  - `utils` contains various utilities
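
The `label_to_remove` behaviour described in the list above can be sketched without spaCy. This is one literal reading of "only corresponding spans are removed, along with overlapping spans": the `Span` dataclass and the function `drop_label_and_overlaps` are hypothetical stand-ins, and the ordering and tie-breaking rules of the real `filter_spans` are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a spaCy span: token offsets plus a label."""

    start: int  # index of the first token
    end: int    # index after the last token (exclusive)
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """Two half-open token ranges overlap if each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def drop_label_and_overlaps(spans: list[Span], label_to_remove: str) -> list[Span]:
    """Remove spans carrying the label, and every span overlapping one of them."""
    removed = [span for span in spans if span.label == label_to_remove]
    return [
        span
        for span in spans
        if span.label != label_to_remove
        and not any(overlaps(span, pseudo) for pseudo in removed)
    ]


spans = [
    Span(0, 3, "pseudo"),    # pseudo cue, to be discarded
    Span(1, 2, "negation"),  # overlaps the pseudo cue, discarded as well
    Span(5, 6, "negation"),  # unrelated cue, kept
]
kept = drop_label_and_overlaps(spans, label_to_remove="pseudo")
```

In the qualifier use-case, this keeps a genuine cue from firing when it sits inside a longer pseudo cue.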
- Add entry points to make pipeline usable directly, removing the need to import `edsnlp.components`.
- Add an `eds` namespace for components: for instance, `negation` becomes `eds.negation`. Using the former pipeline name still works, but issues a deprecation warning.
- Add 3 score pipelines related to emergency
- Add a helper function to use a spaCy pipeline as a Spark UDF.
- Fix alignment issues in RegexMatcher
- Change the alignment procedure, dropping clumsy `numpy` dependency in favour of `bisect`
- Change the name of `eds.antecedents` to `eds.history`. Calling `eds.antecedents` still works, but issues a deprecation warning and support will be removed in a future version.
- Add an `eds.covid` component, that identifies mentions of COVID
- Change the demo, to include NER components
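
For readers curious about the `bisect`-based alignment mentioned in the list above, here is a minimal sketch of the general idea: mapping the character offsets of a regex match back to token indices with binary search over sorted token boundaries. The helper names `token_boundaries` and `char_span_to_token_span` are hypothetical, tokens are approximated by whitespace splitting, and this is not the code used by the `RegexMatcher`.

```python
import re
from bisect import bisect_left, bisect_right


def token_boundaries(text: str) -> tuple[list[int], list[int]]:
    """Return sorted start and end character offsets of whitespace-delimited tokens."""
    matches = list(re.finditer(r"\S+", text))
    return [m.start() for m in matches], [m.end() for m in matches]


def char_span_to_token_span(
    starts: list[int], ends: list[int], start_char: int, end_char: int
) -> tuple[int, int]:
    """Map the character range [start_char, end_char) to a token range [first, last)."""
    # First token that ends strictly after the match starts.
    first = bisect_right(ends, start_char)
    # Number of tokens that start strictly before the match ends.
    last = bisect_left(starts, end_char)
    return first, last


text = "pas de covid"
starts, ends = token_boundaries(text)

match = re.search(r"de covid", text)
aligned = char_span_to_token_span(starts, ends, match.start(), match.end())

# A match that begins and ends mid-token is expanded to whole tokens.
partial = char_span_to_token_span(starts, ends, 5, 10)
```

Because both offset lists are sorted, each lookup is logarithmic in the number of tokens and needs nothing beyond the standard library.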
- Major revamp of the normalisation.
  - The `normalizer` pipeline now adds atomic components (`lowercase`, `accents`, `quotes`, `pollution` & `endlines`) to the processing pipeline, and compiles the results into a new `Doc._.normalized` extension. The latter is itself a spaCy `Doc` object, wherein tokens are normalised and pollution tokens are removed altogether. Components that match on the `CUSTOM_NORM` attribute process the `normalized` document, and matches are brought back to the original document using a token-wise mapping.
  - Update the `RegexMatcher` to use the `CUSTOM_NORM` attribute
  - Add an `EDSPhraseMatcher`, wrapping spaCy's `PhraseMatcher` to enable matching on `CUSTOM_NORM`.
  - Update the `matcher` and `advanced` pipelines to enable matching on the `CUSTOM_NORM` attribute.
- Add an OMOP connector, to help go back and forth between OMOP-formatted pandas dataframes and spaCy documents.
- Add a `reason` pipeline, that extracts the reason for visit.
- Add an `endlines` pipeline, that classifies newline characters as either spaces or actual ends of line.
- Add possibility to annotate within entities for qualifiers (`negation`, `hypothesis`, etc), ie if the cue is within the entity. Disabled by default.
- Update `dates` to remove miscellaneous bugs.
- Add `isort` pre-commit hook.
- Improve performance for `negation`, `hypothesis`, `antecedents`, `family` and `rspeech` by using spaCy's `filter_spans` and our `consume_spans` methods.
- Add proposition segmentation to `hypothesis` and `family`, enhancing results.
- Renamed `generic` to `matcher`. This is a non-breaking change for the average user, adding the pipeline is still: `nlp.add_pipe("matcher", config=dict(terms=dict(maladie="maladie")))`
- Removed `quickumls` pipeline. It was untested, unmaintained. Will be added back in a future release.
- Add `score` pipeline, and `charlson`.
- Add `advanced-regex` pipeline
- Corrected bugs in the `negation` pipeline
- Add `negation` pipeline
- Add `family` pipeline
- Add `hypothesis` pipeline
- Add `antecedents` pipeline
- Add `rspeech` pipeline
- Refactor the library:
  - Remove the `rules` folder
  - Add a `pipelines` folder, containing one subdirectory per component
  - Every component subdirectory contains a module defining the component, and a module defining a factory, plus any other utilities (eg `terms.py`)
First working version. Available pipelines:

- `section`
- `sentences`
- `normalization`
- `pollution`