Changelog

Unreleased

Fixed

  • Prevent section titles from capturing surrounding tokens, causing overlaps (#113)

v0.6.2 (2022-08-02)

Added

  • New SimstringMatcher matcher to perform fuzzy term matching, and algorithm parameter in terminology components and eds.matcher component
  • Makefile to install and test the application, and to view the documentation

Changed

  • Add consultation date pattern "CS", and False Positive patterns for dates (namely phone numbers and pagination).
  • Update the eds.TNM score pipeline: it can now return a dictionary whose values are either str or int

Fixed

  • Add new patterns to the negation qualifier
  • Numpy header issues with binary distributed packages
  • Simstring dependency on Windows

v0.6.1 (2022-07-11)

Added

  • Now possible to provide regex flags when using the RegexMatcher
  • New ContextualMatcher pipe, aiming at replacing the AdvancedRegex pipe.
  • New as_ents parameter for eds.dates, to save detected dates as entities

Changed

  • Faster eds.sentences pipeline component with Cython
  • Bump version of Pydantic in requirements.txt to 1.8.2 to handle an incompatibility with the ContextualMatcher
  • Optimise space requirements by using .csv.gz compression for verbs

Fixed

  • eds.sentences behaviour with dot-delimited dates (eg 02.07.2022, which counted as three sentences)
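To illustrate the issue, here is a minimal sketch of a dot-based splitter that does not break on dot-delimited dates. It is a hypothetical stand-alone example using only the standard library; the function name `split_sentences` and the regular expression are assumptions for illustration, and this is not the Cython implementation used by eds.sentences.

```python
import re

# Hypothetical rule: a dot ends a sentence only when it is NOT followed by a digit,
# so the dots inside "02.07.2022" are left alone.
SENTENCE_END = re.compile(r"\.(?!\d)\s*")


def split_sentences(text):
    """Split text on sentence-final dots, keeping dot-delimited dates whole."""
    return [part for part in SENTENCE_END.split(text) if part]


print(split_sentences("Vu le 02.07.2022. Pas de fièvre."))
```

A naive split on every dot would cut the date into separate pieces; the negative lookahead is one simple way to avoid that.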

v0.6.0 (2022-06-17)

Added

  • Complete revamp of the measurements detection pipeline, with better parsing and more exhaustive matching
  • Add new functionality to the method Span._.date.to_datetime() to return a result inferred from context for those cases with missing information.
  • Force a batch size of 2000 when distributing a pipeline with Spark
  • New patterns to pipeline eds.dates to identify cases where only the month is mentioned
  • New eds.terminology component for generic terminology matching, using the kb_id_ attribute to store fine-grained entity label
  • New eds.cim10 terminology matching pipeline
  • New eds.drugs terminology pipeline that maps brand names and active ingredients to a unique ATC code

v0.5.3 (2022-05-04)

Added

  • Support for strings in the example utility
  • TNM detection and normalisation with the eds.TNM pipeline
  • Support for arbitrary callback for Pandas multiprocessing, with the callback argument

v0.5.2 (2022-04-29)

Added

  • Support for chained attributes in the processing pipelines
  • Colour utility with the category20 colour palette

Fixed

  • Correct a REGEX on the date detector (both nov and nov. are now detected, as all other months)
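As an illustration of this kind of fix, a month abbreviation can be matched with an optional trailing dot. The pattern below is a hypothetical sketch written for this example (the name `NOV` and the helper `find_months` are assumptions), not the expression used by the date detector.

```python
import re

# Hypothetical pattern: "nov" with an optional trailing dot,
# not followed by a word character (so "novembre" is not matched here).
NOV = re.compile(r"\bnov\.?(?!\w)", re.IGNORECASE)


def find_months(text):
    """Return every abbreviated-November mention found in text."""
    return [match.group(0) for match in NOV.finditer(text)]


print(find_months("vu le 3 nov 2021 puis le 5 nov. 2021"))
```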

v0.5.1 (2022-04-11)

Fixed

  • Updated Numpy requirements to be compatible with the EDSPhraseMatcher

v0.5.0 (2022-04-08)

Added

  • New eds language to better fit French clinical documents and improve speed
  • Testing for markdown codeblocks to make sure the documentation is actually executable

Changed

  • Complete revamp of the date detection pipeline, with better parsing and more exhaustive matching
  • Reimplementation of the EDSPhraseMatcher in Cython, leading to a 15x speed increase

v0.4.4

  • Add measures pipeline
  • Cap Jinja2 version to fix mkdocs
  • Adding the possibility to add context in the processing module
  • Improve the speed of char replacement pipelines (accents and quotes)
  • Improve the speed of the regex matcher

v0.4.3

  • Fix regex matching on spans.
  • Add fast_parse in date pipeline.
  • Add relative_date information parsing

v0.4.2

  • Fix issue with dateparser library (see scrapinghub/dateparser#1045)
  • Fix attr issue in the advanced-regex pipeline
  • Add documentation for eds.covid
  • Update the demo with an explanation for the regex

v0.4.1

  • Added support for Koalas DataFrames in the edsnlp.processing pipe.
  • Added eds.covid NER pipeline for detecting COVID19 mentions.

v0.4.0

  • Profound rewrite of the normalisation:
    • The custom attribute CUSTOM_NORM is completely abandoned in favour of a more spacyfic alternative
    • The normalizer pipeline modifies the NORM attribute in place
    • Other pipelines can modify the Token._.excluded custom attribute
  • EDS regex and term matchers can ignore excluded tokens during matching, effectively adding a second dimension to normalisation (choice of the attribute and possibility to skip pollution tokens regardless of the attribute)
  • Matching can be performed on custom attributes more easily
  • Qualifiers are regrouped together within the edsnlp.qualifiers submodule, the inheritance from the GenericMatcher is dropped.
  • edsnlp.utils.filter.filter_spans now accepts a label_to_remove parameter. If set, only corresponding spans are removed, along with overlapping spans. Primary use-case: removing pseudo cues for qualifiers.
  • Generalise the naming convention for extensions, which keep the same name as the pipeline that created them (eg Span._.negation for the eds.negation pipeline). The previous convention is kept for now, but calling it issues a warning.
  • The dates pipeline underwent some light formatting to increase robustness and fix a few issues
  • A new consultation_dates pipeline was added, which looks for dates preceded by expressions specific to consultation dates
  • In rule-based processing, the terms.py submodule is replaced by patterns.py to reflect the possible presence of regular expressions
  • Refactoring of the architecture:
    • pipelines are now regrouped by type (core, ner, misc, qualifiers)
    • matchers submodule contains RegexMatcher and PhraseMatcher classes, which interact with the normalisation
    • multiprocessing submodule contains spark and local multiprocessing tools
    • connectors contains Brat, OMOP and LabelTool connectors
    • utils contains various utilities
  • Add entry points to make pipeline usable directly, removing the need to import edsnlp.components.
  • Add an eds namespace for components: for instance, negation becomes eds.negation. Using the former pipeline name still works, but issues a deprecation warning.
  • Add 3 score pipelines related to emergency
  • Add a helper function to use a spaCy pipeline as a Spark UDF.
  • Fix alignment issues in RegexMatcher
  • Change the alignment procedure, dropping clumsy numpy dependency in favour of bisect
  • Change the name of eds.antecedents to eds.history. Calling eds.antecedents still works, but issues a deprecation warning and support will be removed in a future version.
  • Add an eds.covid component that identifies mentions of COVID
  • Change the demo, to include NER components
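The bisect-based alignment mentioned above can be sketched as follows. This is an assumption-laden illustration using only the standard library: the function name `char_span_to_token_span` and its signature are hypothetical and do not reflect the actual RegexMatcher internals. The idea is that, given sorted token boundaries, a character span found by a regular expression can be mapped to a token span with two binary searches.

```python
from bisect import bisect_left, bisect_right


def char_span_to_token_span(token_starts, token_ends, start_char, end_char):
    """Map a character span [start_char, end_char) to a token span.

    token_starts and token_ends are sorted lists of character offsets,
    one entry per token. The returned (first, last) pair is expanded so
    that partially covered tokens are included; last is exclusive.
    """
    # Last token starting at or before start_char.
    first = bisect_right(token_starts, start_char) - 1
    # First token ending at or after end_char, made exclusive.
    last = bisect_left(token_ends, end_char) + 1
    return max(first, 0), min(last, len(token_ends))


# Tokens of "pas de fièvre": "pas" (0-3), "de" (4-6), "fièvre" (7-13).
starts, ends = [0, 4, 7], [3, 6, 13]
print(char_span_to_token_span(starts, ends, 4, 13))
```

Each lookup is logarithmic in the number of tokens and needs no array library.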

v0.3.2

  • Major revamp of the normalisation.
    • The normalizer pipeline now adds atomic components (lowercase, accents, quotes, pollution & endlines) to the processing pipeline, and compiles the results into a new Doc._.normalized extension. The latter is itself a spaCy Doc object, wherein tokens are normalised and pollution tokens are removed altogether. Components that match on the CUSTOM_NORM attribute process the normalized document, and matches are brought back to the original document using a token-wise mapping.
    • Update the RegexMatcher to use the CUSTOM_NORM attribute
    • Add an EDSPhraseMatcher, wrapping spaCy's PhraseMatcher to enable matching on CUSTOM_NORM.
    • Update the matcher and advanced pipelines to enable matching on the CUSTOM_NORM attribute.
  • Add an OMOP connector, to help go back and forth between OMOP-formatted pandas dataframes and spaCy documents.
  • Add a reason pipeline, that extracts the reason for visit.
  • Add an endlines pipeline, that classifies newline characters as either spaces or actual ends of line.
  • Add possibility to annotate within entities for qualifiers (negation, hypothesis, etc), ie if the cue is within the entity. Disabled by default.
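The normalise-then-map-back idea described in the first item above can be sketched with plain Python lists. Everything here is a hypothetical illustration (the helpers `normalise`, `build_normalised` and `find_term`, and the pollution rule, are assumptions); the library itself works on spaCy Doc objects.

```python
import unicodedata


def normalise(token):
    """Lowercase a token and strip its accents."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def build_normalised(tokens, is_pollution):
    """Return the normalised tokens and, for each, its index in the original."""
    norm, mapping = [], []
    for index, token in enumerate(tokens):
        if is_pollution(token):
            continue  # pollution tokens are dropped from the normalised view
        norm.append(normalise(token))
        mapping.append(index)
    return norm, mapping


def find_term(tokens, term, is_pollution):
    """Match term on the normalised view; return the span in the original tokens."""
    norm, mapping = build_normalised(tokens, is_pollution)
    size = len(term)
    for i in range(len(norm) - size + 1):
        if norm[i:i + size] == term:
            return mapping[i], mapping[i + size - 1] + 1
    return None


tokens = ["Pas", "de", "=====", "Fièvre"]
print(find_term(tokens, ["de", "fievre"], lambda t: set(t) == {"="}))
```

The returned span covers the pollution token in the original sequence, since the match is brought back through the token-wise mapping.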

v0.3.1

  • Update dates to remove miscellaneous bugs.
  • Add isort pre-commit hook.
  • Improve performance for negation, hypothesis, antecedents, family and rspeech by using spaCy's filter_spans and our consume_spans methods.
  • Add proposition segmentation to hypothesis and family, enhancing results.

v0.3.0

  • Renamed generic to matcher. This is a non-breaking change for the average user; adding the pipeline is still:

    nlp.add_pipe("matcher", config=dict(terms=dict(maladie="maladie")))
  • Removed quickumls pipeline. It was untested and unmaintained. Will be added back in a future release.

  • Add score pipeline, and charlson.

  • Add advanced-regex pipeline

  • Corrected bugs in the negation pipeline

v0.2.0

  • Add negation pipeline
  • Add family pipeline
  • Add hypothesis pipeline
  • Add antecedents pipeline
  • Add rspeech pipeline
  • Refactor the library:
    • Remove the rules folder
    • Add a pipelines folder, containing one subdirectory per component
    • Every component subdirectory contains a module defining the component, and a module defining a factory, plus any other utilities (eg terms.py)

v0.1.0

First working version. Available pipelines:

  • section
  • sentences
  • normalization
  • pollution