Skip to content

0.4.6

Compare
Choose a tag to compare
@MthwRobinson MthwRobinson released this 03 Feb 22:15
· 1518 commits to main since this release
014585e

0.4.6

  • Loosen the default cap threshold to 0.5.
  • Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling
    the cap ratio threshold.
  • Unknown text elements are identified as Text for HTML and plain text documents.
  • Body Text styles no longer default to NarrativeText for Word documents. The style information
    is insufficient to determine that the text is narrative.
  • Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
  • Adds an Address element for capturing elements that only contain an address.
  • Suppress the UserWarning when detectron is called.
  • Checks that titles and narrative test have at least one English word.
  • Checks that titles and narrative text are at least 50% alpha characters.
  • Restricts titles to a maximum word length. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH
    environment variable for controlling the max number of words in a title.
  • Updated partition_pptx to order the elements on the page