Changelog

[2.15.0] - 2023-01-23

  • Added ingestum-generate-envelope tool.
  • Added default timeout to all HTTP requests created through create_request.

[2.14.1] - 2023-01-11

  • Fixed PubMed error handling to cover all request errors.

[2.14.0] - 2022-12-13

  • Added Styling and Dimensions data to Passage document type.
  • Fixed documentation warnings.

[2.13.2] - 2022-12-05

  • Fixed PubMed breaking on malformed ESearch response.
  • Fixed PDF source intermittent import error.

[2.13.1] - 2022-12-02

  • Fixed google-cloud-storage requirements versions.

[2.13.0] - 2022-11-28

  • Added PubMed new client that is more robust and matches web results.
  • Fixed PubMed text pipeline to not break on empty results.
  • Removed PubMed Entrezpy requirement.

[2.12.0] - 2022-11-07

  • Added ingestum-generate-manifest-from-xls.
  • Fixed bioRxiv leaking queries on error.
  • Fixed bioRxiv rare backend issues with best-effort retry.
  • Fixed EuropePMC rare backend issues with best-effort retry.

[2.11.4] - 2022-10-28

  • Fixed verbosity about what is currently being ingested.

[2.11.3] - 2022-10-27

  • Fixed handling of other network issues with PubMed.

[2.11.2] - 2022-10-26

  • Fixed flakiness of PubMed backend with best-effort retry.
  • Fixed handling of failures in PubMed by always failing the pipeline.

[2.11.1] - 2022-10-24

  • Fixed handling of malformed PubMed data.

[2.11.0] - 2022-10-24

  • Added support for optional PubMed API keys.

[2.10.1] - 2022-10-21

  • Fixed PubMed exiting the whole process.

[2.10.0] - 2022-10-17

  • Added standardized pagination data for PubMed.
  • Fixed all pipeline names to match their script names.
  • Fixed typos in tools documentation examples.

[2.9.0] - 2022-09-30

  • Added context propagation from manifests to all output documents.
  • Added a new tools page to the documentation.
  • Fixed exclude_artifact manifest option to be explicitly optional.
  • Changed the manifests' get_source implementation to simplify propagation of new fields.

[2.8.0] - 2022-09-16

  • Added source field to BaseDocument to propagate original Source URI.
  • Fixed deprecated regex patterns.

[2.7.2] - 2022-09-06

  • Fixed issue with non-stripped data in XML dates.

[2.7.1] - 2022-09-01

  • Fixed issue detecting names from GitLab files.

[2.7.0] - 2022-09-01

  • Added NLTK words dictionary for better dehyphenation.
  • Added manifest option to exclude artifacts from destination.
  • Added run_refs_only to the Engine API to reduce memory consumption.
  • Changed PyPDF2 version to 1.26.0.
  • Removed Workers internal API.

[2.6.1] - 2022-08-08

  • Added missing pipelines for bioRxiv.
  • Added missing PubMed date formats.
  • Fixed logging date formats that are not properly handled.

[2.6.0] - 2022-07-25

  • Added standardized pagination data for bioRxiv.
  • Added standardized pagination data for EuropePMC.
  • Changed PubMed search query to allow querying without any search term.
  • Changed bioRxiv search query to allow querying without any search term.
  • Changed EuropePMC search query to allow querying without any search term.

[2.5.0] - 2022-07-11

  • Added info field to ingestum-envelope's json output.
  • Added PubMed transformers to keep and process raw data separately.
  • Added bioRxiv transformers to keep and process raw data separately.
  • Added EuropePMC transformers to keep and process raw data separately.
  • Added entrez_date to EuropePMC publication document.
  • Fixed chardet version to 3.0.4.
  • Changed EuropePMC date filtering to entrez_date.

[2.4.0] - 2022-06-27

  • Added figures and tables data in PubMed's full text.
  • Added ingestum-install-plugins to manage dependencies without re-installing Ingestum.
  • Fixed EuropePMC ingestion breaking on individual article errors.
  • Changed EuropePMC to a bigger PageSize for pagination.

[2.3.0] - 2022-06-13

  • Fixed publication_date for PubMed based on PubModel data.
  • Fixed removal of punctuation from PubMed full-text data.
  • Fixed journal data for preprints in EuropePMC.
  • Fixed regression with abstracts recursive tags in EuropePMC.
  • Added ingestum-envelope to process full ingestion envelopes.

[2.2.2] - 2022-05-31

  • Fixed default value for list arguments in ingestum-generate-manifest.
  • Fixed handling of missing objects in DOCX transformer.

[2.2.1] - 2022-05-21

  • Fixed EuropePMC infinite loop condition.

[2.2.0] - 2022-05-19

  • Added support for abstract_title search, sort and direction for bioRxiv.
  • Added support for excluding sensitive arguments from documents context.
  • Fixed EuropePMC abstract formatting.
  • Fixed EuropePMC title formatting.
  • Fixed EuropePMC publication_type formatting.
  • Fixed leaking queries and search terms to documents context.
  • Fixed leaking queries and search terms to debug logs.

[2.1.0] - 2022-05-04

  • Added support for PPTX as a source.
  • Added memory instrumentation to ingestum-manifest.
  • Fixed concepts, examples and typos in our documentation.
  • Fixed scripts' default argument values to match the sources' default values.
  • Fixed consistency of beautifulsoup's find_all method usage.
  • Fixed ingestum-pipeline default argument value for lists.

[2.0.3] - 2022-04-18

  • Fixed PubMed abstracts to remove alternative language versions.

[2.0.2] - 2022-04-11

  • Fixed click version to 7.1.2.
  • Changed black version back to 20.8b1.
  • Changed typing-extensions version back to 3.7.4.3.
  • Changed requests version to 2.25.1.

[2.0.1] - 2022-04-07

  • Added journal abbreviation to the publication document type.
  • Fixed bioRxiv missing publication date.
  • Fixed bioRxiv missing publication type.
  • Fixed bioRxiv breaking on malformed data.
  • Fixed bioRxiv abstract subtitles formatting.
  • Fixed PDF page-number parsing logic.
  • Fixed LitCovid tests.
  • Changed black version to 22.3.0.
  • Changed typing-extensions version to 3.10.0.2.

[2.0.0] - 2021-12-16

  • Added support for multiple source locations.
  • Added support for multiple artifact destinations.
  • Added support for Google data lake.
  • Added support for publications as a document type.
  • Added support for LitCovid as a source.
  • Added support for bioRxiv as a source.
  • Added support for medRxiv as a source.
  • Added support for EuropePMC as a source.
  • Added support for multiple plugins directories.
  • Added support for multithreaded processing.
  • Added table extraction transformers based on text markers.
  • Added hybrid PDF ingestion.
  • Added hybrid PDF tables extraction.
  • Added dynamic argument parsing to ingestum-pipeline tool.
  • Added from_date and to_date arguments to all literature monitoring transformers.
  • Added articles count parameter to Reddit transformer.
  • Added ingestum-generate-manifest tool.
  • Fixed CSV parsing issues.
  • Fixed non-ASCII characters in output documents.
  • Fixed sub-classing document types in plugins.
  • Fixed PubMed transformers to handle missing hours attribute.
  • Fixed ProQuest unicode issues.
  • Fixed searching on documentation.
  • Changed output documents to be unformatted to save storage space.
  • Changed camelot-py version to 0.10.1.
  • Changed praw version to 7.4.0.
  • Changed Twitter back-end to tweepy.
  • Changed pipeline names from 'excel' to 'xls'.
  • Changed plugins folder structure to simplify plugin manager.
  • Changed base operating system to Ubuntu 20.04 LTS.
  • Removed CSV document type and related transformers.

[1.3.0] - 2021-07-01

  • Added a new "layout" argument to PDFSourceToTextDocument* transformers.
  • Added Reddit source and transformer.
  • Added "origin" attribute to all document types.
  • Added support for recursive conditionals.
  • Added tool to merge collection documents.
  • Added support for Docker as a development environment.
  • Added more details to the API documentation.
  • Fixed transformers, documents and conditionals sub-classing.
  • Fixed missing mimetype for .xlsx files.
  • Fixed import attempts for unnecessary files in plugin directories.
  • Fixed loading and deserializing manifest sources plugins.
  • Fixed issue with pipeline messing with transformers when used multiple times.
  • Changed logging format to JSON.
  • Changed tests to pytest.
  • Changed PubmedSourceCreate* transformer to use the official entrezpy library.
  • Changed PubmedSourceCreate* "hours" argument to be optional.
  • Changed PubmedSourceCreate* to use EDAT dates by default.
  • Changed to distro-packaged LibreOffice installation.
  • Changed artifacts IDs to be randomized.

[1.2.1] - 2021-03-29

  • Fixed extracting sources mimetypes.
  • Fixed OR vs AND syntax in PubMed queries.
  • Fixed deserializing pipelines with recursive transformers.

[1.2.0] - 2021-03-19

  • Added support for converting HTML to Image source.
  • Added support for converting XLS to Image source.
  • Added support for converting DOCX to Image source.
  • Added support for unspecified number of pages in PDF transformers.
  • Added support for recursive transformers, e.g. for collections of collections.
  • Added support for context metadata in all document formats.
  • Added support for types field in Passage document metadata.
  • Added support for toolbox containers as development environment.
  • Added ingestum-migrate for existing ingestum documents, e.g. for testing outputs.
  • Added debug logging calls to all transformers.
  • Added debug logging calls to all time critical pipeline steps.
  • Fixed opening the same PDF repeatedly for metadata.
  • Fixed handling empty PDF pages.
  • Fixed source downloading cache.
  • Changed to a richer PubMed API.

[1.1.0] - 2021-02-11

  • Added support for DOCX sources.
  • Added support for PubMed sources.
  • Added support for crop area in PDF source transformer.
  • Added support for table extraction from images.
  • Added support for PDF unstructured forms.
  • Added option to disable PDF column layout detection.
  • Added more examples to documentation.
  • Added filters to XML source transformer to reduce noise in output text.
  • Fixed PDF column extraction for more complex layouts.
  • Fixed error on Text document tokenizer not handling empty lists.
  • Fixed error message for non-existent files.
  • Fixed errors while ingesting PDF with protections enabled.
  • Fixed errors while ingesting PDF with watermarks.
  • Fixed errors while ingesting PDF with noisy shape data.
  • Fixed errors with HTML parser by switching to lxml.
  • Fixed memory consumption in PDF OCR transformers.
  • Updated documentation.
  • Updated pyexcel requirement to v0.6.6.

[1.0.2] - 2020-12-10

  • Fixed paragraph detection for PDF and OCR pipelines.
  • Fixed dealing with PDF images extraction with noisy data.

[1.0.1] - 2020-11-30

  • Changed version of requests to 2.24.0.

[1.0.0] - 2020-11-30

  • Initial release.