Workflow

The harvesting process consists of the following steps:

For each publisher, we fetch all new articles from the associated source. Some publishers store them on FTP servers, whereas others expose them through a REST API.
Everything is then pushed to our S3 server in the publisher's bucket.
The publisher-specific parsing extracts all information from the article and converts it to our internal format.
The generic parsing restructures the internal format and cleans some data.
The enhancement adds fields derived from existing fields. For example, it adds the creation date of the parsed article.
The enrichment adds fields obtained from external sources. For example, it adds arXiv categories.
The final article is then validated to ensure that it is compliant with the JSON Schema. The JSON article is then pushed to PSQL.
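
The parsing, enhancement, enrichment, and validation steps above can be sketched as a chain of small functions. This is a minimal illustration and not the project's actual code: every function name, the raw field names (`ArticleTitle`, `ArXivId`), and the internal field names are hypothetical. Fetching, the S3 upload, and the final database write are left out, the external arXiv lookup is passed in as a plain callable, and the validation only checks required fields as a simplified stand-in for real JSON Schema validation.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def publisher_parse(raw: dict) -> dict:
    """Publisher-specific parsing: map raw fields (hypothetical names) to the internal format."""
    return {"title": raw["ArticleTitle"], "arxiv_id": raw.get("ArXivId")}


def generic_parse(article: dict) -> dict:
    """Generic parsing: clean data, here by collapsing repeated whitespace in the title."""
    cleaned = dict(article)
    cleaned["title"] = " ".join(cleaned["title"].split())
    return cleaned


def enhance(article: dict, now: Optional[datetime] = None) -> dict:
    """Enhancement: add fields that need no external source, such as the creation date."""
    enhanced = dict(article)
    enhanced["created"] = (now or datetime.now(timezone.utc)).isoformat()
    return enhanced


def enrich(article: dict, lookup_categories: Callable[[str], list]) -> dict:
    """Enrichment: add fields from an external source, injected here as a callable."""
    enriched = dict(article)
    if enriched.get("arxiv_id"):
        enriched["categories"] = lookup_categories(enriched["arxiv_id"])
    return enriched


# Hypothetical required fields; the real schema is defined elsewhere.
REQUIRED_FIELDS = ("title", "created")


def validate(article: dict) -> dict:
    """Simplified stand-in for JSON Schema validation: only checks required fields."""
    missing = [field for field in REQUIRED_FIELDS if field not in article]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return article


def run_pipeline(raw: dict, lookup_categories: Callable[[str], list]) -> dict:
    """Run the steps in the order described above and return the validated article."""
    article = publisher_parse(raw)
    article = generic_parse(article)
    article = enhance(article)
    article = enrich(article, lookup_categories)
    return validate(article)
```

Passing the category lookup as an argument keeps the sketch runnable without network access; in practice it would be replaced by a call to the external service.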

