Workflow
Ricardo Oliveira edited this page Jun 20, 2022 · 5 revisions
The harvesting workflow consists of the following steps:
- For each publisher, we fetch all new articles from their associated source. Some publishers store them on FTP servers, whereas others expose them through a REST API.
- Everything is then pushed to our S3 server in the publisher's bucket.
- The publisher-specific parsing extracts all information from the article and converts it into our internal format.
- The generic parsing restructures the internal format and cleans some data.
- The enhancement adds additional fields based on existing fields, for example the creation date of the parsed article.
- The enrichment adds additional fields based on external sources, for example arXiv categories.
- The final article is then verified to ensure that it is compliant with the JSON Schema. The JSON article is then pushed to PSQL.
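
The parsing, enhancement, enrichment and verification steps above can be sketched as a chain of small functions. This is only an illustrative sketch, not the project's actual code: every function name, the field names (`ArticleTitle`, `ArxivId`, `record_creation_date`, `categories`) and the `lookup` callable are assumptions made for the example. The fetching, S3 and PSQL steps are left out, and the verification is a simple required-fields check standing in for real JSON Schema validation.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List

Article = Dict[str, object]

# Hypothetical list of mandatory fields; the real pipeline uses a JSON Schema.
REQUIRED_FIELDS = ("title", "record_creation_date")


def publisher_parse(raw: Dict[str, object]) -> Article:
    """Publisher-specific parsing: map publisher field names to internal ones."""
    return {"title": raw["ArticleTitle"], "arxiv_id": raw.get("ArxivId")}


def generic_parse(article: Article) -> Article:
    """Generic parsing: clean the data (here, strip whitespace from strings)."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in article.items()}


def enhance(article: Article) -> Article:
    """Enhancement: add a field that needs no external source."""
    created = datetime.now(timezone.utc).isoformat()
    return {**article, "record_creation_date": created}


def enrich(article: Article, lookup: Callable[[str], List[str]]) -> Article:
    """Enrichment: add a field obtained from an external source.

    `lookup` is injected so that the external service can be replaced in tests.
    """
    arxiv_id = article.get("arxiv_id")
    if isinstance(arxiv_id, str) and arxiv_id:
        return {**article, "categories": lookup(arxiv_id)}
    return article


def validate(article: Article) -> Article:
    """Verification: reject articles that lack a mandatory field."""
    missing = [field for field in REQUIRED_FIELDS if field not in article]
    if missing:
        raise ValueError(f"article is missing required fields: {missing}")
    return article


def process(raw: Dict[str, object], lookup: Callable[[str], List[str]]) -> Article:
    """Run all steps in the order described above."""
    parsed = generic_parse(publisher_parse(raw))
    return validate(enrich(enhance(parsed), lookup))
```

Passing the external lookup as a parameter keeps the enrichment step testable without network access, for example `process({"ArticleTitle": " A title "}, lambda arxiv_id: [])`.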