This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

First sketch of the overall pipeline (+ research relevant tools/approaches) #562

Closed
jankrepl opened this issue Jan 31, 2022 · 2 comments · Fixed by #571
Labels
🗄️ database Creation and maintenance of a database of scientific literature

Comments

@jankrepl
Contributor

jankrepl commented Jan 31, 2022

🚀 Feature

We are getting close to implementing all the relevant steps of our ETL pipeline. IMO it would be beneficial to create a first sketch of the overall pipeline and to look into tools that might be relevant.

Some ideas to discuss:
a) What logic/tool do we use to define the pipeline (raw shell script, DVC-like tools, ...)?
b) How do we trigger it (cronjob, manually, ...)?
c) How do we monitor the progress of a running pipeline, ideally live (the way GitHub Actions or Jenkins do)?
d) How do we test it (e.g. by extending the current integration tests, related to #532)?
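
To make questions a) to c) a bit more concrete, here is a minimal sketch of the "no framework" end of the spectrum: a plain Python runner that executes stages in order, logs progress, and reports a status per stage. The names `run_pipeline`, `extract`, `transform`, and `load` are hypothetical placeholders, not actual steps of our pipeline, and only the standard library is assumed.

```python
import logging
import time
from typing import Callable, Dict, List, Tuple

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")


# Hypothetical stage functions; the real ETL steps are not specified here.
def extract() -> None:
    logger.info("pretending to download raw articles")


def transform() -> None:
    logger.info("pretending to parse the articles")


def load() -> None:
    logger.info("pretending to write to the database")


def run_pipeline(stages: List[Tuple[str, Callable[[], None]]]) -> Dict[str, str]:
    """Run stages in order, stop at the first failure, return a status per stage."""
    status = {name: "pending" for name, _ in stages}
    for name, func in stages:
        logger.info("starting stage %s", name)
        start = time.perf_counter()
        try:
            func()
        except Exception:
            # Log the traceback and leave the remaining stages as "pending".
            logger.exception("stage %s failed", name)
            status[name] = "failed"
            break
        status[name] = "done"
        logger.info("finished stage %s in %.2fs", name, time.perf_counter() - start)
    return status
```

Calling `run_pipeline([("extract", extract), ("transform", transform), ("load", load)])` from a small script would make it triggerable both manually and from a cronjob, and the log output gives a rough view of progress. It offers no retries, scheduling, or UI, which is what a dedicated workflow tool would add.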

@jankrepl jankrepl added the 🗄️ database Creation and maintenance of a database of scientific literature label Jan 31, 2022
@FrancescoCasalegno
Contributor

Two potential libraries for implementing the workflow:

  • Apache Airflow – A platform to programmatically author, schedule, and monitor workflows
  • Luigi – A Python module to build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc.
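
For context, the "dependency resolution" that both tools provide boils down to running steps in an order that respects a dependency graph. The sketch below illustrates only that idea with the standard library's `graphlib` (Python 3.9+). It uses neither the Airflow nor the Luigi API, and the step names in `DEPENDENCIES` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Hypothetical step names; each key maps to the set of steps it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "download": set(),
    "parse": {"download"},
    "extract_entities": {"parse"},
    "index": {"parse"},
    "report": {"extract_entities", "index"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the steps in an order where every step comes after its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step in execution_order(DEPENDENCIES):
        print(step)
```

Workflow tools add scheduling, retries, and a web UI on top of this ordering, which is the part that would be tedious to maintain by hand.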

@jankrepl
Contributor Author

jankrepl commented Feb 2, 2022

I just tried to deploy Apache Airflow locally (https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html), but I was not able to, possibly because of low RAM. I don't really feel like spending more time on it. Can anybody else try? Maybe we can try on our servers?
