Skip to content
This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

Luigi pipeline sketch #571

Merged
merged 85 commits into from
Mar 18, 2022
Merged

Luigi pipeline sketch #571

merged 85 commits into from
Mar 18, 2022

Conversation

jankrepl
Copy link
Contributor

@jankrepl jankrepl commented Feb 8, 2022

Fixes #562 and fixes #558

Description

Diagrams

Folder naming logic
Screenshot 2022-02-21 at 14 04 51

Pipeline steps
Screenshot 2022-02-18 at 11 56 47

Important design choices

  • The CLI has no positional arguments and instead it is using the concept of required options
  • The outputs (as shown in the diagram) will be put in the following folder $output_dir/$source/$date/
    • output_dir is specified via the CLI (--output-dir)
    • source is also specified via the CLI
    • date is a concatenation of the below two dates (using the _ separator for joining) (e.g. 2022-02_2022-02-15):
      • from-month (format %Y-%m) (specified via the CLI)
      • today (format %Y-%m-%d)
  • The PerformFilteringTask creates symlinks to those articles that made it through the filtering and puts these symlinks to filtered/. This makes it very easy for ParseTask to ingest them.
  • The mesh database needs to be provided via the CLI via --mesh-topic-db=mesh_topic_db.json -> The bbs_database parse-mesh-rdf is not included in the pipeline.
  • The code disables the central scheduler (via local_scheduler=True). It is mostly for debugging purposes. However, we might want to use the central scheduler in "production". See more info https://luigi.readthedocs.io/en/stable/central_scheduler.html
  • Luigi can be very verbose. Its logging level was hardcoded to WARNING. Feel free to modify it to have more information on what luigi is doing.

How to test?

First of all, initialize an empty database.

bbs_database init my_db.sqlite

Make sure you define the topic rules in rules_config.jsonl. If you want all articles to be accepted (debugging) just do

{"label": "accept"}

You can create a helper script launch_luigi.sh.

 bbs_database run \
     --source=pubmed \
     --from-month=2022-02 \
     --filter-config=rules_config.jsonl \
     --output-dir=pipeline_results/ \
     --mesh-topic-db=mesh_topic_db.json \
     --db-url=my_db.sqlite \
     --grobid-host=dgx1.bbp.epfl.ch \
     --grobid-port=8070 \
     $@

And you can then run luigi in 2 different ways
Dry run

./launch_luigi.sh -n
  • Shows what steps are still to do (PENDING) and which are done (COMPLETED).

Real run

./launch_luigi.sh

To speed up the pipelines (debugging purposes) consider creating the following scriptcustom_timeout.

timeout 15 $@ || ( [[ $? -eq 124 ]] && exit 0 )

You can then simply use it inside of ExternalProgramTask as the first argument. Very useful for DownloadTask. Make sure that the Python process can find it in the PATH environment variable.

TODO + To discuss

  • Write end to end tests (e.g. make the raw download as real as possible). Think of a way how to run "test" runs without having to download everything (maybe timeout but need to handle the nonzero exit code somehow)
    • The best solution was to timeout the DownloadTask so that there are not that many articles
  • Handle logging and verbosity (luigi vs our entrypoints)
    • For now we just automatically set the logging level of all entrypoints to logging.INFO. However, can be changed using the hardcoder "global" variable VERBOSITY
  • Make sure no unzipping needed for pubmed (there is a PR for this)
    • No unzipping needed for topic-extract
    • No unzipping needed for parse
  • Verify that biorxiv and medrxiv work end-to-end
  • Best way to provide optional parameters to our entrypoints (e.g. luigi.cfg)
    • Let's do this in a different PR
    • e.g. the number of processes for convert-pdf
  • Test on --db-type=postgres
  • bbs_database run --dry-run remove the parameters to make it more readable
    • The parameters are important.
  • Should we replace ExternalProgramTask with simple from bluesearch.entrypoint.database.? import run; run()? Maybe we don't need to replace it at all.
  • Refactor
  • Document
  • Try to make luigi history work
    • Let's do this in a different PR if useful
  • What happens if the pipeline does not run within 1 day?
    • Solved via a global variable that stores a unique folder for the entire process
  • Add luigi to requirements
  • Test with more realistic rules.json
  • Make it possible to stop early

Please provide here instructions on how to test the changes introduced by this PR.
(if some changes cannot be tested by automated tests)

Checklist

  • This PR refers to an issue from the issue tracker.
    (if it is not the case, please create an issue first).
  • Unit tests added.
    (if needed)
  • Documentation and whatsnew.rst updated.
    (if needed)
  • setup.py and requirements.txt updated with new dependencies.
    (if needed)
  • Type annotations added.
    (if a function is added or modified)
  • All CI tests pass.

Copy link
Contributor

@FrancescoCasalegno FrancescoCasalegno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jankrepl , thank you for this immense work!
I added a few comments, but I think the overall structure looks great!

src/bluesearch/entrypoint/database/run.py Show resolved Hide resolved
src/bluesearch/entrypoint/database/run.py Outdated Show resolved Hide resolved
src/bluesearch/entrypoint/database/run.py Outdated Show resolved Hide resolved
src/bluesearch/entrypoint/database/run.py Outdated Show resolved Hide resolved
src/bluesearch/entrypoint/database/run.py Outdated Show resolved Hide resolved
Copy link
Contributor

@FrancescoCasalegno FrancescoCasalegno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, amazing work with the refactoring!

I left two minor suggestions, then feel free to merge it for real 🔥

Comment on lines 524 to 536
if config_path:
if not pathlib.Path(config_path).exists():
raise ValueError(f"The configuration path {config_path} does not exist!")

config = luigi.configuration.get_config()
config.add_config_path(config_path)
config.reload()

if luigi_config:
config = luigi.configuration.get_config()
for param in luigi_config.split(","):
change = re.split(r"[.:]", param, maxsplit=3)
config.set(*change)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both config-path and luigi-config CLI arguments? I am not sure I understand the difference.

Also, isn't it easier/better to remove them and just rely on the env variable LUIGI_CONFIG_PATH (see luigi Docs)? This is for instance what BBP's mcar does, see e.g. how we are invoking it here in another project.
Then we would be able to run the entrypoint like this:

LUIGI_CONFIG_PATH=some/path/luigi.cfg bbs_database run [args...]

Copy link
Contributor

@EmilieDel EmilieDel Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • config-path is there to change the path to the configuration file.
  • luigi-config is to allow the user to overwrite a parameter (if needed). e.g. --luigi-config GlobalParams.source:arxiv.

If you prefer, I can remove those two arguments. Maybe just a side note related to this: I have the feeling it is much easier to test the complete pipeline with those two arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions:

  • Check what happens if a user provides config-path but then LUIGI_CONFIG_PATH is also set as an environmental variable. What has precedence?
  • Rename config-path as luigi-config-path + make parser helper more explicit (help="Configuration Path." --> help="Path to Luigi configuration file.")
  • Rename luigi-config as luigi-config-args + make parser helper more explicit (help="Configuration parameters." --> help="Comma separated key-value arguments for Luigi configuration, e.g. "--luigi-config GlobalParams.source:arxiv,DownloadTask.from-month:2021-04" . Overwrites the content of Luigi configuration file (see --luigi-config-path).")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0d343d3.
For info, the luigi-config-pathhas precedence over the environment variable.

luigi.cfg Outdated Show resolved Hide resolved
@EmilieDel EmilieDel merged commit b9027ab into master Mar 18, 2022
@EmilieDel EmilieDel deleted the luigi-sketch branch March 18, 2022 09:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants