This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

Luigi pipeline sketch #571

Merged
merged 85 commits into from
Mar 18, 2022
2b4e454
First draft of the entrypoint
jankrepl Feb 8, 2022
886c306
Write initial test
jankrepl Feb 8, 2022
c9c2d49
First kind of working version/sketch
jankrepl Feb 8, 2022
4e561b7
Make download task work
jankrepl Feb 8, 2022
8385284
Merge branch 'master' into luigi-sketch
jankrepl Feb 9, 2022
4ae5fa9
Implement unzipping logic
jankrepl Feb 9, 2022
daeb4d7
Implement dry run
jankrepl Feb 10, 2022
621c6bb
Turn positionals into required options
jankrepl Feb 10, 2022
d3f97ac
Implement TopicExtractTask
jankrepl Feb 10, 2022
a4710eb
Implement topicfiltertask
jankrepl Feb 10, 2022
a1b0e86
Add create symlinks task
jankrepl Feb 10, 2022
2d7c068
Implement convertpdf task
jankrepl Feb 10, 2022
2a37dea
Implement parse task
jankrepl Feb 10, 2022
e34439b
Implement AddTask
jankrepl Feb 10, 2022
06df235
Improve logic in custom complete
jankrepl Feb 14, 2022
973b284
Handle keyboardinterrupt in topic-extract
jankrepl Feb 14, 2022
4da3450
Timeout experiments
jankrepl Feb 14, 2022
8ffbb61
Remove keyboardinterrupt catching
jankrepl Feb 14, 2022
5b64767
Fix typo and wrong task dependency
jankrepl Feb 14, 2022
b7ef2ca
Add small changes
jankrepl Feb 14, 2022
ccd88e0
Add some docstrings and annotations
jankrepl Feb 14, 2022
5ad1a75
Fix the unit test
jankrepl Feb 14, 2022
d186281
Write additional unit test
jankrepl Feb 14, 2022
91458e4
Make black happy
jankrepl Feb 14, 2022
7387257
Add pending to the check
jankrepl Feb 14, 2022
3a49d3e
Configure output capturing
jankrepl Feb 14, 2022
8d982fb
Merge branch 'master' into luigi-sketch
jankrepl Feb 15, 2022
4cc4208
Add local timeout hack
jankrepl Feb 15, 2022
7b55f2e
Only use local-scheduler
jankrepl Feb 15, 2022
523c186
Merge branch 'master' into luigi-sketch
jankrepl Feb 15, 2022
9f23d4c
Turn entrypoint verbosity into global variable
jankrepl Feb 15, 2022
5de2d95
Merge branch 'master' into luigi-sketch
jankrepl Feb 15, 2022
4982824
Fix source2parse and also postgres complete check
jankrepl Feb 15, 2022
af47a56
Add luigi to requirements
jankrepl Feb 15, 2022
d937dd1
Run black
jankrepl Feb 15, 2022
8623588
Correct flake8 mistakes
jankrepl Feb 15, 2022
db5768d
Fix isort problems
jankrepl Feb 15, 2022
9b19dc6
Fix typing
jankrepl Feb 15, 2022
03b433b
Add more docstrings
jankrepl Feb 15, 2022
91c0c71
Rerun formatting
jankrepl Feb 15, 2022
b4d3d7b
Nasty global variable date handling
jankrepl Feb 15, 2022
fd3f24d
Don't consider minutes and seconds
jankrepl Feb 15, 2022
7b56469
Rename task to be more versatile
jankrepl Feb 15, 2022
fb082c1
Write pseudocode for pubmed performfilter
jankrepl Feb 15, 2022
4857f83
Merge branch 'master' into luigi-sketch
jankrepl Feb 15, 2022
1179531
Don't run unzipping for pubmed
jankrepl Feb 15, 2022
9324789
WIP-performfiltering task
jankrepl Feb 15, 2022
69187d0
Implement subtree removal logic
jankrepl Feb 17, 2022
55df185
Make luigi less verbose
jankrepl Feb 17, 2022
a5eff8f
Merge branch 'master' into luigi-sketch
jankrepl Feb 17, 2022
f629ae1
Fix the immortal bug
jankrepl Feb 17, 2022
9702aa3
Make sure PerformFilteringTask zips pubmed-article-set
jankrepl Feb 17, 2022
857541d
Run formatting
jankrepl Feb 17, 2022
81cad76
Make sure unit tests are passing
jankrepl Feb 17, 2022
1ad559c
Update sphinx
jankrepl Feb 17, 2022
36b08f4
Fix linting
jankrepl Feb 17, 2022
6efc984
Update docstring
jankrepl Feb 17, 2022
164d033
Undo changes in tox.ini
jankrepl Feb 17, 2022
8f1b896
Add luigi config
jankrepl Feb 18, 2022
83006c9
Fix isort
jankrepl Feb 18, 2022
08f7c77
Try to fix sphinx warning
jankrepl Feb 18, 2022
45fadf2
Remove custom_timeout from the source code
jankrepl Feb 18, 2022
09b338e
Add type annotations everywhere
jankrepl Feb 18, 2022
eb8bce6
Add custom identifier logic
jankrepl Feb 21, 2022
2e9e576
Reformat
jankrepl Feb 21, 2022
3c17bb3
Break the line
jankrepl Feb 21, 2022
7c50ca4
Fix typos
jankrepl Feb 21, 2022
927fc89
Add forgotten bracket
jankrepl Feb 21, 2022
25e5fed
Add recursive enumeration to pubmed
jankrepl Feb 22, 2022
0d5f772
Add logging for each element in file
jankrepl Feb 22, 2022
cdc0d87
Skip download for arXiv articles with broken ID or version (#586)
FrancescoCasalegno Feb 21, 2022
c9dcb39
Add separate try except blocks for each source
jankrepl Feb 23, 2022
587e239
Fix bug
jankrepl Feb 23, 2022
03e8af1
Format nicely
jankrepl Feb 23, 2022
95847c5
Add the possibility of early stopping
jankrepl Feb 24, 2022
5e255d4
Small modification
jankrepl Feb 24, 2022
93569a3
Add iffy tests
jankrepl Feb 24, 2022
8620c87
Run formatter
jankrepl Feb 24, 2022
efa6821
Ignore a luigi warning
jankrepl Feb 24, 2022
7c81595
Use context manager
jankrepl Mar 8, 2022
ff183c0
Merge branch 'master' into luigi-sketch
jankrepl Mar 9, 2022
95827a2
Move luigi parameters to a config file
jankrepl Mar 9, 2022
0ff13e4
Remove requires/inherits decorator
EmilieDel Mar 15, 2022
bd77681
Fix linting and add header luigi.cfg
EmilieDel Mar 18, 2022
0d343d3
Add more info about run arguments
EmilieDel Mar 18, 2022
1 change: 1 addition & 0 deletions docs/conf.py
@@ -31,6 +31,7 @@
version = bluesearch.__version__

# -- General configuration ---------------------------------------------------
suppress_warnings = ["ref.ref"] # because of luigi.util.requires

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
1 change: 1 addition & 0 deletions docs/source/api/bluesearch.entrypoint.database.rst
@@ -14,6 +14,7 @@ Submodules
bluesearch.entrypoint.database.parent
bluesearch.entrypoint.database.parse
bluesearch.entrypoint.database.parse_mesh_rdf
bluesearch.entrypoint.database.run
bluesearch.entrypoint.database.schemas
bluesearch.entrypoint.database.topic_extract
bluesearch.entrypoint.database.topic_filter
7 changes: 7 additions & 0 deletions docs/source/api/bluesearch.entrypoint.database.run.rst
@@ -0,0 +1,7 @@
bluesearch.entrypoint.database.run module
=========================================

.. automodule:: bluesearch.entrypoint.database.run
   :members:
   :undoc-members:
   :show-inheritance:
44 changes: 44 additions & 0 deletions luigi.cfg
@@ -0,0 +1,44 @@
;Blue Brain Search is a text mining toolbox focused on scientific use cases.
;
;Copyright (C) 2020 Blue Brain Project, EPFL.
;
;This program is free software: you can redistribute it and/or modify
;it under the terms of the GNU Lesser General Public License as published by
;the Free Software Foundation, either version 3 of the License, or
;(at your option) any later version.
;
;This program is distributed in the hope that it will be useful,
;but WITHOUT ANY WARRANTY; without even the implied warranty of
;MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;GNU Lesser General Public License for more details.
;
;You should have received a copy of the GNU Lesser General Public License
;along with this program. If not, see <https://www.gnu.org/licenses/>.

[core]
autoload_range = true
log_level = INFO
local_scheduler = True

[GlobalParams]
source = pubmed

[DownloadTask]
from_month = 2021-12
output_dir = luigi-pipeline
identifier =
; empty string is considered default value

[TopicExtractTask]
mesh_topic_db = luigi-pipeline/mesh_topic_db.json

[TopicFilterTask]
filter_config = luigi-pipeline/filter-config.jsonl

[ConvertPDFTask]
grobid_host = 0.0.0.0
grobid_port = 8070

[AddTask]
db_url = luigi-pipeline/my-db.db
db_type = sqlite
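
The `luigi.cfg` above groups parameters into one section per task. As a minimal sketch of how those sections resolve to values, the snippet below parses an excerpt of the file with Python's standard `configparser`. Luigi's default cfg parser builds on `configparser`, but luigi additionally converts each value to the declared parameter type, which this sketch does not reproduce. The `CFG_EXCERPT` string and the `load_task_params` helper are hypothetical names used only for illustration.

```python
from __future__ import annotations

from configparser import ConfigParser

# Excerpt of the luigi.cfg added in this PR (license header omitted).
CFG_EXCERPT = """
[core]
log_level = INFO
local_scheduler = True

[GlobalParams]
source = pubmed

[DownloadTask]
from_month = 2021-12
output_dir = luigi-pipeline
identifier =
; empty string is considered default value

[ConvertPDFTask]
grobid_host = 0.0.0.0
grobid_port = 8070
"""


def load_task_params(text: str, section: str) -> dict[str, str]:
    """Return the raw (string) parameters of one config section.

    Hypothetical helper: luigi performs this lookup internally per task class.
    """
    parser = ConfigParser()
    parser.read_string(text)
    return dict(parser[section])


download = load_task_params(CFG_EXCERPT, "DownloadTask")
print(download["from_month"])        # raw string, not yet a date
print(repr(download["identifier"]))  # empty string when no value is given

# With plain configparser, typed access has to be requested explicitly.
parser = ConfigParser()
parser.read_string(CFG_EXCERPT)
print(parser.getboolean("core", "local_scheduler"))
print(parser.getint("ConvertPDFTask", "grobid_port"))
```

Keeping one section per task class means a value such as `grobid_port` can be changed without touching the command line or the task code.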
1 change: 1 addition & 0 deletions requirements.txt
@@ -28,6 +28,7 @@ ipython==7.31.1
ipywidgets==7.6.3
jupyterlab==3.0.17
langdetect==1.0.9
luigi==3.0.3
mashumaro==3.0
numpy==1.21.0
pandas==1.3.0
5 changes: 3 additions & 2 deletions setup.py
@@ -60,10 +60,11 @@
 "ipywidgets",
 "jupyterlab>=3",
 "langdetect",
-"numpy>=1.20.1",
-"pandas>=1",
+"luigi",
 # Serialization framework on top of dataclasses, e.g. 'Article' to and from JSON.
 "mashumaro>=3.0",
+"numpy>=1.20.1",
+"pandas>=1",
 "pg8000",
 "python-dotenv",
 "requests",
1 change: 1 addition & 0 deletions src/bluesearch/entrypoint/database/add.py
@@ -124,6 +124,7 @@ def run(
sentence_mappings = []

for article in articles:
logger.info(f"Processing {article.uid}")

article_mapping = {
"article_id": article.uid,
6 changes: 6 additions & 0 deletions src/bluesearch/entrypoint/database/parent.py
@@ -14,6 +14,7 @@
init,
parse,
parse_mesh_rdf,
run,
topic_extract,
topic_filter,
)
@@ -72,6 +73,11 @@ def main(argv: Sequence[str] | None = None) -> int:
init_parser=parse.init_parser,
run=parse.run,
),
"run": Cmd(
help="Run the pipeline.",
init_parser=run.init_parser,
run=run.run,
),
"topic-extract": Cmd(
help="Extract topic of article(s).",
init_parser=topic_extract.init_parser,
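
The hunk above registers the new `run` subcommand in a name-to-`Cmd` mapping. A minimal sketch of how such a registry can drive `argparse` follows. Only the `Cmd(help=..., init_parser=..., run=...)` shape and the `main` signature are taken from the diff; the `Cmd` dataclass definition, the `init_run_parser` and `run_pipeline` functions, the `--source` and `--dry-run` options, and the `database` program name are all assumptions made for illustration, not the actual contents of `bluesearch.entrypoint.database.run`.

```python
from __future__ import annotations

import argparse
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Cmd:
    """One subcommand: help text, parser setup, and the function to run."""

    help: str
    init_parser: Callable[[argparse.ArgumentParser], argparse.ArgumentParser]
    run: Callable[..., int]


def init_run_parser(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Hypothetical options; the real ones are defined in the run module.
    parser.add_argument("--source", required=True, choices=["arxiv", "pubmed"])
    parser.add_argument("--dry-run", action="store_true")
    return parser


def run_pipeline(*, source: str, dry_run: bool) -> int:
    # Stand-in for run.run: only reports what it was asked to do.
    print(f"source={source} dry_run={dry_run}")
    return 0


def main(argv: Sequence[str] | None = None) -> int:
    cmds = {
        "run": Cmd(
            help="Run the pipeline.",
            init_parser=init_run_parser,
            run=run_pipeline,
        ),
    }
    parser = argparse.ArgumentParser(prog="database")  # hypothetical name
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, cmd in cmds.items():
        cmd.init_parser(subparsers.add_parser(name, help=cmd.help))

    # Parsed options become keyword arguments of the selected command.
    kwargs = vars(parser.parse_args(argv))
    command = kwargs.pop("command")
    return cmds[command].run(**kwargs)
```

With this layout, adding a subcommand only requires a new dictionary entry, which matches the five-line addition in the hunk.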