Run pre-commit
Ubuntu committed Sep 12, 2023
1 parent f7d70b0 commit 5cb92d4
Showing 10 changed files with 24 additions and 19 deletions.
20 changes: 10 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
@@ -72,12 +72,12 @@ in square brackets the commands that are not implemented yet
## ⚙️Preprocess

This process is optional to run, since it can be directly managed by the `Train` process.
- If you run it manually, it will store the data locally first, which can help if you need to fine-tune, rerun, etc. in the future.
- If you do not run it, the `train` step will preprocess and then run, without any extra I/O operations on disk, which may add latency depending on the infrastructure.

It requires data in `jsonl` format for parallelization purposes. In `data/raw` you can find `allMeSH_2021.jsonl` already prepared for the preprocessing step.

If your data is in `json` format, transform it to `jsonl` with tools such as `jq` or using Python.
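As a minimal sketch of the Python route (assuming the source file holds a top-level JSON array of records; `json_to_jsonl` is an illustrative helper, not part of the CLI):

```python
import json


def json_to_jsonl(json_path: str, jsonl_path: str) -> None:
    """Write each record of a top-level JSON array as one JSONL line."""
    with open(json_path) as f:
        records = json.load(f)  # assumes the file contains a list of objects
    with open(jsonl_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

The equivalent with `jq` is a one-liner (`jq -c '.[]' data.json > data.jsonl`), but the Python version is handy when records need cleaning on the way through.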
@@ -96,9 +96,9 @@ your own data under development.
### Preprocessing bertmesh

```
Usage: grants-tagger preprocess mesh [OPTIONS] DATA_PATH SAVE_TO_PATH
                                     MODEL_KEY
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * data_path TEXT Path to mesh.jsonl [default: None] [required] │
│ * save_to_path TEXT Path to save the serialized PyArrow dataset after preprocessing [default: None] [required] │
@@ -122,7 +122,7 @@ The command will train a model and save it to the specified path. Currently we s

### bertmesh
```
Usage: grants-tagger train bertmesh [OPTIONS] MODEL_KEY DATA_PATH
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * model_key TEXT Pretrained model key. Local path or HF location [default: None] [required] │
@@ -143,7 +143,7 @@ The command will train a model and save it to the specified path. Currently we s

#### About `model_key`
`model_key` possible values are:
- A HF location for a pretrained / finetuned model
- "" to load a model by default and train from scratch (`microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`)

#### About `sharding`
@@ -152,7 +152,7 @@ to improve performance on big datasets. To enable it:
- set `shards` to a value bigger than 1 (recommended: the same number as CPU cores)
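
A strided split like the sketch below is one common way sharding works; this is an illustration of the idea, not the tool's actual implementation:

```python
def shard(records: list, num_shards: int, index: int) -> list:
    """Return the `index`-th of `num_shards` roughly equal strided shards."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    return records[index::num_shards]
```

Each worker processes one shard in parallel, and the shards together cover the whole dataset exactly once.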

#### Other arguments
Besides those arguments, feel free to add any other `TrainingArguments` from Hugging Face or Weights & Biases (wandb).
This is the example used to train a model reaching ~0.6 F1, also available at `examples/train_by_epochs.sh`:
```commandline
grants-tagger train bertmesh \
@@ -365,7 +365,7 @@ and you would be able to run `grants_tagger preprocess epmc_mesh ...`

## 🚦 Test

To run the tests you need to have installed the `dev` dependencies first.
This is done by running `poetry install --with dev` after you are in the shell (`poetry shell`).

Run tests with `pytest`. If you want to write some additional tests,
2 changes: 1 addition & 1 deletion examples/augment.sh
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
grants-tagger augment mesh [FOLDER_AFTER_PREPROCESSING] [SET_YOUR_OUTPUT_FOLDER_HERE] \
--min-examples 25 \
--concurrent-calls 25
2 changes: 1 addition & 1 deletion examples/augment_specific_tags.sh
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
grants-tagger augment mesh [FOLDER_AFTER_PREPROCESSING] [SET_YOUR_OUTPUT_FOLDER_HERE] \
--tags-file-path tags_to_augment.txt \
--examples 25 \
--concurrent-calls 25
2 changes: 1 addition & 1 deletion examples/preprocess_splitting_by_fract.sh
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
grants-tagger preprocess mesh data/raw/allMeSH_2021.jsonl [SET_YOUR_OUTPUT_FOLDER_HERE] '' \
--test-size 0.05
2 changes: 1 addition & 1 deletion examples/preprocess_splitting_by_rows.sh
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
grants-tagger preprocess mesh data/raw/allMeSH_2021.jsonl [SET_YOUR_OUTPUT_FOLDER_HERE] '' \
--test-size 25000
2 changes: 1 addition & 1 deletion examples/preprocess_splitting_by_years.sh
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
grants-tagger preprocess mesh data/raw/allMeSH_2021.jsonl [SET_YOUR_OUTPUT_FOLDER_HERE] '' \
--test-size 25000 \
--train-years 2016,2017,2018,2019 \
--test-years 2020,2021
2 changes: 1 addition & 1 deletion examples/resume_train_by_epoch.sh
Original file line number Diff line number Diff line change
Expand Up @@ -34,4 +34,4 @@ grants-tagger train bertmesh \
--save_strategy epoch \
--wandb_project wellcome-mesh \
--wandb_name test-train-all \
--wandb_api_key ${WANDB_API_KEY}
2 changes: 1 addition & 1 deletion examples/resume_train_by_steps.sh
Original file line number Diff line number Diff line change
Expand Up @@ -36,4 +36,4 @@ grants-tagger train bertmesh \
--save_steps 10000 \
--wandb_project wellcome-mesh \
--wandb_name test-train-all \
--wandb_api_key ${WANDB_API_KEY}
2 changes: 1 addition & 1 deletion grants_tagger_light/augmentation/prompt.template
Original file line number Diff line number Diff line change
Expand Up @@ -9,4 +9,4 @@ ABSTRACT:
{ABSTRACT}

TOPIC:
{TOPIC}
7 changes: 6 additions & 1 deletion scripts/create_xlinear_bertmesh_comparison_csv.py
Original file line number Diff line number Diff line change
@@ -130,7 +130,12 @@ def create_comparison_csv(
active_grants = active_grants[~active_grants["Synopsis"].isna()]
    active_grants = active_grants.sample(frac=1)  # sample returns a new frame; assign it back to actually shuffle
active_grants_sample = active_grants.iloc[:active_portfolio_sample]
    active_grants_sample = pd.DataFrame(
        {
            "abstract": active_grants_sample["Synopsis"],
            "Reference": active_grants_sample["Reference"],
        }
    )
active_grants_sample["active_portfolio"] = 1
active_grants.drop_duplicates(subset="abstract", inplace=True)
grants_sample = pd.concat([grants_sample, active_grants_sample])
