Run pre-commit
Ubuntu committed Sep 12, 2023
1 parent f7d70b0 commit 5cb92d4
Showing 10 changed files with 24 additions and 19 deletions.
20 changes: 10 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
@@ -72,12 +72,12 @@ in square brackets the commands that are not implemented yet
## ⚙️Preprocess

This process is optional to run, since it can be directly managed by the `Train` process.
- If you run it manually, it will store the data locally first, which can help if you need to fine-tune, rerun, etc. in the future.
- If you do not run it, the `train` step will preprocess and then run, without any extra I/O operations on disk, which may add latency depending on the infrastructure.

It requires data in `jsonl` format for parallelization purposes. In `data/raw` you can find `allMeSH_2021.jsonl` already prepared for the preprocessing step.

If your data is in `json` format, transform it to `jsonl` with tools such as `jq` or using Python.
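As a minimal sketch of the Python route (assuming the source file holds a top-level JSON array of records; `json_to_jsonl` is an illustrative helper, not part of the CLI):

```python
import json


def json_to_jsonl(json_path: str, jsonl_path: str) -> None:
    """Write each record of a top-level JSON array as one JSONL line."""
    with open(json_path) as f:
        records = json.load(f)  # assumes the file contains a list of objects
    with open(jsonl_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

The equivalent with `jq` is a one-liner (`jq -c '.[]' data.json > data.jsonl`), but the Python version is handy when records need cleaning on the way through.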
@@ -96,9 +96,9 @@ your own data under development.
### Preprocessing bertmesh

```
Usage: grants-tagger preprocess mesh [OPTIONS] DATA_PATH SAVE_TO_PATH
                                     MODEL_KEY
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * data_path TEXT Path to mesh.jsonl [default: None] [required] │
│ * save_to_path TEXT Path to save the serialized PyArrow dataset after preprocessing [default: None] [required] │
@@ -122,7 +122,7 @@ The command will train a model and save it to the specified path. Currently we s

### bertmesh
```
Usage: grants-tagger train bertmesh [OPTIONS] MODEL_KEY DATA_PATH
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * model_key TEXT Pretrained model key. Local path or HF location [default: None] [required] │
@@ -143,7 +143,7 @@ The command will train a model and save it to the specified path. Currently we s

#### About `model_key`
`model_key` possible values are:
- A HF location for a pretrained / finetuned model
- "" to load a model by default and train from scratch (`microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`)

#### About `sharding`
@@ -152,7 +152,7 @@ to improve performance on big datasets. To enable it:
- set `shards` to a value bigger than 1 (recommended: the same number as CPU cores)
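
A strided split like the sketch below is one common way sharding works; this is an illustration of the idea, not the tool's actual implementation:

```python
def shard(records: list, num_shards: int, index: int) -> list:
    """Return the `index`-th of `num_shards` roughly equal strided shards."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    return records[index::num_shards]
```

Each worker processes one shard in parallel, and the shards together cover the whole dataset exactly once.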

#### Other arguments
Besides those arguments, feel free to add any other `TrainingArguments` from Hugging Face or Weights & Biases (wandb).
This is the example used to train a model reaching ~0.6 F1, also available at `examples/train_by_epochs.sh`:
```commandline
grants-tagger train bertmesh \
@@ -365,7 +365,7 @@ and you would be able to run `grants_tagger preprocess epmc_mesh ...`

## 🚦 Test

To run the tests you need to have installed the `dev` dependencies first.
This is done by running `poetry install --with dev` after you are in the shell (`poetry shell`).

Run tests with `pytest`. If you want to write some additional tests,
2 changes: 1 addition & 1 deletion examples/augment.sh
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
grants-tagger augment mesh [FOLDER_AFTER_PREPROCESSING] [SET_YOUR_OUTPUT_FOLDER_HERE] \
--min-examples 25 \
--concurrent-calls 25
2 changes: 1 addition & 1 deletion examples/augment_specific_tags.sh
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
grants-tagger augment mesh [FOLDER_AFTER_PREPROCESSING] [SET_YOUR_OUTPUT_FOLDER_HERE] \
--tags-file-path tags_to_augment.txt \
--examples 25 \
--concurrent-calls 25
2 changes: 1 addition & 1 deletion examples/preprocess_splitting_by_fract.sh
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
grants-tagger preprocess mesh data/raw/allMeSH_2021.jsonl [SET_YOUR_OUTPUT_FOLDER_HERE] '' \
--test-size 0.05
2 changes: 1 addition & 1 deletion examples/preprocess_splitting_by_rows.sh
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
grants-tagger preprocess mesh data/raw/allMeSH_2021.jsonl [SET_YOUR_OUTPUT_FOLDER_HERE] '' \
--test-size 25000
2 changes: 1 addition & 1 deletion examples/preprocess_splitting_by_years.sh
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
grants-tagger preprocess mesh data/raw/allMeSH_2021.jsonl [SET_YOUR_OUTPUT_FOLDER_HERE] '' \
--test-size 25000 \
--train-years 2016,2017,2018,2019 \
--test-years 2020,2021
2 changes: 1 addition & 1 deletion examples/resume_train_by_epoch.sh
Original file line number Diff line number Diff line change
Expand Up @@ -34,4 +34,4 @@ grants-tagger train bertmesh \
--save_strategy epoch \
--wandb_project wellcome-mesh \
--wandb_name test-train-all \
--wandb_api_key ${WANDB_API_KEY}
2 changes: 1 addition & 1 deletion examples/resume_train_by_steps.sh
Original file line number Diff line number Diff line change
Expand Up @@ -36,4 +36,4 @@ grants-tagger train bertmesh \
--save_steps 10000 \
--wandb_project wellcome-mesh \
--wandb_name test-train-all \
--wandb_api_key ${WANDB_API_KEY}
2 changes: 1 addition & 1 deletion grants_tagger_light/augmentation/prompt.template
Original file line number Diff line number Diff line change
Expand Up @@ -9,4 +9,4 @@ ABSTRACT:
{ABSTRACT}

TOPIC:
{TOPIC}
7 changes: 6 additions & 1 deletion scripts/create_xlinear_bertmesh_comparison_csv.py
Original file line number Diff line number Diff line change
@@ -130,7 +130,12 @@ def create_comparison_csv(
active_grants = active_grants[~active_grants["Synopsis"].isna()]
    active_grants = active_grants.sample(frac=1)  # sample returns a new frame; assign it back to actually shuffle
active_grants_sample = active_grants.iloc[:active_portfolio_sample]
    active_grants_sample = pd.DataFrame(
        {
            "abstract": active_grants_sample["Synopsis"],
            "Reference": active_grants_sample["Reference"],
        }
    )
active_grants_sample["active_portfolio"] = 1
active_grants.drop_duplicates(subset="abstract", inplace=True)
grants_sample = pd.concat([grants_sample, active_grants_sample])
