Validate Retagging Experimentation #19

Open · wants to merge 26 commits into main
Conversation

agombert

Validate Retagging Experimentation

In this pull request, we validate the retagging experiment results using a newly corrected/retagged file. The goal is to assess the performance of the updated model trained on this new data.

New Data Source

The new data source can be found in /data/raw/retagging, specifically in the file named allMeSH_2021.2016-2021.jsonl. This dataset contains corrected and retagged annotations for various documents. Notably, it includes annotations for five key tags: "Artificial Intelligence," "HIV," "Data Collection," "Mathematics," and "Geography."
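As a quick sanity check, something like the following can be used to peek at the new file and count how often the five tags of interest appear. The `meshMajor` field name follows the usual allMeSH layout and is an assumption here, not something confirmed in this PR.

```python
import json
from collections import Counter

PATH = "data/raw/retagging/allMeSH_2021.2016-2021.jsonl"
TAGS = {"Artificial Intelligence", "HIV", "Data Collection", "Mathematics", "Geography"}

counts = Counter()
with open(PATH) as f:
    for line in f:
        doc = json.loads(line)
        # "meshMajor" is assumed to hold the (re)tagged MeSH labels for each document
        counts.update(tag for tag in doc.get("meshMajor", []) if tag in TAGS)

print(counts)  # rough per-tag frequencies in the retagged file
```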

Environment Setup

Before validating the experiment results, it is essential to set up the environment correctly. Here are the steps to follow (a shell sketch of the corresponding commands is given after the list):

  1. On your local machine or a similar g5.12xlarge instance, ensure that you are on the main branch.
  2. Activate your Python environment using Poetry.
  3. Ensure that you have the latest changes by pulling from the remote repository.
  4. Fetch the data from DVC that is required for the experimentation.
  5. Set your Weights and Biases API key as an environment variable (WANDB_API_KEY).
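A minimal shell sketch of these steps; the exact Poetry workflow and DVC targets may differ in the repository, and the API key is a placeholder:

```bash
git checkout main && git pull          # steps 1 and 3: be on main with the latest changes
poetry install && poetry shell         # step 2: activate the Python environment
dvc pull                               # step 4: fetch the data tracked by DVC
export WANDB_API_KEY=<your-api-key>    # step 5: Weights & Biases credentials
```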

Launching Preprocessing and Training

To validate the experimentation, we will perform preprocessing and training with the following method:

Method: Using DVC

  1. Navigate to the pipelines/bertmesh/ directory.
  2. Reproduce the DVC pipeline to execute preprocessing and training (see the commands sketched below).
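For example (assuming the dvc.yaml for this experiment lives in that directory):

```bash
cd pipelines/bertmesh/
dvc repro      # runs the preprocessing and training stages defined in dvc.yaml
```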

After Training

After initiating the training, please wait until the process completes. Once training is finished, we will proceed with the evaluation of model performance.

The next steps include running example documents with problematic tags through both the model currently in use and the model you have trained. The results should demonstrate an improvement in tagging accuracy and alignment with the newly corrected and retagged dataset.
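A rough sketch of that comparison, assuming both checkpoints load as Hugging Face multi-label sequence-classification models; the actual BertMesh model class, tokenizer, checkpoint paths, and the 0.5 threshold below are placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

EXAMPLES = ["This grant is about malaria and HIV"]   # documents with problematic tags
CHECKPOINTS = {"current": "path/to/current-model", "retrained": "path/to/new-checkpoint"}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer
inputs = tokenizer(EXAMPLES, return_tensors="pt", truncation=True, padding=True)

for name, path in CHECKPOINTS.items():
    model = AutoModelForSequenceClassification.from_pretrained(path).eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)        # per-tag probabilities
    for i, text in enumerate(EXAMPLES):
        tags = [model.config.id2label[j] for j, p in enumerate(probs[i]) if p > 0.5]
        print(f"{name}: {text!r} -> {tags}")
```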

@agombert
Author

A few comments from the experiments:

  • Here is the training-set loss curve during training: [training loss plot]
  • It looks like there is something wrong with the training: the metrics are always 0 in the logs.
  • The best model is the first saved checkpoint.
  • When applying any checkpoint to a random sample and then applying the sigmoid, the probabilities go from 0.8e-4 through 0.01, 0.015, 0.03, up to 0.055, but remain roughly "uniform", so no signal is captured.
  • When computing the loss over those 100 examples:

| model | loss on sample |
| --- | --- |
| "best" | 0.0038 |
| current | 0.0078 |
| last iteration | 0.0312 |
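For reference, a minimal sketch of how the sigmoid probabilities and such a per-sample loss can be computed from raw model outputs; the logits/labels tensors and the label-space size below are placeholders, not the project's actual evaluation code:

```python
import torch

# Placeholders: in practice these come from running a checkpoint over the sampled documents.
logits = torch.randn(100, 28000)     # (n_examples, n_MeSH_labels) raw model outputs
labels = torch.zeros_like(logits)    # multi-hot gold tags for the same examples

probs = torch.sigmoid(logits)        # the near-uniform probabilities mentioned above
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)

print(probs.max(dim=1).values[:5])   # max probability per example
print(loss.item())                   # comparable to the per-model losses in the table
```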

We tried the latest model from Juan (`git checkout c021da7`) and went through some evaluations, modifying evaluation_model.py to make it work. There was the sigmoid problem there too, as well as other issues: for instance, the variable names had to be modified.

We evaluated a sample of 10 random examples (though some may be present in the training set, as I don't have access to the split and regenerating it would have taken too long)... I think there is also some confusion between ids and labels at some points.

However, here is what I observed:

  • Across the 10 examples there are 25 predictions, meaning the model can predict something!
  • On the example 'This grant is about malaria and HIV', the max probability is 0.49 (id 6424, which is "Colon, Ascending" in the config, though I doubt that label is actually True).
  • On the example 'My name is Arnault and I live in barcelona', the max probability is 0.90 (same id as the previous example).

@@ -30,7 +30,7 @@ class BertMeshTrainingArguments(TrainingArguments):
         default=8
     )  # set to 256 in grants-tagger repo
     per_device_eval_batch_size: int = field(default=8)
-    gradient_accumulation_steps: int = field(default=1)
+    gradient_accumulation_steps: int = field(default=2)
Collaborator


we should not change the defaults ideally, just the params that get passed

Author


put back the default to 1 👍
