Commit
Merge pull request #311 from 1190303125/2.0.0
modify resume, wandb and instruction
StevenTang1998 authored Dec 27, 2022
2 parents 0a3dc3f + 8cdeec0 commit ebcef12
Showing 9 changed files with 131 additions and 28 deletions.
100 changes: 100 additions & 0 deletions asset/basic_training.md
@@ -0,0 +1,100 @@
# Basic Training
## config
You can set your configurations in any of three equivalent ways:
* cmd
* config files
* yaml

### cmd
You can change configurations on the command line with ``--xx=yy``, where ``xx`` is the name of the parameter and ``yy`` is the corresponding value. For example:

```bash
python run_textbox.py --model=BART --model_path=facebook/bart-base --epochs=1
```

The command line is suitable for **a few temporary** modifications, such as:
* ``model``
* ``model_path``
* ``dataset``
* ``epochs``
* ...

### config files

You can also modify configurations through local config files:
```bash
python run_textbox.py ... --config_files <config-file-one> <config-file-two>
```

Each config file is an additional yaml file, for example:

```yaml
efficient_methods: ['prompt-tuning']
```
Config files are suitable for **a large number of** modifications or for **long-term** modifications (see the example after this list), such as:
* ``efficient_methods``
* ``efficient_kwargs``
* ...
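For instance, a minimal sketch (the file name ``prompt_tuning.yaml`` is only an illustration) that writes one extra config file and passes it to the run script:

```bash
# write an additional yaml config file (the file name is arbitrary)
cat > prompt_tuning.yaml << 'EOF'
efficient_methods: ['prompt-tuning']
EOF

# pass it along with the usual command-line options
python run_textbox.py --model=BART --model_path=facebook/bart-base \
    --dataset=samsum --config_files prompt_tuning.yaml
```
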
### yaml
The original (default) configurations are stored in yaml files. You can check the values there, but modifying these files is not recommended except for **permanent** modifications of the dataset. The files live under ``textbox/properties``:
* ``overall.yaml``
* ``dataset/*.yaml``
* ``model/*.yaml``

## trainer
You can choose an optimizer and scheduler through `optimizer=<optimizer-name>` and `scheduler=<scheduler-name>`. We provide a wrapper around the **pytorch optimizers**, which means parameters like `epsilon` or `warmup_steps` can be specified with the keyword dictionaries `optimizer_kwargs={'epsilon': ... }` and `scheduler_kwargs={'warmup_steps': ... }`. See the [pytorch optimizers](https://pytorch.org/docs/stable/optim.html#algorithms) and schedulers for a complete reference. <!-- TODO -->
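
A hedged sketch follows; the optimizer and scheduler names (`adamw`, `linear`) and the exact keyword keys are assumptions for illustration, not a verified list of supported values:

```bash
# 'adamw' and 'linear' are assumed names; the quotes keep the keyword
# dictionaries intact when the shell parses the command
python run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=samsum \
    --optimizer=adamw --scheduler=linear \
    --optimizer_kwargs="{'epsilon': 1e-6}" \
    --scheduler_kwargs="{'warmup_steps': 100}"
```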

Validation frequency is introduced to validate the model **every few batch steps or epochs**. Specify `valid_strategy` (either `'step'` or `'epoch'`) and `valid_steps=<int>` to adjust the pace. In particular, the traditional train-validate paradigm is the special case `valid_strategy=epoch` with `valid_steps=1`.
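
For example, to validate every 500 batch steps instead of once per epoch (500 is an arbitrary choice):

```bash
python run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=samsum \
    --valid_strategy=step --valid_steps=500
```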

`max_save=<int>` indicates **the maximum number of saved files** (checkpoints and generated corpora during evaluation): `-1` saves every file, `0` saves no files, `1` saves only the file with the best score, and `n` saves both the best and the last $n-1$ files.
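
For instance, to keep only the best checkpoint and the most recent one:

```bash
# max_save=2 keeps the best file plus the last one
python run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=samsum \
    --max_save=2
```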

The score of the current checkpoint is calculated according to ``metrics_for_best_model``, and the evaluation metrics to compute are specified with ``metrics`` ([full list](evaluation.md)). **Early stopping** can be configured with `stopping_steps=<int>`, based on the score of every checkpoint.


```bash
python run_textbox.py ... --stopping_steps=8 \
    --metrics_for_best_model=\[\'rouge-1\', \'rouge-w\'\] \
    --metrics=\[\'rouge\'\]
```

You can resume from a **previous checkpoint** through ``model_path=<checkpoint_path>``. When you want to restore **all trainer parameters**, such as the optimizer and start_epoch, set ``resume_training=True``. Otherwise, only the **model and tokenizer** will be loaded. The script below resumes training from the checkpoint at ``saved/BART-samsum-2022-Dec-18_20-57-47/checkpoint_best``:

```bash
python run_textbox.py --model_path=saved/BART-samsum-2022-Dec-18_20-57-47/checkpoint_best \
    --resume_training=True
```

Other commonly used parameters include `epochs=<int>` and `max_steps=<int>` (the maximum number of training epochs and of batch steps, respectively; if `max_steps` is set, `epochs` is ignored), `learning_rate=<float>`, `train_batch_size=<int>`, `weight_decay=<bool>`, and `grad_clip=<bool>`.
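
A sketch combining these options (the specific values are illustrative only):

```bash
python run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=samsum \
    --epochs=10 --learning_rate=3e-5 --train_batch_size=16
```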

### Partial Experiment

You can run a partial experiment with `do_train`, `do_valid`, and `do_test`. You can test your pipeline and debug by setting `quick_test=<amount-of-data-to-load>` to load just a few examples, as in the sketch below.
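
For example (16 is an arbitrary amount of data to load):

```bash
python run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=samsum \
    --epochs=1 --quick_test=16
```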

The following script loads a trained model from a local path and conducts generation and evaluation without training or validation:
```bash
python run_textbox.py --model_path=saved/BART-samsum-2022-Dec-18_20-57-47/checkpoint_best \
    --do_train=False --do_valid=False
```

## wandb

If you are running your code in a Jupyter environment, you may want to log in by simply setting an environment variable (note that your key will be stored in plain text):

```python
%env WANDB_API_KEY=<your-key>
```
The behavior of W&B is controlled with the `wandb` parameter.

If you are debugging your model, you may want to **disable W&B** with `--wandb=disabled`, so that **none of the metrics** are recorded. You can also disable **synchronization only** with `--wandb=offline` and enable it again with `--wandb=online` to upload the results to the cloud. The parameter can also be configured in the yaml file, for example:

```yaml
wandb: online
```
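
For example, to turn off W&B entirely during a debugging run:

```bash
python run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=samsum \
    --wandb=disabled
```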

The local files can be uploaded by executing `wandb sync` in the command line.

After configuration, you can silence wandb prompts by setting the environment variable `export WANDB_SILENT=true`. For more information, see the [wandb documentation](https://docs.wandb.ai).
12 changes: 1 addition & 11 deletions install.sh
@@ -35,7 +35,7 @@ esac

echo "Installation may take a few minutes."
echo -e "\033[0;32mInstalling torch ...\033[0m"
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

echo -e "\033[0;32mInstalling requirements ...\033[0m"
pip install -r requirements.txt
@@ -75,16 +75,6 @@ chmod +rx $F2RExpDIR/WordNet-2.0.exc.db
pip uninstall py-rouge
pip install rouge > /dev/null

echo -e "\033[0;32mInstalling requirements (libxml) ...\033[0m"
if [[ "$OSTYPE" == "darwin"* ]]; then
brewinstall libxml2 cpanminus
cpanm --force XML::Parser
else
if [ -x "$(command -v apt-get)" ]; then sudo apt-get install libxml-parser-perl
elif [ -x "$(command -v yum)" ]; then sudo yum install -y "perl(XML::LibXML)"
else echo -e '\033[0;31mFailed to install libxml. See https://github.com/pltrdy/files2rouge/issues/9 for more information.\033[0m' && exit;
fi
fi

echo -e "\033[0;32mInstalling requirements (transformers) ...\033[0m"
git clone https://github.com/RUCAIBox/transformers.git
16 changes: 16 additions & 0 deletions instructions/RNN.md
@@ -0,0 +1,16 @@
## RNN

You can train an RNN encoder-decoder with attention from scratch with this model. Three variants are available:
* RNN
* GRU
* LSTM

You can choose among them with ``model=RNN``, ``model=GRU``, or ``model=LSTM``. Meanwhile, you can check or modify the default parameters of each model in ``textbox/properties/model/rnn.yaml`` (``gru.yaml``, ``lstm.yaml``).

Example usage:

```bash
python run_textbox.py \
--model=RNN \
--dataset=samsum
```
2 changes: 2 additions & 0 deletions textbox/config/configurator.py
@@ -262,6 +262,8 @@ def _set_default_parameters(self):
self.setdefault('valid_strategy', 'epoch')
self.setdefault('valid_steps', 1)
self.setdefault('disable_tqdm', False)
self.setdefault('resume_training',True)
self.setdefault('wandb', 'online')
self._simplify_parameter('optimizer')
self._simplify_parameter('scheduler')
self._simplify_parameter('src_lang')
1 change: 1 addition & 0 deletions textbox/properties/overall.yaml
@@ -5,6 +5,7 @@ seed: 2020
state: INFO
reproducibility: True
data_path: 'dataset/'
wandb: 'online'

# training settings
epochs: 50
5 changes: 4 additions & 1 deletion textbox/quick_start/experiment.py
@@ -37,6 +37,8 @@ def __init__(
config_dict: Optional[Dict[str, Any]] = None,
):
self.config = Config(model, dataset, config_file_list, config_dict)
wandb_setting = 'wandb ' + self.config['wandb']
os.system(wandb_setting)
self.__extended_config = None

self.accelerator = Accelerator(gradient_accumulation_steps=self.config['accumulation_steps'])
@@ -94,7 +96,8 @@ def _on_experiment_start(self, extended_config: Optional[dict]):
self.valid_result: Optional[ResultType] = None
self.test_result: Optional[ResultType] = None
if config['load_type'] == 'resume':
self.trainer.resume_checkpoint(config['model_path'])
if config['resume_training']:
self.trainer.resume_checkpoint(config['model_path'])
self.model.from_pretrained(config['model_path'])

def _do_train_and_valid(self):
11 changes: 4 additions & 7 deletions textbox/trainer/trainer.py
@@ -364,18 +364,15 @@ def save_checkpoint(self):
def save_generated_text(self, generated_corpus: List[str], is_valid: bool = False):
r"""Store the generated text by our model into `self.saved_text_filename`."""
saved_text_filename = self.saved_text_filename
if not is_valid:
self._summary_tracker.add_corpus('test', generated_corpus)
else:
path_to_save = self.saved_model_filename + '_epoch-' + str(self.timestamp.valid_epoch)
saved_text_filename = os.path.join(path_to_save, 'generation.txt')
os.makedirs(path_to_save, exist_ok=True)
path_to_save = self.saved_model_filename + '_epoch-' + str(self.timestamp.valid_epoch)
saved_text_filename = os.path.join(path_to_save, 'generation.txt')
os.makedirs(path_to_save, exist_ok=True)
with open(saved_text_filename, 'w') as fout:
for text in generated_corpus:
fout.write(text + '\n')

def resume_checkpoint(self, resume_dir: str):
r"""Load the model parameters information and training information.
r"""Load training information.
Args:
resume_dir: the checkpoint file (specific by `model_path`).
4 changes: 3 additions & 1 deletion textbox/utils/argument_list.py
@@ -21,6 +21,7 @@
'_hyper_tuning', # hyper tuning
'multi_seed', # multiple random seed
'romanian_postprocessing',
'wandb'
]

training_parameters = [
@@ -43,7 +44,8 @@
'weight_decay', # common parameters
'accumulation_steps', # accelerator
'disable_tqdm', # tqdm
'pretrain_task' # pretraining
'pretrain_task', # pretraining
'resume_training'
]

evaluation_parameters = [
8 changes: 0 additions & 8 deletions textbox/utils/dashboard.py
@@ -435,14 +435,6 @@ def add_scalar(self, tag: str, scalar_value: Union[float, int]):
if self._is_local_main_process and not self.tracker_finished and self.axes is not None:
wandb.log(info, step=self.axes.train_step, commit=False)

def add_corpus(self, tag: str, corpus: Iterable[str]):
r"""Add a corpus to summary."""
if tag.startswith('valid'):
self._current_epoch._update_metrics({'generated_corpus': '\n'.join(corpus)})
if self._is_local_main_process and not self.tracker_finished:
_corpus = wandb.Table(columns=[tag], data=pd.DataFrame(corpus))
wandb.log({tag: _corpus}, step=self.axes.train_step)


root = None

