Add and streamline documentation for evaluations (#3)
* Add info messages to printout of experiments/evaluate.py

* Update evaluation documentation (preliminary first pass)

* Move details about baselines to baselines folder

* Move counterfact details to individual folder

* Add details about evaluation dump format

* Reorganize instructions into own file

* Revert "Reorganize instructions into own file"

This reverts commit 7acc81d.

* Further streamline docs

* Further smooth errors in README

* Add note on cross-platform compatibility

* Fix nasty space bug

* Add clarification to run_id

* Add argparse notes to summarize.py
kmeng01 authored Mar 6, 2022
1 parent 034a18a commit 789fab0
Showing 8 changed files with 124 additions and 34 deletions.
72 changes: 60 additions & 12 deletions README.md
@@ -15,8 +15,10 @@ Feel free to open an issue if you find any problems; we are actively developing
1. [Installation](#installation)
2. [Causal Tracing](#causal-tracing)
3. [Rank-One Model Editing (ROME)](#rank-one-model-editing-rome-1)
4. [CounterFact Dataset](#counterfact)
4. [CounterFact](#counterfact)
5. [Evaluation](#evaluation)
* [Running the Full Evaluation Suite](#running-the-full-evaluation-suite)
* [Integrating New Editing Methods](#integrating-new-editing-methods)
6. [How to Cite](#how-to-cite)

## Installation
@@ -72,18 +72,68 @@ Several similar examples are included in the notebook.

## CounterFact

Description coming soon!
Details coming soon!

## Evaluation

### Paper Baselines

We compare ROME against several state-of-the-art model editors. All are implemented in [baselines/](baselines) in their respective folders. Implementations are not our own; they are adapted slightly to plug into our evaluation system.
- Knowledge Neurons (KN): Dai et al. [[Code]](https://github.com/EleutherAI/knowledge-neurons) [[Paper]](https://arxiv.org/abs/2104.08696)
- Knowledge Editor (KE): De Cao et al. [[Code]](https://github.com/eric-mitchell/mend) [[Paper]](https://arxiv.org/abs/2104.08164)
- Model Editor Networks with Gradient Decomposition (MEND): Mitchell et al. [[Code]](https://github.com/eric-mitchell/mend) [[Paper]](https://arxiv.org/abs/2110.11309)
See [`baselines/`](baselines/) for a description of the available baselines.

### Running the Full Evaluation Suite

[`experiments/evaluate.py`](experiments/evaluate.py) can be used to evaluate any method in [`baselines/`](baselines/).
To get started (e.g. using ROME on GPT-2 XL), run:
```bash
python3 -m experiments.evaluate \
--alg_name=ROME \
--model_name=gpt2-xl \
--hparams_fname=gpt2-xl.json
```

Results from each run are stored at `results/<method_name>/run_<run_id>` in a specific format:
```bash
results/
|__ ROME/
|__ run_<run_id>/
|__ params.json
|__ case_0.json
|__ case_1.json
|__ ...
|__ case_10000.json
```
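
For programmatic inspection of a run, the dump can be read with standard JSON tooling. The minimal sketch below assumes only the directory layout shown above; the fields inside `params.json` and the `case_*.json` files are not documented here, and `run_000` is a stand-in for your actual `run_<run_id>` directory.
```python
import json
from pathlib import Path

run_dir = Path("results/ROME/run_000")  # stand-in for run_<run_id>; substitute your actual run directory

# Hyperparameters and settings recorded for the run
params = json.loads((run_dir / "params.json").read_text())

# One evaluation dump per CounterFact record, ordered by case number
cases = [
    json.loads(p.read_text())
    for p in sorted(run_dir.glob("case_*.json"), key=lambda p: int(p.stem.split("_")[1]))
]
print(f"Loaded {len(cases)} cases; run hyperparameters: {params}")
```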

To summarize the results, you can use [`experiments/summarize.py`](experiments/summarize.py):
```bash
python3 -m experiments.summarize --dir_name=ROME --runs=run_<run_id>
```

Running `python3 -m experiments.evaluate -h` or `python3 -m experiments.summarize -h` provides details about command-line flags.

### Integrating New Editing Methods

<!-- Say you have a new method `X` and want to benchmark it on CounterFact. Here's a checklist for evaluating `X`:
- The public method that evaluates a model on each CounterFact record is [`compute_rewrite_quality`](experiments/py/eval_utils.py); see [the source code](experiments/py/eval_utils.py) for details.
- In your evaluation script, you should call `compute_rewrite_quality` once with an unedited model and once with a model that has been edited with `X`. Each time, the function returns a dictionary. -->

Say you have a new method `X` and want to benchmark it on CounterFact. To integrate `X` with our runner (a minimal sketch of the pieces follows this checklist):
- Subclass [`HyperParams`](util/hparams.py) into `XHyperParams` and specify all hyperparameter fields. See [`ROMEHyperParams`](rome/rome_hparams.py) for an example implementation.
- Create a hyperparameters file at `hparams/X/gpt2-xl.json` and specify some default values. See [`hparams/ROME/gpt2-xl.json`](hparams/ROME/gpt2-xl.json) for an example.
- Define a function `apply_X_to_model` which accepts several parameters and returns (i) the rewritten model and (ii) the original weight values for parameters that were edited (in the dictionary format `{weight_name: original_weight_value}`). See [`rome/rome_main.py`](rome/rome_main.py) for an example.
- Add `X` to `ALG_DICT` in [`experiments/evaluate.py`](experiments/evaluate.py) by inserting the line `"X": (XHyperParams, apply_X_to_model)`.
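
Below is a minimal sketch of the pieces in the checklist. The names `XHyperParams` and `apply_X_to_model` come from the checklist above; the hyperparameter fields, the exact function signature, and the editing logic itself are illustrative assumptions rather than code from this repository.
```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from util.hparams import HyperParams


@dataclass
class XHyperParams(HyperParams):
    # Illustrative fields; mirror whatever you put in hparams/X/gpt2-xl.json.
    layers: List[int]
    lr: float
    num_steps: int


def apply_X_to_model(
    model: AutoModelForCausalLM,
    tok: AutoTokenizer,
    request: Dict,
    hparams: XHyperParams,
    copy: bool = False,  # whether to edit a copied model (unused in this sketch)
    return_orig_weights: bool = True,
) -> Tuple[AutoModelForCausalLM, Dict[str, torch.Tensor]]:
    """Edit `model` per `request`; return it along with pre-edit copies of changed weights."""
    weights_copy = {}
    for name, param in model.named_parameters():
        # Pretend X rewrites the MLP weights of the layers named in the hyperparameters.
        if any(f".{layer}.mlp" in name for layer in hparams.layers):
            if return_orig_weights:
                weights_copy[name] = param.detach().clone()
            # ... compute and apply X's weight update to `param` here ...
    return model, weights_copy
```
Backing up each weight before touching it yields the `{weight_name: original_weight_value}` dictionary the checklist asks for, so the edit can be undone later if needed.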

Finally, run the main scripts:
```bash
python3 -m experiments.evaluate \
--alg_name=X \
--model_name=gpt2-xl \
--hparams_fname=gpt2-xl.json

python3 -m experiments.summarize --dir_name=X --runs=run_<run_id>
```

### Note on Cross-Platform Compatibility

We currently support only methods that edit autoregressive HuggingFace models using the PyTorch backend. We are working on a set of general-purpose methods (usable with, e.g., TensorFlow and without HuggingFace) that will be released soon.

<!--
Each method is customizable through a set of hyperparameters. For ROME, they are defined in `rome/hparams.py`. At runtime, you must specify a configuration of hyperparams through a `.json` file located in `hparams/<method_name>`. Check out [`hparams/ROME/default.json`](hparams/ROME/default.json) for an example.
@@ -92,15 +144,11 @@ At runtime, you must specify two command-line arguments: the method name, and the hyperparameters filename.
python3 -m experiments.evaluate --alg_name=ROME --hparams_fname=default.json
```
Results from each run are stored in a directory of the form `results/<method_name>/run_<run_id>`.
Running the following command will yield `dict` run summaries:
```bash
python3 -m experiments/summarize --alg_name=ROME --run_name=run_001
``` -->

Description coming soon!

## How to Cite

```bibtex
6 changes: 6 additions & 0 deletions baselines/README.md
@@ -0,0 +1,6 @@
We compare ROME against several open-source, state-of-the-art model editors. All are implemented in their respective folders. Implementations other than FT/FT+L are adapted from third-party code.
- Fine-Tuning (`ft`): Direct fine-tuning.
- Constrained Fine-Tuning (`ft`): FT with an $L_\infty$ norm constraint. Inspired by Zhu et al. [[Paper]](https://arxiv.org/abs/2012.00363)
- Knowledge Neurons (`kn`): Dai et al. [[Code]](https://github.com/EleutherAI/knowledge-neurons) [[Paper]](https://arxiv.org/abs/2104.08696)
- Knowledge Editor (`efk`): De Cao et al. [[Code]](https://github.com/eric-mitchell/mend) [[Paper]](https://arxiv.org/abs/2104.08164)
- Model Editor Networks with Gradient Decomposition (`mend`): Mitchell et al. [[Code]](https://github.com/eric-mitchell/mend) [[Paper]](https://arxiv.org/abs/2110.11309)
5 changes: 0 additions & 5 deletions counterfact/README.md

This file was deleted.

45 changes: 38 additions & 7 deletions experiments/evaluate.py
@@ -143,19 +143,50 @@ def main(

parser = argparse.ArgumentParser()
parser.add_argument(
    "--alg_name", choices=["ROME", "FT", "KN", "MEND", "KE"], default="ROME"
    "--alg_name",
    choices=["ROME", "FT", "KN", "MEND", "KE"],
    default="ROME",
    help="Editing algorithm to use. Results are saved in results/<alg_name>/<run_id>, "
    "where a new run_id is generated on each run. "
    "If continuing from previous run, specify the run_id in --continue_from_run.",
)
parser.add_argument(
    "--model_name", choices=["gpt2-xl", "EleutherAI/gpt-j-6B"], default="gpt2-xl"
    "--model_name",
    choices=["gpt2-xl", "EleutherAI/gpt-j-6B"],
    default="gpt2-xl",
    help="Model to edit.",
)
parser.add_argument("--hparams_fname", type=str, default="gpt2-xl.json")
parser.add_argument("--continue_from_run", type=str, default=None)
parser.add_argument("--dataset_size_limit", type=int, default=10000)
parser.add_argument(
    "--skip_generation_tests", dest="skip_generation_tests", action="store_true"
    "--hparams_fname",
    type=str,
    default="gpt2-xl.json",
    help="Name of hyperparameters file, located in the hparams/<alg_name> folder.",
)
parser.add_argument(
    "--conserve_memory", dest="conserve_memory", action="store_true"
    "--continue_from_run",
    type=str,
    default=None,
    help="If continuing from previous run, set to run_id. Otherwise, leave as None.",
)
parser.add_argument(
    "--dataset_size_limit",
    type=int,
    default=10000,
    help="Truncate CounterFact to first n records.",
)
parser.add_argument(
    "--skip_generation_tests",
    dest="skip_generation_tests",
    action="store_true",
    help="Only run fast probability-based tests without slow generation tests. "
    "Useful for quick debugging and hyperparameter sweeps.",
)
parser.add_argument(
    "--conserve_memory",
    dest="conserve_memory",
    action="store_true",
    help="Reduce memory usage during evaluation at the cost of a minor slowdown. "
    "Backs up model weights on CPU instead of GPU.",
)
parser.set_defaults(skip_generation_tests=False, conserve_memory=False)
args = parser.parse_args()
20 changes: 17 additions & 3 deletions experiments/summarize.py
@@ -113,9 +113,23 @@ def main(
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dir_name", type=str)
parser.add_argument("--runs", type=str, default=None)
parser.add_argument("--first_n_cases", type=int, default=None)
parser.add_argument(
    "--dir_name", type=str, help="Name of directory to scan for runs."
)
parser.add_argument(
    "--runs",
    type=str,
    default=None,
    help="By default, summarizes each run in <dir_name>. "
    "If runs are specified, only evaluates those specific runs.",
)
parser.add_argument(
    "--first_n_cases",
    type=int,
    default=None,
    help="Restricts evaluation to first n cases in dataset. "
    "Useful for comparing different in-progress runs on the same slice of data.",
)
args = parser.parse_args()

main(
4 changes: 1 addition & 3 deletions rome/compute_v.py
@@ -41,9 +41,7 @@ def compute_v(
target_ids = tok(request["target_new"]["str"])["input_ids"]
if len(target_ids) > 1:
    print("-----------")
    print(
        "Warning: target is not a single token. "
    )
    print("Warning: target is not a single token. ")
    print("-----------")

# Compute rewriting inputs and outputs
4 changes: 1 addition & 3 deletions rome/layer_stats.py
@@ -111,9 +111,7 @@ def get_ds():
model_name = model.config._name_or_path.replace("/", "_")

stats_dir = Path(stats_dir)
file_extension = (
    f"{model_name}/{ds_name}_stats/{layer_name}_{precision}_{'-'.join(sorted(to_collect))}{size_suffix}.npz"
)
file_extension = f"{model_name}/{ds_name}_stats/{layer_name}_{precision}_{'-'.join(sorted(to_collect))}{size_suffix}.npz"
filename = stats_dir / file_extension

if not filename.exists():
2 changes: 1 addition & 1 deletion rome/rome_main.py
@@ -23,7 +23,7 @@ def apply_rome_to_model(
:param copy: If true, will preserve the original model while creating a new one to edit.
Note that you are responsible for deallocating the new model's memory to avoid leaks.
:return: (1) the updated model, (2) the weights that changed
:return: (1) the updated model, (2) an original copy of the weights that changed
"""

deltas = execute_rome(model, tok, request, hparams)
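
The docstring change above makes explicit that `apply_rome_to_model` also returns pre-edit copies of the weights it modified. Below is a minimal sketch of how those copies could be used to undo an edit, assuming the `{weight_name: original_weight_value}` format described in the README; the helper name `restore_weights` is illustrative, not part of the repository.
```python
import torch


def restore_weights(model, orig_weights):
    """Copy saved pre-edit values back into the corresponding parameters."""
    params = dict(model.named_parameters())
    with torch.no_grad():
        for name, original in orig_weights.items():
            params[name].copy_(original.to(params[name].device))
```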
