📝 Update LLaMa docs (#415)
## Describe your changes
Update the LLaMA docs to describe the different optimization configs for the
various model sizes.

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Format your code by running `pre-commit run --all-files`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.

## (Optional) Issue link
trajepl authored Jul 15, 2023
1 parent fe6e3da commit 7a6de53
Showing 2 changed files with 50 additions and 2 deletions.
48 changes: 48 additions & 0 deletions examples/open_llama/README.md
@@ -7,6 +7,54 @@ This workflow also demonstrates how to use:

This example config file [open_llama_config.json](open_llama_config.json) is meant to be a starting point for optimizing Open LLaMA for target hardware. One can add additional passes as well as set different options for the Transformer Optimization pass as needed. See the [Olive documentation](https://microsoft.github.io/Olive/) for more information on the available optimization passes.

Note that this example config uses [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b) for demonstration purposes. Other models from the [Open LLaMA](https://huggingface.co/openlm-research) family can be optimized the same way. The following table shows the configurations of a few of them:

| Model | Num Hidden Layers| Num Attention Heads | Hidden Size |
| --- | --- | --- | --- |
| [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b) | 26 | 32 | 3200 |
| [openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b) | 32 | 32 | 4096 |
| [openlm-research/open_llama_13b](https://huggingface.co/openlm-research/open_llama_13b) | 40 | 40 | 5120 |
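
These values can also be read programmatically from each checkpoint's Hugging Face configuration. Below is a minimal sketch, assuming the `transformers` package is installed and the Hugging Face Hub is reachable:

```python
from transformers import AutoConfig

# Fetch the model configuration for the checkpoint you plan to optimize.
config = AutoConfig.from_pretrained("openlm-research/open_llama_3b")

# These are the values reported in the table above for each Open LLaMA variant.
print("num_hidden_layers:  ", config.num_hidden_layers)    # 26 for open_llama_3b
print("num_attention_heads:", config.num_attention_heads)  # 32 for open_llama_3b
print("hidden_size:        ", config.hidden_size)          # 3200 for open_llama_3b
```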


When you run the example config for other, larger models, you may need to:
1. change the `model_path` to the model you use:
```json
"input_model":{
"type": "OptimumModel",
"config": {
"model_path": "openlm-research/open_llama_3b", // to change based on the model you use
"model_components": ["decoder_model.onnx", "decoder_with_past_model.onnx"],
"hf_config": {
"model_class": "LlamaForCausalLM"
}
}
}
```
2. change the transformer optimization pass options in `open_llama_config.json` based on the above table:
```json
"optimize": {
"type": "OrtTransformersOptimization",
"config": {
"model_type": "gpt2",
"float16": true,
"use_gpu": false,
"keep_io_types": true,
"num_heads": 32, // to change based on the model you use
"hidden_size": 4096, // to change based on the model you use
"optimization_options": {
"use_multi_head_attention": false
}
}
}
```
3. increase `num_hidden_layers` for the dummy inputs in `user_script.py` (a usage sketch follows this list):
```python
# increase num_hidden_layers so the dummy inputs match the model's layer count
def dummy_inputs(batch_size, torch_dtype, model_framework=Framework.PYTORCH, num_hidden_layers=26):
    past_sequence_length = 1
    attention_mask_sequence_length = 1
```
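
As a usage sketch for step 3, the snippet below calls `dummy_inputs` from `user_script.py` with `num_hidden_layers=32`, the value for `open_llama_7b` in the table above. It assumes it is run from `examples/open_llama` with Olive installed; the `Framework` import path mirrors the one used by the example script.

```python
import torch
from olive.constants import Framework  # enum used by user_script.py

from user_script import dummy_inputs

# Build ONNX dummy inputs for open_llama_7b, which has 32 hidden layers
# (see the table above); the open_llama_3b default is num_hidden_layers=26.
inputs = dummy_inputs(
    batch_size=1,
    torch_dtype=torch.float32,
    model_framework=Framework.ONNX,
    num_hidden_layers=32,
)
```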

### Run sample using config

The optimization techniques to run are specified in the relevant config json file.
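
The config can also be executed from Python; a minimal sketch, assuming the `olive-ai` package is installed and the script is run from `examples/open_llama`:

```python
# Run the Olive workflow described by the example config.
from olive.workflows import run as olive_run

olive_run("open_llama_config.json")
```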
4 changes: 2 additions & 2 deletions examples/open_llama/user_script.py
@@ -19,7 +19,7 @@ def __getitem__(self, idx):
        return self.create_input_func(self.batch_size, self.torch_dtype, self.model_framework), label


-def dummy_inputs(batch_size, torch_dtype, model_framework=Framework.PYTORCH):
+def dummy_inputs(batch_size, torch_dtype, model_framework=Framework.PYTORCH, num_hidden_layers=26):
    past_sequence_length = 1
    attention_mask_sequence_length = 1
    sequence_length = 2
@@ -30,7 +30,7 @@ def dummy_inputs(batch_size, torch_dtype, model_framework=Framework.PYTORCH):
    }

    if model_framework == Framework.ONNX:
-        for layer_index in range(26):
+        for layer_index in range(num_hidden_layers):
            inputs[f"past_key_values.{layer_index}.key"] = torch.rand(
                (batch_size, 32, past_sequence_length, 100), dtype=torch_dtype
            )
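
For reference, the hard-coded `32` and `100` in the tensor shape above are the attention head count and per-head size (`hidden_size / num_heads = 3200 / 32 = 100`) of `open_llama_3b`. The sketch below only illustrates that arithmetic using the table in the README; the helper function is hypothetical:

```python
# Hypothetical helper: derive the past key/value tensor shape for an
# Open LLaMA variant from the values in examples/open_llama/README.md.
MODEL_DIMS = {
    "openlm-research/open_llama_3b": {"num_heads": 32, "hidden_size": 3200},
    "openlm-research/open_llama_7b": {"num_heads": 32, "hidden_size": 4096},
    "openlm-research/open_llama_13b": {"num_heads": 40, "hidden_size": 5120},
}

def past_kv_shape(model_id, batch_size=1, past_sequence_length=1):
    dims = MODEL_DIMS[model_id]
    head_size = dims["hidden_size"] // dims["num_heads"]
    return (batch_size, dims["num_heads"], past_sequence_length, head_size)

print(past_kv_shape("openlm-research/open_llama_3b"))  # (1, 32, 1, 100), as in the code above
print(past_kv_shape("openlm-research/open_llama_7b"))  # (1, 32, 1, 128)
```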
