oobabooga · oobabooga · Feb 6, 2024 · Feb 1, 2024 · Feb 4, 2024 · Feb 4, 2024
diff --git a/docs/03 - Parameters Tab.md b/docs/03 - Parameters Tab.md
@@ -55,9 +55,10 @@ For more information about the parameters, the [transformers documentation](http
 * **mirostat_tau**: No idea, see the paper for details. According to the Preset Arena, 8 is a good value. 
 * **mirostat_eta**: No idea, see the paper for details. According to the Preset Arena, 0.1 is a good value.
 * **dynamic_temperature**: Activates Dynamic Temperature. This modifies temperature to range between "dynatemp_low" (minimum) and "dynatemp_high" (maximum), with an entropy-based scaling. The steepness of the curve is controlled by "dynatemp_exponent".
-* **temperature_last**: Makes temperature the last sampler instead of the first. With this, you can remove low probability tokens with a sampler like min_p and then use a high temperature to make the model creative without losing coherency.
+* **smoothing_factor**: Activates Quadratic Sampling. When `0 < smoothing_factor < 1`, the logits distribution becomes flatter. When `smoothing_factor > 1`, it becomes more peaked.
+* **temperature_last**: Makes temperature the last sampler instead of the first. With this, you can remove low probability tokens with a sampler like min_p and then use a high temperature to make the model creative without losing coherency. Note: this parameter takes precedence over "Sampler priority". That means that `temperature`/`dynamic_temperature`/`quadratic_sampling` will be removed from wherever they are and moved to the end of the stack.
 * **do_sample**: When unchecked, sampling is entirely disabled, and greedy decoding is used instead (the most likely token is always picked).
-* **Seed**: Set the Pytorch seed to this number. Note that some loaders do not use Pytorch (notably llama.cpp), and others are not deterministic (notably ExLlama v1 and v2). For these loaders, the seed has no effect.
+* **Seed**: Set the PyTorch seed to this number. Note that some loaders do not use PyTorch (notably llama.cpp), and others are not deterministic (ExLlamaV2). For these loaders, the seed has no effect.
 * **encoder_repetition_penalty**: Also known as the "Hallucinations filter". Used to penalize tokens that are *not* in the prior text. Higher value = more likely to stay in context, lower value = more likely to diverge.
 * **no_repeat_ngram_size**: If not set to 0, specifies the length of token sets that are completely blocked from repeating at all. Higher values = blocks larger phrases, lower values = blocks words or letters from repeating. Only 0 or high values are a good idea in most cases.
 * **min_length**: Minimum generation length in tokens. This is a built-in parameter in the transformers library that has never been very useful. Typically you want to check "Ban the eos_token" instead.
@@ -76,6 +77,7 @@ To the right (or below if you are on mobile), the following parameters are prese
 * **Add the bos_token to the beginning of prompts**: By default, the tokenizer will add a BOS (Beginning of Sequence) token to your prompt. During training, BOS tokens are used to separate different documents. If unchecked, no BOS token will be added, and the model will interpret your prompt as being in the middle of a document instead of at the start of one. This significantly changes the output and can make it more creative.
 * **Skip special tokens**: When decoding the generated tokens, skip special tokens from being converted to their text representation. Otherwise, BOS appears as `<s>`, EOS as `</s>`, etc.
 * **Activate text streaming**: When unchecked, the full response is outputted at once, without streaming the words one at a time. I recommend unchecking this parameter on high latency networks like running the webui on Google Colab or using `--share`.
+* **Sampler priority**: Allows you to customize the order in which the different samplers are applied. The first sampler on the list gets applied first. With this, custom orders like `top_p -> temperature -> top_k` can be defined.
 * **Load grammar from file**: Loads a GBNF grammar from a file under `text-generation-webui/grammars`. The output is written to the "Grammar" box below. You can also save and delete custom grammars using this menu.
 * **Grammar**: Allows you to constrain the model output to a particular format. For instance, you can make the model generate lists, JSON, specific words, etc. Grammar is extremely powerful and I highly recommend it. The syntax looks a bit daunting at first sight, but it gets very easy once you understand it. See the [GBNF Guide](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md) for details.
 

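The interaction between "Sampler priority" and `temperature_last` described in the docs above can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper name `resolve_sampler_order` is hypothetical, and keeping the moved samplers in their original relative order is an assumption.

```python
# Hypothetical helper, for illustration only.
# The three names below are the ones the docs say temperature_last relocates.
TEMPERATURE_SAMPLERS = ["temperature", "dynamic_temperature", "quadratic_sampling"]


def resolve_sampler_order(sampler_priority, temperature_last=False):
    """Return the order in which samplers would be applied.

    The first item in the returned list is applied first.
    """
    order = list(sampler_priority)
    if temperature_last:
        # Assumption: moved samplers keep their relative order at the end.
        moved = [name for name in order if name in TEMPERATURE_SAMPLERS]
        kept = [name for name in order if name not in TEMPERATURE_SAMPLERS]
        order = kept + moved
    return order


custom = ["top_p", "temperature", "top_k"]
print(resolve_sampler_order(custom))
print(resolve_sampler_order(custom, temperature_last=True))
```

With `temperature_last` enabled, the custom position of `temperature` is overridden, which is what "takes precedence over Sampler priority" means in the note.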
diff --git a/docs/04 - Model Tab.md b/docs/04 - Model Tab.md
@@ -42,7 +42,7 @@ Examples:
 * https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ
 
 * **gpu-split**: If you have multiple GPUs, the amount of memory to allocate per GPU should be set in this field. Make sure to set a lower value for the first GPU, as that's where the cache is allocated.
-* **max_seq_len**: The maximum sequence length for the model. In ExLlama, the cache is preallocated, so the higher this value, the higher the VRAM. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
+* **max_seq_len**: The maximum sequence length for the model. In ExLlamaV2, the cache is preallocated, so the higher this value, the higher the VRAM usage. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
 * **cfg-cache**: Creates a second cache to hold the CFG negative prompts. You need to set this if and only if you intend to use CFG in the "Parameters" > "Generation" tab. Checking this parameter doubles the cache VRAM usage.
 * **no_flash_attn**: Disables flash attention. Otherwise, it is automatically used as long as the library is installed.
 * **cache_8bit**: Create a 8-bit precision cache instead of a 16-bit one. This saves VRAM but increases perplexity (I don't know by how much).
@@ -57,7 +57,7 @@ Loads: GPTQ models.
 
 * **wbits**: For ancient models without proper metadata, sets the model precision in bits manually. Can usually be ignored.
 * **groupsize**: For ancient models without proper metadata, sets the model group size manually. Can usually be ignored.
-* **triton**: Only available on Linux. Necessary to use models with both act-order and groupsize simultaneously. Note that ExLlama can load these same models on Windows without triton.
+* **triton**: Only available on Linux. Necessary to use models with both act-order and groupsize simultaneously. Note that ExLlamaV2 can load these same models on Windows without triton.
 * **no_inject_fused_attention**: Improves performance while increasing the VRAM usage.
 * **no_inject_fused_mlp**: Similar to the previous parameter but for Triton only.
 * **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.
@@ -67,7 +67,7 @@ Loads: GPTQ models.
 
 Loads: GPTQ models.
 
-Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlama and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
+Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlamaV2 and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
 
 * **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.
 

diff --git a/docs/What Works.md b/docs/What Works.md
@@ -2,15 +2,17 @@
 
 | Loader         | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
 |----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
-| Transformers   |       ✅       |           ✅***            |       ✅*       |          ✅          |           ✅          |
+| Transformers   |       ✅       |           ✅\*\*\*      |       ✅\*     |          ✅          |           ✅          |
+| llama.cpp      |       ❌       |           ❌            |       ❌       |          ❌          |    use llamacpp_HF    |
+| llamacpp_HF    |       ❌       |           ❌            |       ❌       |          ❌          |           ✅          |
 | ExLlamav2_HF   |       ✅       |           ✅            |       ❌       |          ❌          |           ✅          |
-| ExLlamav2      |       ✅       |           ✅            |       ❌       |          ❌          |           use ExLlamav2_HF    |
+| ExLlamav2      |       ✅       |           ✅            |       ❌       |          ❌          |   use ExLlamav2_HF    |
 | AutoGPTQ       |       ✅       |           ❌            |       ❌       |          ✅          |           ✅          |
-| GPTQ-for-LLaMa |       ✅**       |           ✅***            |       ✅       |          ✅          |           ✅          |
-| llama.cpp      |       ❌       |           ❌            |       ❌       |          ❌          |           use llamacpp_HF    |
-| llamacpp_HF    |       ❌       |           ❌            |       ❌       |          ❌          |           ✅          |
+| AutoAWQ        |       ?        |           ❌            |       ?        |          ?           |           ✅          |
+| GPTQ-for-LLaMa |       ✅\*\*   |           ✅\*\*\*      |       ✅       |          ✅          |           ✅          |
 | ctransformers  |       ❌       |           ❌            |       ❌       |          ❌          |           ❌          |
-| AutoAWQ        |       ?        |           ❌            |       ?       |          ?          |           ✅          |
+| QuIP#          |       ?        |           ?             |       ?        |          ?           |           ✅          |
+| HQQ            |       ?        |           ?             |       ?        |          ?           |           ✅          |
 
 ❌ = not implemented
 

diff --git a/extensions/openai/typing.py b/extensions/openai/typing.py
@@ -12,6 +12,7 @@ class GenerationOptions(BaseModel):
     dynatemp_low: float = 1
     dynatemp_high: float = 1
     dynatemp_exponent: float = 1
+    smoothing_factor: float = 0
     top_k: int = 0
     repetition_penalty: float = 1
     repetition_penalty_range: int = 1024
@@ -39,6 +40,7 @@ class GenerationOptions(BaseModel):
     max_tokens_second: int = 0
     prompt_lookup_num_tokens: int = 0
     custom_token_bans: str = ""
+    sampler_priority: List[str] | str | None = Field(default=None, description="List of samplers where the first items will appear first in the stack. Example: [\"top_k\", \"temperature\", \"top_p\"].")
     auto_max_new_tokens: bool = False
     ban_eos_token: bool = False
     add_bos_token: bool = True

diff --git a/instruction-templates/ChatML.yaml b/instruction-templates/ChatML.yaml
@@ -5,15 +5,12 @@ instruction_template: |-
           {%- set ns.found = true -%}
       {%- endif -%}
   {%- endfor -%}
-  {%- if not ns.found -%}
-      {{- '<|im_start|>system\n' + '' + '<|im_end|>\n' -}}
-  {%- endif %}
   {%- for message in messages %}
       {%- if message['role'] == 'system' -%}
-          {{- '<|im_start|>system\n' + message['content'] + '<|im_end|>\n' -}}
+          {{- '<|im_start|>system\n' + message['content'].rstrip() + '<|im_end|>\n' -}}
       {%- else -%}
           {%- if message['role'] == 'user' -%}
-              {{-'<|im_start|>user\n' + message['content'] + '<|im_end|>\n'-}}
+              {{-'<|im_start|>user\n' + message['content'].rstrip() + '<|im_end|>\n'-}}
           {%- else -%}
               {{-'<|im_start|>assistant\n' + message['content'] + '<|im_end|>\n' -}}
           {%- endif -%}

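The effect of the template change above can be approximated in plain Python. This is a rough sketch covering only the message loop shown in the hunk: `render_chatml` is a hypothetical name, and anything the template does outside this hunk is omitted.

```python
def render_chatml(messages):
    """Approximate the message loop of the ChatML template after this change."""
    parts = []
    for message in messages:
        role = message["role"]
        content = message["content"]
        if role == "system":
            # System content is now right-stripped.
            parts.append("<|im_start|>system\n" + content.rstrip() + "<|im_end|>\n")
        elif role == "user":
            # User content is now right-stripped as well.
            parts.append("<|im_start|>user\n" + content.rstrip() + "<|im_end|>\n")
        else:
            # Any other role is rendered as an assistant turn, unchanged.
            parts.append("<|im_start|>assistant\n" + content + "<|im_end|>\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Hello  \n"}]))
```

Note that, with the removed `if not ns.found` block gone, no empty system turn is emitted when the conversation has no system message.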
diff --git a/modules/LoRA.py b/modules/LoRA.py
@@ -12,7 +12,7 @@
 def add_lora_to_model(lora_names):
     if 'GPTQForCausalLM' in shared.model.__class__.__name__ or shared.args.loader == 'AutoGPTQ':
         add_lora_autogptq(lora_names)
-    elif shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav2HF'] or shared.args.loader == ['ExLlamav2', 'ExLlamav2_HF']:
+    elif shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav2HF'] or shared.args.loader in ['ExLlamav2', 'ExLlamav2_HF']:
         add_lora_exllamav2(lora_names)
     else:
         add_lora_transformers(lora_names)

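The one-word fix above matters because comparing a string to a list with `==` is always false in Python, so the old condition could never match on the loader name. A small standalone check (the variable names are illustrative):

```python
loader = "ExLlamav2"
exllamav2_loaders = ["ExLlamav2", "ExLlamav2_HF"]

# Old condition: a str is never equal to a list, so this is always False.
matches_with_eq = loader == exllamav2_loaders

# New condition: membership test, True for either loader name.
matches_with_in = loader in exllamav2_loaders

print(matches_with_eq, matches_with_in)
```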
diff --git a/modules/chat.py b/modules/chat.py
@@ -166,18 +166,53 @@ def make_prompt(messages):
         prompt = remove_extra_bos(prompt)
         return prompt
 
-    prompt = make_prompt(messages)
-
     # Handle truncation
     max_length = get_max_prompt_length(state)
-    while len(messages) > 0 and get_encoded_length(prompt) > max_length:
-        # Try to save the system message
-        if len(messages) > 1 and messages[0]['role'] == 'system':
+    prompt = make_prompt(messages)
+    encoded_length = get_encoded_length(prompt)
+
+    while len(messages) > 0 and encoded_length > max_length:
+
+        # Remove old message, save system message
+        if len(messages) > 2 and messages[0]['role'] == 'system':
             messages.pop(1)
-        else:
+
+        # Remove old message when no system message is present
+        elif len(messages) > 1 and messages[0]['role'] != 'system':
             messages.pop(0)
 
+        # Resort to truncating the user input
+        else:
+
+            user_message = messages[-1]['content']
+
+            # Bisect the truncation point
+            left, right = 0, len(user_message) - 1
+
+            while right - left > 1:
+                mid = (left + right) // 2
+
+                messages[-1]['content'] = user_message[mid:]
+                prompt = make_prompt(messages)
+                encoded_length = get_encoded_length(prompt)
+
+                if encoded_length <= max_length:
+                    right = mid
+                else:
+                    left = mid
+
+            messages[-1]['content'] = user_message[right:]
+            prompt = make_prompt(messages)
+            encoded_length = get_encoded_length(prompt)
+            if encoded_length > max_length:
+                logger.error(f"Failed to build the chat prompt. The input is too long for the available context length.\n\nTruncation length: {state['truncation_length']}\nmax_new_tokens: {state['max_new_tokens']} (is it too high?)\nAvailable context length: {max_length}\n")
+                raise ValueError
+            else:
+                logger.warning(f"The input has been truncated. Context length: {state['truncation_length']}, max_new_tokens: {state['max_new_tokens']}, available context length: {max_length}.")
+                break
+
         prompt = make_prompt(messages)
+        encoded_length = get_encoded_length(prompt)
 
     if also_return_rows:
         return prompt, [message['content'] for message in messages]

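The bisection used above to truncate the last user message can be isolated into a standalone sketch. The function name `truncate_left` and the `count_tokens` callback are hypothetical stand-ins for `make_prompt` plus `get_encoded_length`; the sketch assumes, as the calling code does, that the full text is already known not to fit.

```python
def truncate_left(text, max_tokens, count_tokens):
    """Drop a prefix of `text` so that the remaining suffix fits in max_tokens.

    Mirrors the loop in the diff: `left` is a cut point that is still too
    long, `right` is a cut point expected to fit.
    """
    left, right = 0, len(text) - 1
    while right - left > 1:
        mid = (left + right) // 2
        if count_tokens(text[mid:]) <= max_tokens:
            right = mid  # The suffix fits; try keeping more text.
        else:
            left = mid  # Still too long; cut further to the right.

    result = text[right:]
    if count_tokens(result) > max_tokens:
        # Even the shortest candidate suffix does not fit.
        raise ValueError("The input is too long for the available context length.")
    return result


# Toy token counter for the example: one "token" per character.
print(truncate_left("abcdefghij", 4, len))
```

The bisection needs about log2(len(text)) prompt rebuilds instead of one per removed character, at the cost of cutting at character boundaries that may fall in the middle of a token.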
diff --git a/modules/llamacpp_hf.py b/modules/llamacpp_hf.py
@@ -216,7 +216,8 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
             'tensor_split': tensor_split_list,
             'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
             'logits_all': shared.args.logits_all,
-            'offload_kqv': not shared.args.no_offload_kqv
+            'offload_kqv': not shared.args.no_offload_kqv,
+            'split_mode': 1 if not shared.args.row_split else 2
         }
 
         Llama = llama_cpp_lib().Llama

diff --git a/modules/llamacpp_model.py b/modules/llamacpp_model.py
@@ -95,7 +95,8 @@ def from_pretrained(self, path):
             'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
             'tensor_split': tensor_split_list,
             'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
-            'offload_kqv': not shared.args.no_offload_kqv
+            'offload_kqv': not shared.args.no_offload_kqv,
+            'split_mode': 1 if not shared.args.row_split else 2
         }
 
         result.model = Llama(**params)

diff --git a/modules/loaders.py b/modules/loaders.py
@@ -26,7 +26,7 @@
         'compress_pos_emb',
         'disable_exllama',
         'disable_exllamav2',
-        'transformers_info'
+        'transformers_info',
     ],
     'llama.cpp': [
         'n_ctx',
@@ -44,6 +44,7 @@
         'cpu',
         'numa',
         'no_offload_kqv',
+        'row_split',
         'tensorcores',
     ],
     'llamacpp_HF': [
@@ -66,6 +67,7 @@
         'no_use_fast',
         'logits_all',
         'no_offload_kqv',
+        'row_split',
         'tensorcores',
         'llamacpp_HF_info',
     ],
@@ -159,6 +161,7 @@ def transformers_samplers():
         'dynatemp_low',
         'dynatemp_high',
         'dynatemp_exponent',
+        'smoothing_factor',
         'top_p',
         'min_p',
         'top_k',
@@ -189,6 +192,7 @@ def transformers_samplers():
         'negative_prompt',
         'ban_eos_token',
         'custom_token_bans',
+        'sampler_priority',
         'add_bos_token',
         'skip_special_tokens',
         'auto_max_new_tokens',
@@ -233,6 +237,7 @@ def transformers_samplers():
         'dynatemp_low',
         'dynatemp_high',
         'dynatemp_exponent',
+        'smoothing_factor',
         'top_p',
         'min_p',
         'top_k',
@@ -259,6 +264,7 @@ def transformers_samplers():
         'negative_prompt',
         'ban_eos_token',
         'custom_token_bans',
+        'sampler_priority',
         'add_bos_token',
         'skip_special_tokens',
         'auto_max_new_tokens',
@@ -289,6 +295,7 @@ def transformers_samplers():
         'dynatemp_low',
         'dynatemp_high',
         'dynatemp_exponent',
+        'smoothing_factor',
         'top_p',
         'min_p',
         'top_k',
@@ -315,6 +322,7 @@ def transformers_samplers():
         'negative_prompt',
         'ban_eos_token',
         'custom_token_bans',
+        'sampler_priority',
         'add_bos_token',
         'skip_special_tokens',
         'auto_max_new_tokens',

diff --git a/modules/models.py b/modules/models.py
@@ -100,9 +100,9 @@ def load_model(model_name, loader=None):
     elif loader in ['llama.cpp', 'llamacpp_HF', 'ctransformers']:
         shared.settings['truncation_length'] = shared.args.n_ctx
 
-    logger.info(f"LOADER: {loader}")
+    logger.info(f"LOADER: \"{loader}\"")
     logger.info(f"TRUNCATION LENGTH: {shared.settings['truncation_length']}")
-    logger.info(f"INSTRUCTION TEMPLATE: {metadata['instruction_template']}")
+    logger.info(f"INSTRUCTION TEMPLATE: \"{metadata['instruction_template']}\"")
     logger.info(f"Loaded the model in {(time.time()-t0):.2f} seconds.")
     return model, tokenizer
 

diff --git a/modules/presets.py b/modules/presets.py
@@ -17,6 +17,7 @@ def default_preset():
         'dynatemp_low': 1,
         'dynatemp_high': 1,
         'dynatemp_exponent': 1,
+        'smoothing_factor': 0,
         'top_p': 1,
         'min_p': 0,
         'top_k': 0,
@@ -41,6 +42,7 @@ def default_preset():
         'num_beams': 1,
         'length_penalty': 1,
         'early_stopping': False,
+        'sampler_priority': 'temperature\ndynamic_temperature\nquadratic_sampling\ntop_k\ntop_p\ntypical_p\nepsilon_cutoff\neta_cutoff\ntfs\ntop_a\nmin_p\nmirostat'
     }
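
Since `sampler_priority` is typed as `List[str] | str | None` in the API and stored as a newline-separated string in the default preset, a consumer has to normalize it before use. A minimal sketch under stated assumptions: `normalize_sampler_priority` is a hypothetical helper, and treating `None` as an empty list is an assumption rather than the project's documented behavior.

```python
def normalize_sampler_priority(value):
    """Convert the accepted input shapes into a list of sampler names."""
    if value is None:
        # Assumption: None means "no custom order was given".
        return []
    if isinstance(value, str):
        # The default preset stores one sampler name per line.
        return [name.strip() for name in value.split("\n") if name.strip()]
    return list(value)


default_priority = (
    "temperature\ndynamic_temperature\nquadratic_sampling\ntop_k\ntop_p\n"
    "typical_p\nepsilon_cutoff\neta_cutoff\ntfs\ntop_a\nmin_p\nmirostat"
)
print(normalize_sampler_priority(default_priority))
print(normalize_sampler_priority(["top_k", "temperature", "top_p"]))
```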