Merge dev branch #5452

Merged
merged 19 commits into from
Feb 6, 2024
6 changes: 4 additions & 2 deletions docs/03 - Parameters Tab.md
@@ -55,9 +55,10 @@ For more information about the parameters, the [transformers documentation](http
* **mirostat_tau**: No idea, see the paper for details. According to the Preset Arena, 8 is a good value.
* **mirostat_eta**: No idea, see the paper for details. According to the Preset Arena, 0.1 is a good value.
* **dynamic_temperature**: Activates Dynamic Temperature. This modifies temperature to range between "dynatemp_low" (minimum) and "dynatemp_high" (maximum), with an entropy-based scaling. The steepness of the curve is controlled by "dynatemp_exponent".
* **temperature_last**: Makes temperature the last sampler instead of the first. With this, you can remove low probability tokens with a sampler like min_p and then use a high temperature to make the model creative without losing coherency.
* **smoothing_factor**: Activates Quadratic Sampling. When `0 < smoothing_factor < 1`, the logits distribution becomes flatter. When `smoothing_factor > 1`, it becomes more peaked.
* **temperature_last**: Makes temperature the last sampler instead of the first. With this, you can remove low probability tokens with a sampler like min_p and then use a high temperature to make the model creative without losing coherency. Note: this parameter takes precedence over "Sampler priority". That means that `temperature`/`dynamic_temperature`/`quadratic_sampling` will be removed from wherever they are and moved to the end of the stack.
* **do_sample**: When unchecked, sampling is entirely disabled, and greedy decoding is used instead (the most likely token is always picked).
* **Seed**: Set the PyTorch seed to this number. Note that some loaders do not use PyTorch (notably llama.cpp), and others are not deterministic (notably ExLlama v1 and v2). For these loaders, the seed has no effect.
* **Seed**: Set the PyTorch seed to this number. Note that some loaders do not use PyTorch (notably llama.cpp), and others are not deterministic (ExLlamaV2). For these loaders, the seed has no effect.
* **encoder_repetition_penalty**: Also known as the "Hallucinations filter". Used to penalize tokens that are *not* in the prior text. Higher value = more likely to stay in context, lower value = more likely to diverge.
* **no_repeat_ngram_size**: If not set to 0, specifies the length of token sets that are completely blocked from repeating at all. Higher values = blocks larger phrases, lower values = blocks words or letters from repeating. Only 0 or high values are a good idea in most cases.
* **min_length**: Minimum generation length in tokens. This is a built-in parameter in the transformers library that has never been very useful. Typically you want to check "Ban the eos_token" instead.
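
To make the `smoothing_factor` description above more concrete, here is a minimal sketch of one quadratic transformation that behaves this way for logits close to the maximum. The formula (bending each logit around the largest one) and the helper names `quadratic_smoothing` and `softmax` are assumptions for illustration; the documentation text does not state the exact formula.

```python
import math


def quadratic_smoothing(logits, smoothing_factor):
    # Assumed form: each logit is replaced by a downward parabola
    # centered on the largest logit.
    max_logit = max(logits)
    return [max_logit - smoothing_factor * (x - max_logit) ** 2 for x in logits]


def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 1.0]  # illustrative values, all within 1.0 of the maximum
baseline = softmax(logits)
flatter = softmax(quadratic_smoothing(logits, 0.5))
peaked = softmax(quadratic_smoothing(logits, 2.0))
```

With these illustrative logits, the top token gets less probability mass at `smoothing_factor = 0.5` and more at `smoothing_factor = 2.0` than in the untransformed distribution. Under this assumed formula, logits far below the maximum are pushed down even when the factor is below 1.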
@@ -76,6 +77,7 @@ To the right (or below if you are on mobile), the following parameters are prese
* **Add the bos_token to the beginning of prompts**: By default, the tokenizer will add a BOS (Beginning of Sequence) token to your prompt. During training, BOS tokens are used to separate different documents. If unchecked, no BOS token will be added, and the model will interpret your prompt as being in the middle of a document instead of at the start of one. This significantly changes the output and can make it more creative.
* **Skip special tokens**: When decoding the generated tokens, skip special tokens from being converted to their text representation. Otherwise, BOS appears as `<s>`, EOS as `</s>`, etc.
* **Activate text streaming**: When unchecked, the full response is outputted at once, without streaming the words one at a time. I recommend unchecking this parameter on high latency networks like running the webui on Google Colab or using `--share`.
* **Sampler priority**: Allows you to customize the order in which the different samplers are applied. The first sampler on the list gets applied first. With this, custom orders like `top_p -> temperature -> top_k` can be defined.
* **Load grammar from file**: Loads a GBNF grammar from a file under `text-generation-webui/grammars`. The output is written to the "Grammar" box below. You can also save and delete custom grammars using this menu.
* **Grammar**: Allows you to constrain the model output to a particular format. For instance, you can make the model generate lists, JSON, specific words, etc. Grammar is extremely powerful and I highly recommend it. The syntax looks a bit daunting at first sight, but it gets very easy once you understand it. See the [GBNF Guide](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md) for details.
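
The interaction between "Sampler priority" and `temperature_last` described above can be sketched as a simple reordering. This is only an illustration of the documented behavior: the function name `apply_temperature_last` and the tuple `TEMPERATURE_SAMPLERS` are hypothetical, not names taken from the codebase.

```python
# Samplers that the docs say get moved to the end of the stack.
TEMPERATURE_SAMPLERS = ("temperature", "dynamic_temperature", "quadratic_sampling")


def apply_temperature_last(sampler_priority, temperature_last):
    # Without the flag, the user-defined order is used unchanged.
    if not temperature_last:
        return list(sampler_priority)

    # With the flag, temperature-style samplers are pulled out of their
    # position and appended after everything else, keeping relative order.
    others = [s for s in sampler_priority if s not in TEMPERATURE_SAMPLERS]
    moved = [s for s in sampler_priority if s in TEMPERATURE_SAMPLERS]
    return others + moved


order = apply_temperature_last(["top_p", "temperature", "top_k"], temperature_last=True)
print(order)  # ['top_p', 'top_k', 'temperature']
```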

6 changes: 3 additions & 3 deletions docs/04 - Model Tab.md
@@ -42,7 +42,7 @@ Examples:
* https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ

* **gpu-split**: If you have multiple GPUs, the amount of memory to allocate per GPU should be set in this field. Make sure to set a lower value for the first GPU, as that's where the cache is allocated.
* **max_seq_len**: The maximum sequence length for the model. In ExLlama, the cache is preallocated, so the higher this value, the higher the VRAM usage. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
* **max_seq_len**: The maximum sequence length for the model. In ExLlamaV2, the cache is preallocated, so the higher this value, the higher the VRAM usage. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
* **cfg-cache**: Creates a second cache to hold the CFG negative prompts. You need to set this if and only if you intend to use CFG in the "Parameters" > "Generation" tab. Checking this parameter doubles the cache VRAM usage.
* **no_flash_attn**: Disables flash attention. Otherwise, it is automatically used as long as the library is installed.
* **cache_8bit**: Creates an 8-bit precision cache instead of a 16-bit one. This saves VRAM but increases perplexity (I don't know by how much).
@@ -57,7 +57,7 @@ Loads: GPTQ models.

* **wbits**: For ancient models without proper metadata, sets the model precision in bits manually. Can usually be ignored.
* **groupsize**: For ancient models without proper metadata, sets the model group size manually. Can usually be ignored.
* **triton**: Only available on Linux. Necessary to use models with both act-order and groupsize simultaneously. Note that ExLlama can load these same models on Windows without triton.
* **triton**: Only available on Linux. Necessary to use models with both act-order and groupsize simultaneously. Note that ExLlamaV2 can load these same models on Windows without triton.
* **no_inject_fused_attention**: Improves performance while increasing the VRAM usage.
* **no_inject_fused_mlp**: Similar to the previous parameter but for Triton only.
* **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.
@@ -67,7 +67,7 @@ Loads: GPTQ models.

Loads: GPTQ models.

Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlama and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlamaV2 and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.

* **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.

14 changes: 8 additions & 6 deletions docs/What Works.md
@@ -2,15 +2,17 @@

| Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
|----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
| Transformers | ✅ | ✅*** | ✅* | ✅ | ✅ |
| Transformers | ✅ | ✅\*\*\* | ✅\* | ✅ | ✅ |
| llama.cpp | ❌ | ❌ | ❌ | ❌ | use llamacpp_HF |
| llamacpp_HF | ❌ | ❌ | ❌ | ❌ | ✅ |
| ExLlamav2_HF | ✅ | ✅ | ❌ | ❌ | ✅ |
| ExLlamav2 | ✅ | ✅ | ❌ | ❌ | use ExLlamav2_HF |
| ExLlamav2 | ✅ | ✅ | ❌ | ❌ | use ExLlamav2_HF |
| AutoGPTQ | ✅ | ❌ | ❌ | ✅ | ✅ |
| GPTQ-for-LLaMa | ✅** | ✅*** | ✅ | ✅ | ✅ |
| llama.cpp | ❌ | ❌ | ❌ | ❌ | use llamacpp_HF |
| llamacpp_HF | ❌ | ❌ | ❌ | ❌ | ✅ |
| AutoAWQ | ? | ❌ | ? | ? | ✅ |
| GPTQ-for-LLaMa | ✅\*\* | ✅\*\*\* | ✅ | ✅ | ✅ |
| ctransformers | ❌ | ❌ | ❌ | ❌ | ❌ |
| AutoAWQ | ? | ❌ | ? | ? | ✅ |
| QuIP# | ? | ? | ? | ? | ✅ |
| HQQ | ? | ? | ? | ? | ✅ |

❌ = not implemented

2 changes: 2 additions & 0 deletions extensions/openai/typing.py
@@ -12,6 +12,7 @@ class GenerationOptions(BaseModel):
dynatemp_low: float = 1
dynatemp_high: float = 1
dynatemp_exponent: float = 1
smoothing_factor: float = 0
top_k: int = 0
repetition_penalty: float = 1
repetition_penalty_range: int = 1024
@@ -39,6 +40,7 @@ class GenerationOptions(BaseModel):
max_tokens_second: int = 0
prompt_lookup_num_tokens: int = 0
custom_token_bans: str = ""
sampler_priority: List[str] | str | None = Field(default=None, description="List of samplers where the first items will appear first in the stack. Example: [\"top_k\", \"temperature\", \"top_p\"].")
auto_max_new_tokens: bool = False
ban_eos_token: bool = False
add_bos_token: bool = True
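
Because `sampler_priority` accepts a list, a string, or `None`, a consumer presumably has to normalize it before use. The sketch below shows one way to do that, assuming the string form is newline-separated like the default preset in `modules/presets.py`. The helper name `normalize_sampler_priority` is hypothetical and not part of this change.

```python
from typing import List, Optional, Union


def normalize_sampler_priority(value: Union[List[str], str, None]) -> Optional[List[str]]:
    # None means "no custom order was requested".
    if value is None:
        return None

    # Assumption: the string form lists one sampler per line.
    if isinstance(value, str):
        value = value.split("\n")

    # Drop surrounding whitespace and empty entries.
    return [name.strip() for name in value if name.strip()]


print(normalize_sampler_priority("top_k\ntemperature\ntop_p"))
# ['top_k', 'temperature', 'top_p']
```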
7 changes: 2 additions & 5 deletions instruction-templates/ChatML.yaml
@@ -5,15 +5,12 @@ instruction_template: |-
{%- set ns.found = true -%}
{%- endif -%}
{%- endfor -%}
{%- if not ns.found -%}
{{- '<|im_start|>system\n' + '' + '<|im_end|>\n' -}}
{%- endif %}
{%- for message in messages %}
{%- if message['role'] == 'system' -%}
{{- '<|im_start|>system\n' + message['content'] + '<|im_end|>\n' -}}
{{- '<|im_start|>system\n' + message['content'].rstrip() + '<|im_end|>\n' -}}
{%- else -%}
{%- if message['role'] == 'user' -%}
{{-'<|im_start|>user\n' + message['content'] + '<|im_end|>\n'-}}
{{-'<|im_start|>user\n' + message['content'].rstrip() + '<|im_end|>\n'-}}
{%- else -%}
{{-'<|im_start|>assistant\n' + message['content'] + '<|im_end|>\n' -}}
{%- endif -%}
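
For readers less familiar with Jinja, the message loop in this hunk can be approximated in plain Python. This is a sketch of the visible loop only (the rest of the template is not shown in the diff), and `render_chatml` is a hypothetical name. It mirrors the change: system and user content is stripped of trailing whitespace, assistant content is left untouched.

```python
def render_chatml(messages):
    parts = []
    for message in messages:
        role = message["role"]
        content = message["content"]

        if role in ("system", "user"):
            # New behavior in this hunk: trailing whitespace is removed.
            content = content.rstrip()
        else:
            # The template renders every other role as an assistant turn,
            # with its content unchanged.
            role = "assistant"

        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    return "".join(parts)


rendered = render_chatml([
    {"role": "system", "content": "Be brief.  \n"},
    {"role": "user", "content": "Hi\n"},
    {"role": "assistant", "content": "Hello "},
])
```

Note that no empty system block is emitted when the conversation has no system message, which matches the removal of the `if not ns.found` branch.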
2 changes: 1 addition & 1 deletion modules/LoRA.py
@@ -12,7 +12,7 @@
def add_lora_to_model(lora_names):
    if 'GPTQForCausalLM' in shared.model.__class__.__name__ or shared.args.loader == 'AutoGPTQ':
        add_lora_autogptq(lora_names)
    elif shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav2HF'] or shared.args.loader == ['ExLlamav2', 'ExLlamav2_HF']:
    elif shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav2HF'] or shared.args.loader in ['ExLlamav2', 'ExLlamav2_HF']:
        add_lora_exllamav2(lora_names)
    else:
        add_lora_transformers(lora_names)
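
The one-word change from `==` to `in` matters because comparing a string to a list with `==` is always false in Python, so the loader check on the old line could never succeed. A quick illustration, using stand-in variables rather than `shared.args`:

```python
loader = "ExLlamav2"  # stand-in for shared.args.loader
exllamav2_loaders = ["ExLlamav2", "ExLlamav2_HF"]

# A string never equals a list, so the old comparison was always False.
old_check = loader == exllamav2_loaders

# Membership is what was intended.
new_check = loader in exllamav2_loaders

print(old_check, new_check)  # False True
```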
47 changes: 41 additions & 6 deletions modules/chat.py
@@ -166,18 +166,53 @@ def make_prompt(messages):
        prompt = remove_extra_bos(prompt)
        return prompt

    prompt = make_prompt(messages)

    # Handle truncation
    max_length = get_max_prompt_length(state)
    while len(messages) > 0 and get_encoded_length(prompt) > max_length:
        # Try to save the system message
        if len(messages) > 1 and messages[0]['role'] == 'system':
    prompt = make_prompt(messages)
    encoded_length = get_encoded_length(prompt)

    while len(messages) > 0 and encoded_length > max_length:

        # Remove old message, save system message
        if len(messages) > 2 and messages[0]['role'] == 'system':
            messages.pop(1)
        else:

        # Remove old message when no system message is present
        elif len(messages) > 1 and messages[0]['role'] != 'system':
            messages.pop(0)

        # Resort to truncating the user input
        else:

            user_message = messages[-1]['content']

            # Bisect the truncation point
            left, right = 0, len(user_message) - 1

            while right - left > 1:
                mid = (left + right) // 2

                messages[-1]['content'] = user_message[mid:]
                prompt = make_prompt(messages)
                encoded_length = get_encoded_length(prompt)

                if encoded_length <= max_length:
                    right = mid
                else:
                    left = mid

            messages[-1]['content'] = user_message[right:]
            prompt = make_prompt(messages)
            encoded_length = get_encoded_length(prompt)
            if encoded_length > max_length:
                logger.error(f"Failed to build the chat prompt. The input is too long for the available context length.\n\nTruncation length: {state['truncation_length']}\nmax_new_tokens: {state['max_new_tokens']} (is it too high?)\nAvailable context length: {max_length}\n")
                raise ValueError
            else:
                logger.warning(f"The input has been truncated. Context length: {state['truncation_length']}, max_new_tokens: {state['max_new_tokens']}, available context length: {max_length}.")
                break

        prompt = make_prompt(messages)
        encoded_length = get_encoded_length(prompt)

    if also_return_rows:
        return prompt, [message['content'] for message in messages]
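
The bisection step in this hunk can be studied in isolation. The sketch below is a standalone approximation, not the function from `modules/chat.py`: `truncate_to_fit` is a hypothetical name, and `encoded_length` defaults to `len` (character count) as a stand-in for the tokenizer-based `get_encoded_length`. It keeps a suffix of the text, searching for a cut point where the remainder fits.

```python
def truncate_to_fit(text, max_length, encoded_length=len):
    # Nothing to do if the whole text already fits.
    if encoded_length(text) <= max_length:
        return text

    # `left` is a cut point known (or assumed) not to fit,
    # `right` is the candidate cut point that keeps the least text.
    left, right = 0, len(text) - 1

    while right - left > 1:
        mid = (left + right) // 2
        if encoded_length(text[mid:]) <= max_length:
            right = mid  # suffix fits: try keeping more text
        else:
            left = mid   # suffix too long: cut further to the right

    result = text[right:]
    if encoded_length(result) > max_length:
        # Mirrors the error path: even the shortest suffix does not fit.
        raise ValueError("The input is too long for the available context length.")
    return result


print(truncate_to_fit("abcdefghij", 4))  # ghij
```

Bisection needs only a logarithmic number of length checks in the size of the message, which matters when each check re-renders and re-tokenizes the whole prompt.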
3 changes: 2 additions & 1 deletion modules/llamacpp_hf.py
@@ -216,7 +216,8 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
'tensor_split': tensor_split_list,
'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
'logits_all': shared.args.logits_all,
'offload_kqv': not shared.args.no_offload_kqv
'offload_kqv': not shared.args.no_offload_kqv,
'split_mode': 1 if not shared.args.row_split else 2
}

Llama = llama_cpp_lib().Llama
3 changes: 2 additions & 1 deletion modules/llamacpp_model.py
@@ -95,7 +95,8 @@ def from_pretrained(self, path):
'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
'tensor_split': tensor_split_list,
'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
'offload_kqv': not shared.args.no_offload_kqv
'offload_kqv': not shared.args.no_offload_kqv,
'split_mode': 1 if not shared.args.row_split else 2
}

result.model = Llama(**params)
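
Both llama.cpp loaders now derive `split_mode` from the new `row_split` flag using the same expression. A minimal restatement of that mapping is below. The values 1 and 2 come from the diff; reading them as "split by layer" and "split by row" is an assumption about llama.cpp's split-mode enum, and `split_mode_from_flag` is a hypothetical helper name.

```python
def split_mode_from_flag(row_split: bool) -> int:
    # Equivalent to `1 if not row_split else 2` in the diff.
    # Assumed meaning: 1 = split the model across GPUs by layer (default),
    # 2 = split by row.
    return 2 if row_split else 1


default_mode = split_mode_from_flag(False)
row_mode = split_mode_from_flag(True)
```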
10 changes: 9 additions & 1 deletion modules/loaders.py
@@ -26,7 +26,7 @@
'compress_pos_emb',
'disable_exllama',
'disable_exllamav2',
'transformers_info'
'transformers_info',
],
'llama.cpp': [
'n_ctx',
@@ -44,6 +44,7 @@
'cpu',
'numa',
'no_offload_kqv',
'row_split',
'tensorcores',
],
'llamacpp_HF': [
@@ -66,6 +67,7 @@
'no_use_fast',
'logits_all',
'no_offload_kqv',
'row_split',
'tensorcores',
'llamacpp_HF_info',
],
@@ -159,6 +161,7 @@ def transformers_samplers():
'dynatemp_low',
'dynatemp_high',
'dynatemp_exponent',
'smoothing_factor',
'top_p',
'min_p',
'top_k',
@@ -189,6 +192,7 @@ def transformers_samplers():
'negative_prompt',
'ban_eos_token',
'custom_token_bans',
'sampler_priority',
'add_bos_token',
'skip_special_tokens',
'auto_max_new_tokens',
@@ -233,6 +237,7 @@ def transformers_samplers():
'dynatemp_low',
'dynatemp_high',
'dynatemp_exponent',
'smoothing_factor',
'top_p',
'min_p',
'top_k',
@@ -259,6 +264,7 @@ def transformers_samplers():
'negative_prompt',
'ban_eos_token',
'custom_token_bans',
'sampler_priority',
'add_bos_token',
'skip_special_tokens',
'auto_max_new_tokens',
@@ -289,6 +295,7 @@ def transformers_samplers():
'dynatemp_low',
'dynatemp_high',
'dynatemp_exponent',
'smoothing_factor',
'top_p',
'min_p',
'top_k',
@@ -315,6 +322,7 @@ def transformers_samplers():
'negative_prompt',
'ban_eos_token',
'custom_token_bans',
'sampler_priority',
'add_bos_token',
'skip_special_tokens',
'auto_max_new_tokens',
4 changes: 2 additions & 2 deletions modules/models.py
@@ -100,9 +100,9 @@ def load_model(model_name, loader=None):
    elif loader in ['llama.cpp', 'llamacpp_HF', 'ctransformers']:
        shared.settings['truncation_length'] = shared.args.n_ctx

    logger.info(f"LOADER: {loader}")
    logger.info(f"LOADER: \"{loader}\"")
    logger.info(f"TRUNCATION LENGTH: {shared.settings['truncation_length']}")
    logger.info(f"INSTRUCTION TEMPLATE: {metadata['instruction_template']}")
    logger.info(f"INSTRUCTION TEMPLATE: \"{metadata['instruction_template']}\"")
    logger.info(f"Loaded the model in {(time.time()-t0):.2f} seconds.")
    return model, tokenizer

2 changes: 2 additions & 0 deletions modules/presets.py
@@ -17,6 +17,7 @@ def default_preset():
'dynatemp_low': 1,
'dynatemp_high': 1,
'dynatemp_exponent': 1,
'smoothing_factor': 0,
'top_p': 1,
'min_p': 0,
'top_k': 0,
@@ -41,6 +42,7 @@ def default_preset():
'num_beams': 1,
'length_penalty': 1,
'early_stopping': False,
'sampler_priority': 'temperature\ndynamic_temperature\nquadratic_sampling\ntop_k\ntop_p\ntypical_p\nepsilon_cutoff\neta_cutoff\ntfs\ntop_a\nmin_p\nmirostat'
}

