Merge pull request #5680 from oobabooga/dev
Merge dev branch
oobabooga authored Mar 11, 2024
2 parents aa0da07 + 0567179 commit 1934cb6
Showing 38 changed files with 451 additions and 337 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -269,6 +269,9 @@ List of command-line flags
| `--logits_all`| Needs to be set for perplexity evaluation to work. Otherwise, ignore it, as it makes prompt processing slower. |
| `--no_offload_kqv` | Do not offload the K, Q, V to the GPU. This saves VRAM but reduces the performance. |
| `--cache-capacity CACHE_CAPACITY` | Maximum cache capacity (llama-cpp-python). Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |
| `--row_split` | Split the model by rows across GPUs. This may improve multi-gpu performance. |
| `--streaming-llm` | Activate StreamingLLM to avoid re-evaluating the entire prompt when old messages are removed. |
| `--attention-sink-size ATTENTION_SINK_SIZE` | StreamingLLM: number of sink tokens. Only used if the trimmed prompt doesn't share a prefix with the old prompt. |
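
The two StreamingLLM flags describe a cache-reuse rule: keep the already-evaluated prompt when the new prompt shares a prefix with it, and otherwise fall back to a handful of "sink" tokens. A rough illustration of that rule (the function name and default value are mine, not the project's actual implementation):

```python
def tokens_to_keep(old_tokens, new_tokens, attention_sink_size=5):
    """Illustrative only: how many cached tokens can be reused."""
    shared = 0
    for old_tok, new_tok in zip(old_tokens, new_tokens):
        if old_tok != new_tok:
            break
        shared += 1

    if shared > 0:
        # The prompts share a prefix: only the tail needs re-evaluation.
        return shared
    # No shared prefix: keep only the attention-sink tokens.
    return min(attention_sink_size, len(old_tokens))
```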

#### ExLlamav2

9 changes: 5 additions & 4 deletions docs/04 - Model Tab.md
@@ -80,16 +80,17 @@ Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF

* **n-gpu-layers**: The number of layers to allocate to the GPU. If set to 0, only the CPU will be used. If you want to offload all layers, you can simply set this to the maximum value.
* **n_ctx**: Context length of the model. In llama.cpp, the cache is preallocated, so the higher this value, the higher the VRAM usage. It is automatically set to the maximum sequence length for the model based on the metadata inside the GGUF file, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "n_ctx" so that you don't have to set the same thing twice.
* **tensor_split**: For multi-gpu only. Sets the amount of memory to allocate per GPU as proportions. Not to be confused with other loaders where this is set in GB; here you can set something like `30,70` for 30%/70%.
* **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
* **threads**: Number of threads. Recommended value: your number of physical cores.
* **threads_batch**: Number of threads for batch processing. Recommended value: your total number of cores (physical + virtual).
* **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
* **tensorcores**: Use llama.cpp compiled with "tensor cores" support, which improves performance on NVIDIA RTX cards in most cases.
* **streamingllm**: Experimental feature to avoid re-evaluating the entire prompt when part of it is removed, for instance, when you hit the context length for the model in chat mode and an old message is removed.
* **cpu**: Force a version of llama.cpp compiled without GPU acceleration to be used. Can usually be ignored. Only set this if you want to use CPU only and llama.cpp doesn't work otherwise.
* **no_mul_mat_q**: Disable the mul_mat_q kernel. This kernel usually improves generation speed significantly. The option to disable it is included in case it doesn't work on some systems.
* **no-mmap**: Loads the model into memory at once, possibly preventing I/O operations later on at the cost of a longer load time.
* **mlock**: Force the system to keep the model in RAM rather than swapping or compressing (no idea what this means, never used it).
* **numa**: May improve performance on certain multi-cpu systems.
* **cpu**: Force a version of llama.cpp compiled without GPU acceleration to be used. Can usually be ignored. Only set this if you want to use CPU only and llama.cpp doesn't work otherwise.
* **tensor_split**: For multi-gpu only. Sets the amount of memory to allocate per GPU.
* **Seed**: The seed for the llama.cpp random number generator. Not very useful, as it can only be set once (as far as I'm aware).
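
For reference, the options in this list map roughly onto the `Llama` constructor of llama-cpp-python. A minimal sketch with placeholder values (the webui's own loader wraps this and passes more arguments than shown here):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=35,           # n-gpu-layers: 0 = CPU only
    n_ctx=4096,                # n_ctx: cache is preallocated, so higher = more VRAM
    n_batch=512,               # n_batch: prompt-processing batch size
    n_threads=8,               # threads: physical cores
    n_threads_batch=16,        # threads_batch: physical + virtual cores
    tensor_split=[0.3, 0.7],   # tensor_split: per-GPU proportions, multi-GPU only
    use_mmap=True,             # corresponds to leaving no-mmap unchecked
    use_mlock=False,           # corresponds to leaving mlock unchecked
    seed=-1,                   # Seed: -1 picks a random seed
)
```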

### llamacpp_HF

8 changes: 4 additions & 4 deletions extensions/openai/completions.py
@@ -250,13 +250,13 @@ def chat_completions_common(body: dict, is_legacy: bool = False, stream=False) -
else:
instruction_template_str = shared.settings['instruction_template_str']

chat_template_str = body['chat_template_str'] or shared.settings['chat_template_str']
chat_instruct_command = body['chat_instruct_command'] or shared.settings['chat-instruct_command']
chat_template_str = body['chat_template_str'] or shared.default_settings['chat_template_str']
chat_instruct_command = body['chat_instruct_command'] or shared.default_settings['chat-instruct_command']

# Chat character
character = body['character'] or shared.settings['character']
character = body['character'] or shared.default_settings['character']
character = "Assistant" if character == "None" else character
name1 = body['user_name'] or shared.settings['name1']
name1 = body['user_name'] or shared.default_settings['name1']
name1, name2, _, greeting, context = load_character_memoized(character, name1, '')
name2 = body['bot_name'] or name2
context = body['context'] or context
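
Since these values now fall back to `shared.default_settings` only when the request omits them, they can still be overridden per request. A rough sketch of a chat completion call that sets them explicitly (assuming the OpenAI-compatible API is running on its default port 5000):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",  # default API port is an assumption
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "mode": "chat",
        # Optional fields read above; when omitted, the server now falls back
        # to shared.default_settings instead of shared.settings.
        "character": "Assistant",
        "user_name": "You",
        "bot_name": "Assistant",
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```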
52 changes: 12 additions & 40 deletions extensions/superbooga/chromadb.py
@@ -1,43 +1,24 @@
import random

import chromadb
import posthog
import torch
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

from modules.logging_colors import logger
from chromadb.utils import embedding_functions

logger.info('Intercepting all calls to posthog :)')
# Intercept calls to posthog
posthog.capture = lambda *args, **kwargs: None


class Collecter():
def __init__(self):
pass

def add(self, texts: list[str]):
pass

def get(self, search_strings: list[str], n_results: int) -> list[str]:
pass
embedder = embedding_functions.SentenceTransformerEmbeddingFunction("sentence-transformers/all-mpnet-base-v2")

def clear(self):
pass


class Embedder():
class ChromaCollector():
def __init__(self):
pass

def embed(self, text: str) -> list[torch.Tensor]:
pass
name = ''.join(random.choice('ab') for _ in range(10))


class ChromaCollector(Collecter):
def __init__(self, embedder: Embedder):
super().__init__()
self.name = name
self.chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))
self.embedder = embedder
self.collection = self.chroma_client.create_collection(name="context", embedding_function=embedder.embed)
self.collection = self.chroma_client.create_collection(name=name, embedding_function=embedder)
self.ids = []

def add(self, texts: list[str]):
@@ -102,24 +83,15 @@ def get_ids_sorted(self, search_strings: list[str], n_results: int, n_initial: i
return sorted(ids)

def clear(self):
self.collection.delete(ids=self.ids)
self.ids = []


class SentenceTransformerEmbedder(Embedder):
def __init__(self) -> None:
self.model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
self.embed = self.model.encode
self.chroma_client.delete_collection(name=self.name)
self.collection = self.chroma_client.create_collection(name=self.name, embedding_function=embedder)


def make_collector():
global embedder
return ChromaCollector(embedder)
return ChromaCollector()


def add_chunks_to_collector(chunks, collector):
collector.clear()
collector.add(chunks)


embedder = SentenceTransformerEmbedder()
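
The rewritten module drops the Collecter/Embedder abstractions in favor of a single `ChromaCollector` that owns a randomly named collection and a shared `SentenceTransformerEmbeddingFunction`. A standalone sketch of the same chromadb 0.4 flow:

```python
import random

import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions

embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    "sentence-transformers/all-mpnet-base-v2"
)

client = chromadb.Client(Settings(anonymized_telemetry=False))
name = ''.join(random.choice('ab') for _ in range(10))  # random per-instance name
collection = client.create_collection(name=name, embedding_function=embedder)

# Index a couple of chunks and query them back.
collection.add(documents=["first chunk", "second chunk"], ids=["id0", "id1"])
hits = collection.query(query_texts=["first"], n_results=1)

# clear() now drops and recreates the collection instead of deleting ids one by one.
client.delete_collection(name=name)
collection = client.create_collection(name=name, embedding_function=embedder)
```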
2 changes: 1 addition & 1 deletion extensions/superbooga/requirements.txt
@@ -1,5 +1,5 @@
beautifulsoup4==4.12.2
chromadb==0.3.18
chromadb==0.4.24
pandas==2.0.3
posthog==2.4.2
sentence_transformers==2.2.2
33 changes: 11 additions & 22 deletions extensions/superboogav2/api.py
@@ -12,17 +12,16 @@

import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse, parse_qs
from threading import Thread
from urllib.parse import parse_qs, urlparse

import extensions.superboogav2.parameters as parameters
from modules import shared
from modules.logging_colors import logger

from .chromadb import ChromaCollector
from .data_processor import process_and_add_to_collector

import extensions.superboogav2.parameters as parameters


class CustomThreadingHTTPServer(ThreadingHTTPServer):
def __init__(self, server_address, RequestHandlerClass, collector: ChromaCollector, bind_and_activate=True):
@@ -38,30 +37,26 @@ def __init__(self, request, client_address, server, collector: ChromaCollector):
self.collector = collector
super().__init__(request, client_address, server)


def _send_412_error(self, message):
self.send_response(412)
self.send_header("Content-type", "application/json")
self.end_headers()
response = json.dumps({"error": message})
self.wfile.write(response.encode('utf-8'))


def _send_404_error(self):
self.send_response(404)
self.send_header("Content-type", "application/json")
self.end_headers()
response = json.dumps({"error": "Resource not found"})
self.wfile.write(response.encode('utf-8'))


def _send_400_error(self, error_message: str):
self.send_response(400)
self.send_header("Content-type", "application/json")
self.end_headers()
response = json.dumps({"error": error_message})
self.wfile.write(response.encode('utf-8'))


def _send_200_response(self, message: str):
self.send_response(200)
@@ -75,24 +70,21 @@ def _send_200_response(self, message: str):

self.wfile.write(response.encode('utf-8'))


def _handle_get(self, search_strings: list[str], n_results: int, max_token_count: int, sort_param: str):
if sort_param == parameters.SORT_DISTANCE:
results = self.collector.get_sorted_by_dist(search_strings, n_results, max_token_count)
elif sort_param == parameters.SORT_ID:
results = self.collector.get_sorted_by_id(search_strings, n_results, max_token_count)
else: # Default is dist
else: # Default is dist
results = self.collector.get_sorted_by_dist(search_strings, n_results, max_token_count)

return {
"results": results
}


def do_GET(self):
self._send_404_error()


def do_POST(self):
try:
content_length = int(self.headers['Content-Length'])
@@ -107,7 +99,7 @@ def do_POST(self):
if corpus is None:
self._send_412_error("Missing parameter 'corpus'")
return

clear_before_adding = body.get('clear_before_adding', False)
metadata = body.get('metadata')
process_and_add_to_collector(corpus, self.collector, clear_before_adding, metadata)
@@ -118,7 +110,7 @@ def do_POST(self):
if corpus is None:
self._send_412_error("Missing parameter 'metadata'")
return

self.collector.delete(ids_to_delete=None, where=metadata)
self._send_200_response("Data successfully deleted")

@@ -127,15 +119,15 @@ def do_POST(self):
if search_strings is None:
self._send_412_error("Missing parameter 'search_strings'")
return

n_results = body.get('n_results')
if n_results is None:
n_results = parameters.get_chunk_count()

max_token_count = body.get('max_token_count')
if max_token_count is None:
max_token_count = parameters.get_max_token_count()

sort_param = query_params.get('sort', ['distance'])[0]

results = self._handle_get(search_strings, n_results, max_token_count, sort_param)
@@ -146,7 +138,6 @@ def do_POST(self):
except Exception as e:
self._send_400_error(str(e))


def do_DELETE(self):
try:
parsed_path = urlparse(self.path)
@@ -161,12 +152,10 @@ def do_DELETE(self):
except Exception as e:
self._send_400_error(str(e))


def do_OPTIONS(self):
self.send_response(200)
self.end_headers()


def end_headers(self):
self.send_header('Access-Control-Allow-Origin', '*')
self.send_header('Access-Control-Allow-Methods', '*')
@@ -197,11 +186,11 @@ def start_server(self, port: int):

def stop_server(self):
if self.server is not None:
logger.info(f'Stopping chromaDB API.')
logger.info('Stopping chromaDB API.')
self.server.shutdown()
self.server.server_close()
self.server = None
self.is_running = False

def is_server_running(self):
return self.is_running
return self.is_running
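
The POST handlers above read JSON bodies with `corpus`, `search_strings`, `n_results`, and `max_token_count`, plus a `sort` query parameter; the routing itself is outside this hunk. A rough client sketch (the paths and port below are assumptions, not taken from the code shown):

```python
import requests

BASE = "http://127.0.0.1:5002"  # assumed host/port; see start_server() for the real one

# Add a corpus to the collector, optionally clearing it first.
requests.post(f"{BASE}/api/add", json={   # endpoint path is an assumption
    "corpus": "Some long document to chunk and index.",
    "clear_before_adding": True,
})

# Query it back; sort defaults to distance, ?sort=id sorts by id instead.
r = requests.post(f"{BASE}/api/get?sort=distance", json={   # endpoint path is an assumption
    "search_strings": ["document"],
    "n_results": 5,
    "max_token_count": 512,
})
print(r.json()["results"])
```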
14 changes: 7 additions & 7 deletions extensions/superboogav2/benchmark.py
@@ -9,23 +9,23 @@
import datetime
import json
import os

from pathlib import Path

from .data_processor import process_and_add_to_collector, preprocess_text
from .data_processor import preprocess_text, process_and_add_to_collector
from .parameters import get_chunk_count, get_max_token_count
from .utils import create_metadata_source


def benchmark(config_path, collector):
# Get the current system date
sysdate = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"benchmark_{sysdate}.txt"

# Open the log file in append mode
with open(filename, 'a') as log:
with open(config_path, 'r') as f:
data = json.load(f)

total_points = 0
max_points = 0

Expand All @@ -45,7 +45,7 @@ def benchmark(config_path, collector):
for question_group in item["questions"]:
question_variants = question_group["question_variants"]
criteria = question_group["criteria"]

for q in question_variants:
max_points += len(criteria)
processed_text = preprocess_text(q)
@@ -54,7 +54,7 @@ def benchmark(config_path, collector):
results = collector.get_sorted_by_dist(processed_text, n_results=get_chunk_count(), max_token_count=get_max_token_count())

points = 0

for c in criteria:
for p in results:
if c in p:
@@ -69,4 +69,4 @@

print(f'##Total points:\n\n{total_points}/{max_points}', file=log)

return total_points, max_points
return total_points, max_points
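
The loop above implies a config shape of items carrying `questions`, each with `question_variants` and `criteria` (a variant scores one point per criterion string found in the retrieved chunks). A minimal example of that shape, with invented contents:

```python
import json

# Shape inferred from the loop in benchmark(); fields used before this hunk
# (e.g. the per-item corpus text) are omitted.
config = [
    {
        "questions": [
            {
                "question_variants": [
                    "What colour is the sky?",
                    "Which colour does the sky have?",
                ],
                "criteria": ["blue"],
            }
        ],
    }
]

with open("benchmark_config.json", "w") as f:
    json.dump(config, f, indent=2)
```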