
Can we get RWKV model family support? #98

Closed
ye7iaserag opened this issue Feb 21, 2023 · 87 comments
Labels
enhancement New feature or request stale

Comments

@ye7iaserag
Contributor

A very promising RNN-based model (instead of a transformer), claiming leaner memory consumption and faster inference.

Here is the 14b card
https://huggingface.co/BlinkDL/rwkv-4-pile-14b

@Slug-Cat

That would be quite interesting

@oobabooga
Owner

The lack of Hugging Face integration makes this challenging, but it should be possible.

@ye7iaserag
Contributor Author

Here is client code from the author: https://github.com/BlinkDL/ChatRWKV

@BlinkDL

BlinkDL commented Feb 23, 2023

Here is the API for RWKV: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py
It is very simple to use :) Please join the RWKV Discord if you have any questions.

@BlinkDL

BlinkDL commented Feb 23, 2023

ChatRWKV v2 comes with "stream" and "split" strategies; 3 GB of VRAM is enough to run RWKV 14B :)
ChatRWKV v2 API: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py

import os
os.environ["RWKV_JIT_ON"] = '1'

from rwkv.model import RWKV                         # everything in the /v2/rwkv folder
model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-1b5/RWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')

out, state = model.forward([187, 510, 1563, 310, 247], None)   # use 20B_tokenizer.json
print(out.detach().cpu().numpy())                   # get logits
out, state = model.forward([187, 510], None)
out, state = model.forward([1563], state)           # RNN has state (use deepcopy if you want to clone it)
out, state = model.forward([310, 247], state)
print(out.detach().cpu().numpy())                   # same result as above

@oobabooga
Owner

@BlinkDL is it possible to construct a wrapper to the model that simply does something like this?

prompt = "What I would like to say is: "
completion = model_RWKV.generate(prompt, max_new_tokens=20)
print(completion)

That would be very helpful.

@BlinkDL

BlinkDL commented Feb 23, 2023

@BlinkDL is it possible to construct a wrapper to the model that simply does something like this?

prompt = "What I would like to say is: "
completion = model_RWKV.generate(prompt, max_new_tokens=20)
print(completion)

That would be very helpful.

Sure. How about temp, top-p, and top-k?

@oobabooga
Owner

Those things are nice, but not essential at first. FlexGen for instance has departed from HuggingFace classes but still provided a model.generate function that was trivial to use for clueless people (like me).

@BlinkDL

BlinkDL commented Feb 23, 2023

Those things are nice, but not essential at first. FlexGen for instance has departed from HuggingFace classes but still provided a model.generate function that was trivial to use for clueless people (like me).

Updated: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py

def generate(prompt, max_new_tokens, state=None):
    decoded = ''
    all_tokens = []
    for i in range(max_new_tokens):
        # feed the full prompt on the first step, then one sampled token at a time
        logits, state = model.forward(tokenizer.encode(prompt) if i == 0 else [token], state)
        token = tokenizer.sample_logits(logits, None, None, temperature=1.0, top_p=0.8)
        all_tokens += [token]
        tmp = tokenizer.decode(all_tokens)
        if '\ufffd' not in tmp: # only keep the decode once it is a valid utf-8 string
            decoded = tmp
    return decoded

prompt = "What I would like to say is: "
print(prompt, end='')
completion = generate(prompt, max_new_tokens=20)
print(completion)

Note: it's slow (not optimized yet) when your prompt is long. Better to keep it as short as possible (for now).

@BlinkDL

BlinkDL commented Feb 23, 2023

I added some tips to https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py

@oobabooga
Owner

Very nice, @BlinkDL! Thank you for this. I will try incorporating it into the web UI later.

@oobabooga oobabooga added the enhancement New feature or request label Feb 23, 2023
@oobabooga oobabooga pinned this issue Feb 23, 2023
@BlinkDL

BlinkDL commented Feb 25, 2023

Now ChatRWKV can quickly preprocess the context if you set os.environ["RWKV_CUDA_ON"] = '1' before loading it :)
It will use ninja to compile a CUDA kernel. You can distribute the compiled kernels to those without a compiler.
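
A minimal sketch of how this might be wired up in a loader, assuming the env vars must be set before the rwkv import (the model path and strategy below are placeholders, not recommendations):

import os
# Both flags are read when rwkv.model is imported, so set them first.
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'   # compiles the CUDA kernel with ninja on first use

from rwkv.model import RWKV

# Placeholder path and strategy, for illustration only.
model = RWKV(model='models/RWKV-4-Pile-14B-20230213-8019',
             strategy='cuda fp16 *30 -> cpu fp32')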

@oobabooga
Owner

oobabooga commented Feb 28, 2023

We have a first successful run: #149

[screenshot: first successful RWKV run in the web UI]

I am very impressed with the coherence of the model so far, even though I have used the smallest version.

@BlinkDL I have a few questions:

  • What are alpha_frequency and alpha_presence in terms of Hugging Face parameters? These are the parameters that I am using at the moment:

max_new_tokens, do_sample, temperature, top_p, typical_p, repetition_penalty, top_k, min_length, no_repeat_ngram_size, num_beams, penalty_alpha, length_penalty, early_stopping, eos_token

  • Is there a way of adding the library to my project other than cloning the repository and importing from the rwkv folder?

@BlinkDL

BlinkDL commented Feb 28, 2023

For alpha_frequency and alpha_presence, see "Frequency and presence penalties":
https://platform.openai.com/docs/api-reference/parameter-details
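
For reference, a rough sketch of how these OpenAI-style penalties adjust the logits before sampling. This is my paraphrase of the linked formula, not the rwkv package's internal code, and the 0.25 defaults are illustrative:

import torch

def apply_alpha_penalties(logits, generated_tokens, alpha_frequency=0.25, alpha_presence=0.25):
    # logits: 1-D tensor over the vocabulary; generated_tokens: ids sampled so far
    counts = {}
    for t in generated_tokens:
        counts[t] = counts.get(t, 0) + 1
    for t, n in counts.items():
        # presence penalty fires once per token already seen;
        # frequency penalty grows with how often it has been seen
        logits[t] -= alpha_presence + alpha_frequency * n
    return logits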

Please clone and import until I make the pip package :) Too busy at the moment, but I will do it.

@BlinkDL

BlinkDL commented Mar 1, 2023

The pip package is here :)
https://pypi.org/project/rwkv/
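
A quick smoke test of the package, as a sketch (the model path is a placeholder for whatever .pth file you have downloaded):

# pip install rwkv
import os
os.environ["RWKV_JIT_ON"] = '1'

from rwkv.model import RWKV   # imported from the installed package instead of the cloned /v2/rwkv folder

model = RWKV(model='models/RWKV-4-Pile-169M-20220807-8023', strategy='cpu fp32')
out, state = model.forward([187, 510, 1563, 310, 247], None)   # token ids from 20B_tokenizer.json
print(out.detach().cpu().numpy()[:10])                         # first few logits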

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 1, 2023

Is that 3 GB of memory all GPU, or is it offloaded? Would this scale to a 60B model and fit on a single GPU?

@oobabooga
Owner

Much easier with the pip package. Some more tweaks and I should be able to merge the PR.

@oobabooga
Owner

oobabooga commented Mar 1, 2023

I have merged the PR and RWKV models now work in the web UI. Some comments:

  1. I have made successful tests with RWKV-4-Pile-7B-20221115-8047.pth and RWKV-4-Pile-169M-20220807-8023.pth. I ran out of GPU memory with RWKV-4-Pile-14B-20230213-8019.pth (RTX 3090).
  2. To install a model, just place the .pth file and the 20B_tokenizer.json file inside the models folder.
  3. In order to make the process of loading the model more familiar, I have created a RWKV module with the following methods: RWKVModel.from_pretrained and RWKVModel.generate (a rough usage sketch follows at the end of this comment). Source code: https://github.com/oobabooga/text-generation-webui/blob/main/modules/RWKV.py
  4. Only the top_p and temperature parameters can be changed in the interface for now. I am reluctant to add alpha_frequency and alpha_presence sliders because these parameters are used exclusively by RWKV and this change would break the API. They are left hardcoded to their default values of 0.25.
  5. It is possible to run the models in CPU mode with --cpu. By default, they are loaded onto the GPU.
  6. The best way to try the models is with python server.py --no-stream. That is, without --chat, --cai-chat, etc.
  7. The current implementation should only work on Linux because the rwkv library reads paths as strings. It would be best to accept pathlib.Path variables too.
  8. After loading the model, make sure to lower the temperature to between 0.5 and 1 for best results.
  9. The internal usage of the model required some additional special handling because there is no explicit tokenizer. In HF, the process is to take an input string, convert it into a sequence of tokens, generate, and then convert back to a string.

This is the full code for the PR: https://github.com/oobabooga/text-generation-webui/pull/149/files

There is probably lots of room for improvement.
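
A rough usage sketch of that module (the exact keyword names and paths are assumptions based on the description above; see modules/RWKV.py in the linked PR for the real signatures):

from modules.RWKV import RWKVModel, RWKVTokenizer

# Paths and keyword names are illustrative.
model = RWKVModel.from_pretrained('models/RWKV-4-Pile-169M-20220807-8023.pth',
                                  dtype='fp16', device='cuda')
tokenizer = RWKVTokenizer.from_pretrained('models')   # expects 20B_tokenizer.json in the models folder

reply = model.generate("What I would like to say is: ",
                       token_count=20, temperature=1.0, top_p=0.8)
print(reply)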

@BlinkDL

BlinkDL commented Mar 1, 2023

Cool. Please use rwkv==0.0.6 (it fixes a bug with temperature on CPU).

You can use "strategies" to load the 14B model on a 3090. For example, 'cuda fp16 *30 -> cpu fp32' [try increasing 30, for better speed, until you run out of VRAM].

@oobabooga
Owner

Please use rwkv==0.0.6

Done. I have added a new flag to let users specify their custom strategies manually:

  --rwkv-strategy RWKV_STRATEGY             The strategy to use while loading RWKV models. Examples: "cpu fp32", "cuda fp16", "cuda fp16 *30 -> cpu fp32".

As you suggested, "cuda fp16 *30 -> cpu fp32" allowed me to get a response from the 14b model with 24GB VRAM.

@oobabooga
Owner

My streaming implementation is probably really dumb and much slower than it could be:

            for i in range(max_new_tokens//8):
                reply = shared.model.generate(question, token_count=8, temperature=temperature, top_p=top_p)
                yield formatted_outputs(reply, shared.model_name)
                question = reply

It would be best to use the callback argument but I haven't figured out how yet.
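
For what it's worth, one hedged sketch of using the callback: run the generation in a background thread and push each decoded chunk through a queue. This assumes the callback receives decoded text chunks, as in the ChatRWKV demos; `shared` and `formatted_outputs` are the names used in the loop above, and the import paths are guesses:

from queue import Queue
from threading import Thread

from modules import shared                          # import paths assumed
from modules.text_generation import formatted_outputs

def stream_generate(question, max_new_tokens, temperature, top_p):
    q = Queue()

    def on_chunk(text):                             # called by the model for each decoded chunk
        q.put(text)

    def worker():
        shared.model.generate(question, token_count=max_new_tokens,
                              temperature=temperature, top_p=top_p,
                              callback=on_chunk)
        q.put(None)                                 # sentinel: generation is finished

    Thread(target=worker, daemon=True).start()
    reply = question
    while (chunk := q.get()) is not None:
        reply += chunk
        yield formatted_outputs(reply, shared.model_name)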

@ye7iaserag
Contributor Author

@BlinkDL You said earlier that with ChatRWKV v2's "stream" and "split" strategies, 3 GB of VRAM is enough to run RWKV 14B.
Yet @oobabooga said it went OOM on a 3090 (24 GB of VRAM).
What am I missing?

@BlinkDL

BlinkDL commented Mar 2, 2023

@BlinkDL You said earlier that with ChatRWKV v2's "stream" and "split" strategies, 3 GB of VRAM is enough to run RWKV 14B. Yet @oobabooga said it went OOM on a 3090 (24 GB of VRAM). What am I missing?

See the post above you:
"As you suggested, "cuda fp16 *30 -> cpu fp32" allowed me to get a response from the 14b model with 24GB VRAM."

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 2, 2023

If it replies faster/better than a regular 13b even with the split, it's still something. Plus the faster time to train. But I guess miracles we will not get.

@BlinkDL

BlinkDL commented Mar 2, 2023

If it replies faster/better than a regular 13b even with the split, it's still something. Plus the faster time to train. But I guess miracles we will not get.

Why not? RWKV is faster than regular 13b and better on many tasks.

@ye7iaserag
Contributor Author

@BlinkDL but it's 24 GB, not 3 GB. I really wanted to run it on a 3080 Ti, which only has 12 GB.

@BlinkDL

BlinkDL commented Mar 3, 2023

@BlinkDL but it's 24 GB, not 3 GB. I really wanted to run it on a 3080 Ti, which only has 12 GB.

"cuda fp16 *12 -> cpu fp32" [try increasing 12, for better speed, until you run out of VRAM]

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 3, 2023

Tried the 14B. The coherence is excellent, but the reply times were in the 160-second range.

Loaded it like this:

python server.py --no-stream --model rwkv-4-pile-14b --rwkv-strategy "cuda fp16 *30 -> cpu fp32"
python server.py --cai-chat --no-stream --model rwkv-4-pile-14b --rwkv-strategy "cuda fp16 *30 -> cpu fp32"

It used about 20 GB of VRAM and 20 GB of RAM. During generation it only used a single CPU core and the GPU was mostly idle. Was that a bug?
I think using either more GPU time or more cores would help tremendously.

[screenshot: RWKV 14B generation in the web UI]

@BlinkDL

BlinkDL commented Mar 14, 2023

@oobabooga new and better RWKV 14B ctx8192: https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth

(there is a 7B ctx8192 too, but it is not as good as ctx4096 when ctxlen < 3k)

@BlinkDL

BlinkDL commented Mar 15, 2023

Update ChatRWKV v2 & the pip rwkv package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1'
for 1.5x speed with f16i8 (and 10% less VRAM: now 14686 MB for the 14B model instead of 16462 MB, so you can put more layers on the GPU).
PLEASE VERIFY THE GENERATION QUALITY IS UNCHANGED.
@Ph0rk0z

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 15, 2023

New one is blocked by BlinkDL/ChatRWKV#38

@ye7iaserag
Contributor Author

Got it working with:

python server.py --cai-chat --listen --listen-port=8888 --model=RWKV-4-Pile-14B-20230228-ctx4096-test663.pth --rwkv-cuda-on --rwkv-strategy="cuda fp16i8 *25 -> cpu fp32"

on a 3080 Ti with 12 GB of VRAM and 64 GB of RAM.
Thanks a lot, @BlinkDL and @oobabooga!

@BlinkDL

BlinkDL commented Mar 17, 2023

on a 3080 Ti with 12 GB of VRAM and 64 GB of RAM. Thanks a lot, @BlinkDL and @oobabooga!

Latest model: https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth
:)

@ye7iaserag
Contributor Author

on a 3080 Ti with 12 GB of VRAM and 64 GB of RAM. Thanks a lot, @BlinkDL and @oobabooga!

Latest model: https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth :)

Yes, downloading it now after I saw your Reddit post. I asked there: what is the difference between this version and the previous one?

@BlinkDL

BlinkDL commented Mar 17, 2023

Update ChatRWKV v2 & the pip rwkv package (0.6.0):

The os.environ["RWKV_CUDA_ON"] = '1' mode is now 10% faster for f16i8 on all GPUs, and it supports old GPUs.

It supports "chunk_len" in PIPELINE_ARGS, which can split long inputs into chunks to save VRAM (shorter -> slower).
Default = 256. @oobabooga
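
A sketch of where chunk_len plugs in (parameter names besides chunk_len follow the ChatRWKV demos and may differ between versions; the paths are placeholders):

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

model = RWKV(model='models/RWKV-4-Pile-14B-20230313-ctx8192-test1050',
             strategy='cuda fp16i8 *25 -> cpu fp32')
pipeline = PIPELINE(model, 'models/20B_tokenizer.json')

args = PIPELINE_ARGS(temperature=1.0, top_p=0.8,
                     alpha_frequency=0.25, alpha_presence=0.25,
                     chunk_len=256)   # split long prompts into 256-token chunks to save VRAM
print(pipeline.generate("What I would like to say is: ", token_count=20, args=args))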

@BlinkDL

BlinkDL commented Mar 17, 2023

New one is blocked by BlinkDL/ChatRWKV#38

fixed in latest version :)

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 18, 2023

After I updated transformers it is loading using a single core again and it is abysmal. Performance is as you can see below, since only one CPU core is used now.

0.6.0
Output generated in 90.31 seconds (0.13 tokens/s, 12 tokens)
Output generated in 86.88 seconds (0.12 tokens/s, 10 tokens)
Output generated in 23.34 seconds (0.47 tokens/s, 11 tokens)
Output generated in 24.15 seconds (0.41 tokens/s, 10 tokens)

0.4.2
Output generated in 9.00 seconds (1.00 tokens/s, 9 tokens)
Output generated in 5.08 seconds (1.97 tokens/s, 10 tokens)
Output generated in 3.86 seconds (1.56 tokens/s, 6 tokens)
Output generated in 6.61 seconds (2.12 tokens/s, 14 tokens)
Output generated in 17.30 seconds (2.60 tokens/s, 45 tokens)

I even tried clearing the chat :( no help.

@oobabooga oobabooga unpinned this issue Mar 18, 2023
@BlinkDL

BlinkDL commented Mar 19, 2023

Update ChatRWKV v2 & the pip rwkv package (0.7.0):
Use v2/convert_model.py to convert a model for a given strategy, for faster loading and less CPU RAM usage.

@Ph0rk0z what is the problematic transformers version?

oobabooga added a commit that referenced this issue Mar 19, 2023
@Ph0rk0z
Contributor

Ph0rk0z commented Mar 19, 2023

Transformers went from using multiple cores during GPTQ and model loading to a single core, but I did a test and it proves performance is better, at least for LLaMA. BTW, I was meaning to ask: why can't I get deterministic results from RWKV?

I will test the perf of the new update in a minute.

@BlinkDL

BlinkDL commented Mar 19, 2023

BTW, I was meaning to ask: why can't I get deterministic results from RWKV?

It is non-deterministic if you use the CUDA kernel, because of non-deterministic floating-point computation order.
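
A hedged workaround sketch (my suggestion, not an official rwkv switch): keep the custom CUDA kernel off and seed the samplers, at the cost of the preprocessing speedup:

import os
os.environ["RWKV_CUDA_ON"] = '0'   # stay on the default kernels, avoiding the
                                   # non-deterministic float ordering described above

import numpy as np
import torch
np.random.seed(0)                  # seed both RNGs so top-p/temperature sampling
torch.manual_seed(0)               # draws the same tokens on every run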

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 19, 2023

Ahh, that makes some sense. Time to check it.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 19, 2023

For my Pascal-based card, at least, there is no difference.

0.7.0 
Output generated in 92.55 seconds (0.23 tokens/s, 21 tokens)
Output generated in 83.81 seconds (0.19 tokens/s, 16 tokens)
Output generated in 84.42 seconds (0.08 tokens/s, 7 tokens)

0.4.2 - today
Output generated in 8.07 seconds (0.87 tokens/s, 7 tokens)
Output generated in 4.38 seconds (1.82 tokens/s, 8 tokens)
Output generated in 3.82 seconds (1.57 tokens/s, 6 tokens)
Output generated in 11.94 seconds (2.51 tokens/s, 30 tokens)

90s is the same response time as an offloaded 30B model. I think the fix for older GPUs made it run very slow.

@BlinkDL

BlinkDL commented Mar 19, 2023

90s is the same response time as an offloaded 30B model. I think the fix for older GPUs made it run very slow.

Please benchmark 0.5.0, 0.6.0, 0.6.1, and 0.6.2 so that I can find the culprit :)

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 20, 2023

I have benchmarked before. I checked and made sure that transformers only affected the model load time.

0.4.2
Output generated in 8.07 seconds (0.87 tokens/s, 7 tokens)
Output generated in 4.38 seconds (1.82 tokens/s, 8 tokens)
Output generated in 3.82 seconds (1.57 tokens/s, 6 tokens)

0.5.0 - does not compile due to compute 6.1

0.6.0 
Output generated in 90.31 seconds (0.13 tokens/s, 12 tokens)
Output generated in 86.88 seconds (0.12 tokens/s, 10 tokens)
Output generated in 23.34 seconds (0.47 tokens/s, 11 tokens)

0.7.0
Output generated in 92.55 seconds (0.23 tokens/s, 21 tokens)
Output generated in 83.81 seconds (0.19 tokens/s, 16 tokens)
Output generated in 84.42 seconds (0.08 tokens/s, 7 tokens)

The problem started when you changed the int8 multiplication to something unsupported. The fix for that is too slow so I have been using 0.4.2.

@Slug-Cat

Slug-Cat commented Mar 20, 2023

I am unable to use the --rwkv-cuda-on option, as I get this error:

  File "E:\oobabooga\text-generation-webui\server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "E:\oobabooga\text-generation-webui\modules\models.py", line 90, in load_model
    from modules.RWKV import RWKVModel, RWKVTokenizer
  File "E:\oobabooga\text-generation-webui\modules\RWKV.py", line 15, in <module>
    from rwkv.model import RWKV
  File "E:\oobabooga\installer_files\env\lib\site-packages\rwkv\model.py", line 29, in <module>
    load(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1592, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1648, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

@BlinkDL

BlinkDL commented Mar 20, 2023

I am unable to use the --rwkv-cuda-on option, as I get this error:

  File "E:\oobabooga\text-generation-webui\server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "E:\oobabooga\text-generation-webui\modules\models.py", line 90, in load_model
    from modules.RWKV import RWKVModel, RWKVTokenizer
  File "E:\oobabooga\text-generation-webui\modules\RWKV.py", line 15, in <module>
    from rwkv.model import RWKV
  File "E:\oobabooga\installer_files\env\lib\site-packages\rwkv\model.py", line 29, in <module>
    load(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1592, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1648, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

From ChatRWKV readme:

Note RWKV_CUDA_ON will build a CUDA kernel ("pip install ninja" first).

How to build in Linux: set these and run v2/chat.py

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

How to build in win:

Install VS2022 build tools (https://aka.ms/vs/17/release/vs_BuildTools.exe select Desktop C++). Reinstall CUDA 11.7 (install VC++ extensions). Run v2/chat.py in "x64 native tools command prompt".

@BlinkDL

BlinkDL commented Mar 20, 2023

The problem started when you changed the int8 multiplication to something unsupported. The fix for that is too slow so I have been using 0.4.2.

We will try an fp32i8 kernel soon (similar VRAM usage to fp16i8).

@BarfingLemurs

Can we use this in the web UI with a context length longer than 2048? I would like to see if I could summarize a long article.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 21, 2023

Yes, just rename the settings template and change "chat_prompt_size_max": 2048 to your context length. As long as nothing holds it up on the RWKV module end, it should be good to go, and it was in the past.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 21, 2023

I tried to benchmark the differences between 0.4.2 and 0.7.2 more methodically. There are some gains and some losses depending on the strategy being used. I finally understand what this is doing a little bit better. Of course, ideally it would be a deterministic prompt, but since that is not possible:

0.7.0

--rwkv-strategy "cuda fp16i8"
* cuda [float16, uint8], store 41 layers
Output generated in 82.33 seconds (2.43 tokens/s, 200 tokens)
Output generated in 58.73 seconds (1.14 tokens/s, 67 tokens) << First reply on load.
Output generated in 52.85 seconds (0.91 tokens/s, 48 tokens)
Output generated in 53.30 seconds (0.96 tokens/s, 51 tokens)
Output generated in 82.34 seconds (2.43 tokens/s, 200 tokens)
Output generated in 59.69 seconds (1.37 tokens/s, 82 tokens)
Output generated in 81.81 seconds (2.43 tokens/s, 199 tokens)
Output generated in 55.32 seconds (1.10 tokens/s, 61 tokens)
Output generated in 53.32 seconds (0.96 tokens/s, 51 tokens)

--rwkv-strategy "cuda fp16 *22 -> cuda fp16 *0+"
* cuda [float16, float16], store 22 layers
* cuda [float16, float16], store 0 layers, stream 19 layers
Output generated in 69.03 seconds (0.87 tokens/s, 60 tokens)

--rwkv-strategy "cuda fp16i8 *22 -> cuda fp16 *0+"
* cuda [float16, uint8], store 22 layers
* cuda [float16, float16], store 0 layers, stream 19 layers
Output generated in 76.59 seconds (0.55 tokens/s, 42 tokens)

--rwkv-strategy "cuda fp16i8 *22 -> cuda fp16 *19"
* cuda [float16, uint8], store 22 layers
* cuda [float16, float16], store 19 layers
Output generated in 46.00 seconds (1.91 tokens/s, 88 tokens)
Output generated in 36.95 seconds (1.62 tokens/s, 60 tokens)
Output generated in 42.35 seconds (1.89 tokens/s, 80 tokens)
Output generated in 59.76 seconds (3.35 tokens/s, 200 tokens) <<< Wish it like this all the time!
Output generated in 37.25 seconds (1.91 tokens/s, 71 tokens)
Output generated in 30.76 seconds (1.14 tokens/s, 35 tokens)
Output generated in 32.06 seconds (1.31 tokens/s, 42 tokens)
Output generated in 35.35 seconds (1.73 tokens/s, 61 tokens)

0.4.2

--rwkv-strategy "cuda fp16i8"
* cuda [float16, uint8], store 41 layers
Output generated in 45.99 seconds (1.72 tokens/s, 79 tokens)
Output generated in 105.55 seconds (1.89 tokens/s, 200 tokens)
Output generated in 30.12 seconds (1.69 tokens/s, 51 tokens) << Initial after load.
Output generated in 39.92 seconds (1.95 tokens/s, 78 tokens)
Output generated in 99.93 seconds (2.00 tokens/s, 200 tokens)
Output generated in 32.15 seconds (1.93 tokens/s, 62 tokens)

--rwkv-strategy "cuda fp16 *22 -> cuda fp16 *0+"
* cuda [float16, float16], store 22 layers
* cuda [float16, float16], store 0 layers, stream 19 layers
Output generated in 42.51 seconds (0.82 tokens/s, 35 tokens)
Output generated in 220.15 seconds (0.91 tokens/s, 200 tokens)

--rwkv-strategy "cuda fp16i8 *22 -> cuda fp16 *0+"
* cuda [float16, uint8], store 22 layers
* cuda [float16, float16], store 0 layers, stream 19 layers
Output generated in 76.79 seconds (0.73 tokens/s, 56 tokens)
Output generated in 90.75 seconds (0.77 tokens/s, 70 tokens)

--rwkv-strategy "cuda fp16i8 *22 -> cuda fp16 *19"
* cuda [float16, uint8], store 22 layers
* cuda [float16, float16], store 19 layers
Output generated in 72.45 seconds (2.76 tokens/s, 200 tokens)
Output generated in 69.10 seconds (2.89 tokens/s, 200 tokens)
Output generated in 69.06 seconds (2.90 tokens/s, 200 tokens)
Output generated in 74.14 seconds (2.70 tokens/s, 200 tokens) <<< Initial generation on reload.
Output generated in 25.76 seconds (2.76 tokens/s, 71 tokens)
Output generated in 25.62 seconds (2.77 tokens/s, 71 tokens)
Output generated in 69.20 seconds (2.89 tokens/s, 200 tokens)
Output generated in 19.56 seconds (2.71 tokens/s, 53 tokens) <<< Stop generation at new line again.
Output generated in 24.54 seconds (2.77 tokens/s, 68 tokens)
Output generated in 18.25 seconds (2.68 tokens/s, 49 tokens)
Output generated in 21.89 seconds (2.74 tokens/s, 60 tokens)

All runs have the same initial context: just the character and 1 message. What matters, of course, is the overall it/s and the initial time to reply, since this is used as chat. I shoot for the highest token count to try to get a stable speed. Also, the latest commits to this repo made some changes to speed; the module loads using all cores again.

As you can see, top speed is better in 0.7.0 but the average speed is much worse. The wait before generation starts also feels worse every time; I'm not sure how to measure that.

Ph0rk0z referenced this issue in Ph0rk0z/text-generation-webui-testing Apr 17, 2023
@github-actions github-actions bot added the stale label Apr 21, 2023
@github-actions

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
