
Can we get RWKV model family support? #98

Closed
ye7iaserag opened this issue Feb 21, 2023 · 87 comments
Labels
enhancement New feature or request stale

Comments

@ye7iaserag
Contributor

A very promising RNN-based model (instead of a transformer), claiming leaner memory consumption and faster inference.

Here is the 14b card
https://huggingface.co/BlinkDL/rwkv-4-pile-14b

@Slug-Cat

That would be quite interesting

@oobabooga
Owner

The lack of Hugging Face integration makes this challenging, but it should be possible.

@ye7iaserag
Contributor Author

Here is client code from the author: https://github.com/BlinkDL/ChatRWKV

@BlinkDL

BlinkDL commented Feb 23, 2023

Here is the API for RWKV: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py
It is very simple to use :) Please join the RWKV Discord if you have any questions.

@BlinkDL

BlinkDL commented Feb 23, 2023

ChatRWKV v2 comes with "stream" and "split" strategies; 3 GB of VRAM is enough to run RWKV 14B :)
ChatRWKV v2 API: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py

import os
os.environ["RWKV_JIT_ON"] = '1'

from rwkv.model import RWKV                         # everything in the /v2/rwkv folder
model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-1b5/RWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')

out, state = model.forward([187, 510, 1563, 310, 247], None)   # use 20B_tokenizer.json
print(out.detach().cpu().numpy())                   # get logits
out, state = model.forward([187, 510], None)
out, state = model.forward([1563], state)           # RNN has state (use deepcopy if you want to clone it)
out, state = model.forward([310, 247], state)
print(out.detach().cpu().numpy())                   # same result as above

@oobabooga
Owner

@BlinkDL is it possible to construct a wrapper to the model that simply does something like this?

prompt = "What I would like to say is: "
completion = model_RWKV.generate(prompt, max_new_tokens=20)
print(completion)

That would be very helpful.

@BlinkDL

BlinkDL commented Feb 23, 2023

@BlinkDL is it possible to construct a wrapper to the model that simply does something like this?

prompt = "What I would like to say is: "
completion = model_RWKV.generate(prompt, max_new_tokens=20)
print(completion)

That would be very helpful.

Sure. How about temp, top-p, and top-k?

@oobabooga
Owner

Those things are nice, but not essential at first. FlexGen for instance has departed from HuggingFace classes but still provided a model.generate function that was trivial to use for clueless people (like me).

@BlinkDL

BlinkDL commented Feb 23, 2023

Those things are nice, but not essential at first. FlexGen for instance has departed from HuggingFace classes but still provided a model.generate function that was trivial to use for clueless people (like me).

Updated: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py

def generate(prompt, max_new_tokens, state=None):
    decoded = ''
    all_tokens = []
    for i in range(max_new_tokens):
        # feed the full prompt on the first step, then one sampled token at a time
        logits, state = model.forward(tokenizer.encode(prompt) if i == 0 else [token], state)
        token = tokenizer.sample_logits(logits, None, None, temperature=1.0, top_p=0.8)
        all_tokens += [token]
        tmp = tokenizer.decode(all_tokens)
        if '\ufffd' not in tmp: # only keep the decode once it is a valid utf-8 string
            decoded = tmp
    return decoded

prompt = "What I would like to say is: "
print(prompt, end='')
completion = generate(prompt, max_new_tokens=20)
print(completion)

Note: it's slow (not optimized yet) when your prompt is long. Better to keep it as short as possible (for now).

@BlinkDL

BlinkDL commented Feb 23, 2023

I added some tips to https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py

@oobabooga
Owner

Very nice, @BlinkDL! Thank you for this. I will try incorporating it into the web UI later.

@oobabooga oobabooga added the enhancement New feature or request label Feb 23, 2023
@oobabooga oobabooga pinned this issue Feb 23, 2023
@BlinkDL

BlinkDL commented Feb 25, 2023

Now ChatRWKV can quickly preprocess the context if you set os.environ["RWKV_CUDA_ON"] = '1' before loading it :)
It will use ninja to compile a CUDA kernel. You can distribute the compiled kernels to those without a compiler.
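
A minimal sketch of how this might be wired up in a loader, assuming the env vars must be set before the rwkv import (the model path and strategy below are placeholders, not recommendations):

import os
# Both flags are read when rwkv.model is imported, so set them first.
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'   # compiles the CUDA kernel with ninja on first use

from rwkv.model import RWKV

# Placeholder path and strategy, for illustration only.
model = RWKV(model='models/RWKV-4-Pile-14B-20230213-8019',
             strategy='cuda fp16 *30 -> cpu fp32')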

@oobabooga
Owner

oobabooga commented Feb 28, 2023

We have a first successful run: #149

[screenshot: first successful RWKV run in the web UI]

I am very impressed with the coherence of the model so far, even though I have used the smallest version.

@BlinkDL I have a few questions:

  • What are alpha_frequency and alpha_presence in terms of Hugging Face parameters? These are the parameters that I am using at the moment:

max_new_tokens, do_sample, temperature, top_p, typical_p, repetition_penalty, top_k, min_length, no_repeat_ngram_size, num_beams, penalty_alpha, length_penalty, early_stopping, eos_token

  • Is there a way of adding the library to my project other than cloning the repository and importing from the rwkv folder?

@BlinkDL

BlinkDL commented Feb 28, 2023

For alpha_frequency and alpha_presence, see "Frequency and presence penalties":
https://platform.openai.com/docs/api-reference/parameter-details
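
For reference, a rough sketch of how these OpenAI-style penalties adjust the logits before sampling. This is my paraphrase of the linked formula, not the rwkv package's internal code, and the 0.25 defaults are illustrative:

import torch

def apply_alpha_penalties(logits, generated_tokens, alpha_frequency=0.25, alpha_presence=0.25):
    # logits: 1-D tensor over the vocabulary; generated_tokens: ids sampled so far
    counts = {}
    for t in generated_tokens:
        counts[t] = counts.get(t, 0) + 1
    for t, n in counts.items():
        # presence penalty fires once per token already seen;
        # frequency penalty grows with how often it has been seen
        logits[t] -= alpha_presence + alpha_frequency * n
    return logits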

Please clone and import until I make the pip package :) Too busy at the moment, but I will do it.

@BlinkDL

BlinkDL commented Mar 1, 2023

The pip package is here :)
https://pypi.org/project/rwkv/
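
A quick smoke test of the package, as a sketch (the model path is a placeholder for whatever .pth file you have downloaded):

# pip install rwkv
import os
os.environ["RWKV_JIT_ON"] = '1'

from rwkv.model import RWKV   # imported from the installed package instead of the cloned /v2/rwkv folder

model = RWKV(model='models/RWKV-4-Pile-169M-20220807-8023', strategy='cpu fp32')
out, state = model.forward([187, 510, 1563, 310, 247], None)   # token ids from 20B_tokenizer.json
print(out.detach().cpu().numpy()[:10])                         # first few logits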

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 1, 2023

Is that 3 GB of memory all GPU, or is it offloaded? Would this scale to a 60B model and fit on a single GPU?

@oobabooga
Owner

Much easier with the pip package. Some more tweaks and I should be able to merge the PR.

@oobabooga
Owner

oobabooga commented Mar 1, 2023

I have merged the PR and RWKV models now work in the web UI. Some comments:

  1. I have made successful tests with RWKV-4-Pile-7B-20221115-8047.pth and RWKV-4-Pile-169M-20220807-8023.pth. I ran out of GPU memory with RWKV-4-Pile-14B-20230213-8019.pth (RTX 3090).
  2. To install a model, just place the .pth file and the 20B_tokenizer.json file inside the models folder.
  3. In order to make the process of loading the model more familiar, I have created a RWKV module with the following methods: RWKVModel.from_pretrained and RWKVModel.generate (a rough usage sketch follows at the end of this comment). Source code: https://github.com/oobabooga/text-generation-webui/blob/main/modules/RWKV.py
  4. Only the top_p and temperature parameters can be changed in the interface for now. I am reluctant to add alpha_frequency and alpha_presence sliders because these parameters are used exclusively by RWKV and this change would break the API. They are left hardcoded to their default values of 0.25.
  5. It is possible to run the models in CPU mode with --cpu. By default, they are loaded onto the GPU.
  6. The best way to try the models is with python server.py --no-stream. That is, without --chat, --cai-chat, etc.
  7. The current implementation should only work on Linux because the rwkv library reads paths as strings. It would be best to accept pathlib.Path variables too.
  8. After loading the model, make sure to lower the temperature to between 0.5 and 1 for best results.
  9. The internal usage of the model required some additional special handling because there is no explicit tokenizer. In HF, the process is to take an input string, convert it into a sequence of tokens, generate, and then convert back to a string.

This is the full code for the PR: https://github.com/oobabooga/text-generation-webui/pull/149/files

There is probably lots of room for improvement.
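
A rough usage sketch of that module (the exact keyword names and paths are assumptions based on the description above; see modules/RWKV.py in the linked PR for the real signatures):

from modules.RWKV import RWKVModel, RWKVTokenizer

# Paths and keyword names are illustrative.
model = RWKVModel.from_pretrained('models/RWKV-4-Pile-169M-20220807-8023.pth',
                                  dtype='fp16', device='cuda')
tokenizer = RWKVTokenizer.from_pretrained('models')   # expects 20B_tokenizer.json in the models folder

reply = model.generate("What I would like to say is: ",
                       token_count=20, temperature=1.0, top_p=0.8)
print(reply)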

@BlinkDL

BlinkDL commented Mar 1, 2023

Cool. Please use rwkv==0.0.6 (it fixes a bug with temperature on CPU).

You can use "strategies" to load the 14B model on a 3090. For example, 'cuda fp16 *30 -> cpu fp32' [try increasing 30, for better speed, until you run out of VRAM].

@oobabooga
Owner

Please use rwkv==0.0.6

Done. I have added a new flag to let users specify their custom strategies manually:

  --rwkv-strategy RWKV_STRATEGY             The strategy to use while loading RWKV models. Examples: "cpu fp32", "cuda fp16", "cuda fp16 *30 -> cpu fp32".

As you suggested, "cuda fp16 *30 -> cpu fp32" allowed me to get a response from the 14b model with 24GB VRAM.

@oobabooga
Owner

My streaming implementation is probably really dumb and much slower than it could be:

            for i in range(max_new_tokens//8):
                reply = shared.model.generate(question, token_count=8, temperature=temperature, top_p=top_p)
                yield formatted_outputs(reply, shared.model_name)
                question = reply

It would be best to use the callback argument but I haven't figured out how yet.
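
For what it's worth, one hedged sketch of using the callback: run the generation in a background thread and push each decoded chunk through a queue. This assumes the callback receives decoded text chunks, as in the ChatRWKV demos; `shared` and `formatted_outputs` are the names used in the loop above, and the import paths are guesses:

from queue import Queue
from threading import Thread

from modules import shared                          # import paths assumed
from modules.text_generation import formatted_outputs

def stream_generate(question, max_new_tokens, temperature, top_p):
    q = Queue()

    def on_chunk(text):                             # called by the model for each decoded chunk
        q.put(text)

    def worker():
        shared.model.generate(question, token_count=max_new_tokens,
                              temperature=temperature, top_p=top_p,
                              callback=on_chunk)
        q.put(None)                                 # sentinel: generation is finished

    Thread(target=worker, daemon=True).start()
    reply = question
    while (chunk := q.get()) is not None:
        reply += chunk
        yield formatted_outputs(reply, shared.model_name)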

@ye7iaserag
Contributor Author

@BlinkDL You said earlier that with ChatRWKV v2's "stream" and "split" strategies, 3 GB of VRAM is enough to run RWKV 14B.
Yet @oobabooga said it went OOM on a 3090 (24 GB of VRAM).
What am I missing?

@BlinkDL

BlinkDL commented Mar 2, 2023

@BlinkDL You said earlier that with ChatRWKV v2's "stream" and "split" strategies, 3 GB of VRAM is enough to run RWKV 14B. Yet @oobabooga said it went OOM on a 3090 (24 GB of VRAM). What am I missing?

See the post above you:
"As you suggested, "cuda fp16 *30 -> cpu fp32" allowed me to get a response from the 14b model with 24GB VRAM."

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 2, 2023

If it replies faster/better than a regular 13b even with the split, it's still something. Plus the faster time to train. But I guess miracles we will not get.

@BlinkDL

BlinkDL commented Mar 2, 2023

If it replies faster/better than a regular 13b even with the split, it's still something. Plus the faster time to train. But I guess miracles we will not get.

Why not? RWKV is faster than regular 13b and better on many tasks.

@ye7iaserag
Contributor Author

@BlinkDL but it's 24 GB, not 3 GB. I really wanted to run it on a 3080 Ti, which only has 12 GB.

@BlinkDL

BlinkDL commented Mar 3, 2023

@BlinkDL but it's 24 GB, not 3 GB. I really wanted to run it on a 3080 Ti, which only has 12 GB.

"cuda fp16 *12 -> cpu fp32" [try increasing 12, for better speed, until you run out of VRAM]

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 3, 2023

Tried the 14B. The coherence is excellent, but the reply times were in the 160-second range.

Loaded it like this:

python server.py --no-stream --model rwkv-4-pile-14b --rwkv-strategy "cuda fp16 *30 -> cpu fp32"
python server.py --cai-chat --no-stream --model rwkv-4-pile-14b --rwkv-strategy "cuda fp16 *30 -> cpu fp32"

It used about 20 GB of VRAM and 20 GB of RAM. During generation it only used a single CPU core and the GPU was mostly idle. Was that a bug?
I think using either more GPU time or more cores would help tremendously.

[screenshot: RWKV 14B generation in the web UI]

@BlinkDL

BlinkDL commented Mar 14, 2023

@oobabooga new and better RWKV 14B ctx8192: https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth

(there is a 7B ctx8192 too, but it is not as good as ctx4096 when ctxlen < 3k)

@BlinkDL

BlinkDL commented Mar 15, 2023

Update ChatRWKV v2 & the pip rwkv package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1'
for 1.5x speed with f16i8 (and 10% less VRAM: now 14686 MB for the 14B model instead of 16462 MB, so you can put more layers on the GPU).
PLEASE VERIFY THE GENERATION QUALITY IS UNCHANGED.
@Ph0rk0z

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 15, 2023

New one is blocked by BlinkDL/ChatRWKV#38

@ye7iaserag
Contributor Author

Got it working with:

python server.py --cai-chat --listen --listen-port=8888 --model=RWKV-4-Pile-14B-20230228-ctx4096-test663.pth --rwkv-cuda-on --rwkv-strategy="cuda fp16i8 *25 -> cpu fp32"

on a 3080 Ti with 12 GB of VRAM and 64 GB of RAM.
Thanks a lot, @BlinkDL and @oobabooga!

@BlinkDL

BlinkDL commented Mar 17, 2023

on a 3080 Ti with 12 GB of VRAM and 64 GB of RAM. Thanks a lot, @BlinkDL and @oobabooga!

Latest model: https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth
:)

@ye7iaserag
Contributor Author

on a 3080 Ti with 12 GB of VRAM and 64 GB of RAM. Thanks a lot, @BlinkDL and @oobabooga!

Latest model: https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth :)

Yes, downloading it now after I saw your Reddit post. I asked there: what is the difference between this version and the previous one?

@BlinkDL

BlinkDL commented Mar 17, 2023

Update ChatRWKV v2 & the pip rwkv package (0.6.0):

The os.environ["RWKV_CUDA_ON"] = '1' mode is now 10% faster for f16i8 on all GPUs, and it supports old GPUs.

It supports "chunk_len" in PIPELINE_ARGS, which can split long inputs into chunks to save VRAM (shorter -> slower).
Default = 256. @oobabooga
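
A sketch of where chunk_len plugs in (parameter names besides chunk_len follow the ChatRWKV demos and may differ between versions; the paths are placeholders):

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

model = RWKV(model='models/RWKV-4-Pile-14B-20230313-ctx8192-test1050',
             strategy='cuda fp16i8 *25 -> cpu fp32')
pipeline = PIPELINE(model, 'models/20B_tokenizer.json')

args = PIPELINE_ARGS(temperature=1.0, top_p=0.8,
                     alpha_frequency=0.25, alpha_presence=0.25,
                     chunk_len=256)   # split long prompts into 256-token chunks to save VRAM
print(pipeline.generate("What I would like to say is: ", token_count=20, args=args))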

@BlinkDL

BlinkDL commented Mar 17, 2023

New one is blocked by BlinkDL/ChatRWKV#38

fixed in latest version :)

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 18, 2023

After I updated transformers it is loading using a single core again and it is abysmal. Performance is as you can see below, since only one CPU core is used now.

0.6.0
Output generated in 90.31 seconds (0.13 tokens/s, 12 tokens)
Output generated in 86.88 seconds (0.12 tokens/s, 10 tokens)
Output generated in 23.34 seconds (0.47 tokens/s, 11 tokens)
Output generated in 24.15 seconds (0.41 tokens/s, 10 tokens)

0.4.2
Output generated in 9.00 seconds (1.00 tokens/s, 9 tokens)
Output generated in 5.08 seconds (1.97 tokens/s, 10 tokens)
Output generated in 3.86 seconds (1.56 tokens/s, 6 tokens)
Output generated in 6.61 seconds (2.12 tokens/s, 14 tokens)
Output generated in 17.30 seconds (2.60 tokens/s, 45 tokens)

I even tried clearing the chat :( no help.

@oobabooga oobabooga unpinned this issue Mar 18, 2023
@BlinkDL

BlinkDL commented Mar 19, 2023

Update ChatRWKV v2 & the pip rwkv package (0.7.0):
Use v2/convert_model.py to convert a model for a given strategy, for faster loading and less CPU RAM usage.

@Ph0rk0z what is the problematic transformers version?

oobabooga added a commit that referenced this issue Mar 19, 2023
@Ph0rk0z
Contributor

Ph0rk0z commented Mar 19, 2023

Transformers went from using multiple cores during GPTQ and model loading to a single core, but I did a test and it proves performance is better, at least for LLaMA. BTW, I was meaning to ask: why can't I get deterministic results from RWKV?

I will test the perf of the new update in a minute.

@BlinkDL

BlinkDL commented Mar 19, 2023

BTW, I was meaning to ask: why can't I get deterministic results from RWKV?

It is non-deterministic if you use the CUDA kernel, because of non-deterministic floating-point computation order.
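
A hedged workaround sketch (my suggestion, not an official rwkv switch): keep the custom CUDA kernel off and seed the samplers, at the cost of the preprocessing speedup:

import os
os.environ["RWKV_CUDA_ON"] = '0'   # stay on the default kernels, avoiding the
                                   # non-deterministic float ordering described above

import numpy as np
import torch
np.random.seed(0)                  # seed both RNGs so top-p/temperature sampling
torch.manual_seed(0)               # draws the same tokens on every run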

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 19, 2023

Ahh, that makes some sense. Time to check it.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 19, 2023

For my Pascal-based card, at least, there is no difference.

0.7.0 
Output generated in 92.55 seconds (0.23 tokens/s, 21 tokens)
Output generated in 83.81 seconds (0.19 tokens/s, 16 tokens)
Output generated in 84.42 seconds (0.08 tokens/s, 7 tokens)

0.4.2 - today
Output generated in 8.07 seconds (0.87 tokens/s, 7 tokens)
Output generated in 4.38 seconds (1.82 tokens/s, 8 tokens)
Output generated in 3.82 seconds (1.57 tokens/s, 6 tokens)
Output generated in 11.94 seconds (2.51 tokens/s, 30 tokens)

90s is the same response time as an offloaded 30B model. I think the fix for older GPUs made it run very slow.

@BlinkDL

BlinkDL commented Mar 19, 2023

90s is the same response time as an offloaded 30B model. I think the fix for older GPUs made it run very slow.

Please benchmark 0.5.0, 0.6.0, 0.6.1, and 0.6.2 so that I can find the culprit :)

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 20, 2023

I have benchmarked before. I checked and made sure that transformers only affected the model load time.

0.4.2
Output generated in 8.07 seconds (0.87 tokens/s, 7 tokens)
Output generated in 4.38 seconds (1.82 tokens/s, 8 tokens)
Output generated in 3.82 seconds (1.57 tokens/s, 6 tokens)

0.5.0 - does not compile due to compute 6.1

0.6.0 
Output generated in 90.31 seconds (0.13 tokens/s, 12 tokens)
Output generated in 86.88 seconds (0.12 tokens/s, 10 tokens)
Output generated in 23.34 seconds (0.47 tokens/s, 11 tokens)

0.7.0
Output generated in 92.55 seconds (0.23 tokens/s, 21 tokens)
Output generated in 83.81 seconds (0.19 tokens/s, 16 tokens)
Output generated in 84.42 seconds (0.08 tokens/s, 7 tokens)

The problem started when you changed the int8 multiplication to something unsupported. The fix for that is too slow so I have been using 0.4.2.

@Slug-Cat

Slug-Cat commented Mar 20, 2023

I am unable to use the --rwkv-cuda-on option, as I get this error:

  File "E:\oobabooga\text-generation-webui\server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "E:\oobabooga\text-generation-webui\modules\models.py", line 90, in load_model
    from modules.RWKV import RWKVModel, RWKVTokenizer
  File "E:\oobabooga\text-generation-webui\modules\RWKV.py", line 15, in <module>
    from rwkv.model import RWKV
  File "E:\oobabooga\installer_files\env\lib\site-packages\rwkv\model.py", line 29, in <module>
    load(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1592, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1648, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

@BlinkDL

BlinkDL commented Mar 20, 2023

I am unable to use the --rwkv-cuda-on option, as I get this error:

  File "E:\oobabooga\text-generation-webui\server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "E:\oobabooga\text-generation-webui\modules\models.py", line 90, in load_model
    from modules.RWKV import RWKVModel, RWKVTokenizer
  File "E:\oobabooga\text-generation-webui\modules\RWKV.py", line 15, in <module>
    from rwkv.model import RWKV
  File "E:\oobabooga\installer_files\env\lib\site-packages\rwkv\model.py", line 29, in <module>
    load(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1592, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "E:\oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py", line 1648, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

From ChatRWKV readme:

Note RWKV_CUDA_ON will build a CUDA kernel ("pip install ninja" first).

How to build in Linux: set these and run v2/chat.py

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

How to build in win:

Install VS2022 build tools (https://aka.ms/vs/17/release/vs_BuildTools.exe select Desktop C++). Reinstall CUDA 11.7 (install VC++ extensions). Run v2/chat.py in "x64 native tools command prompt".

@BlinkDL

BlinkDL commented Mar 20, 2023

The problem started when you changed the int8 multiplication to something unsupported. The fix for that is too slow so I have been using 0.4.2.

We will try an fp32i8 kernel soon (similar VRAM usage to fp16i8).

@BarfingLemurs

Can we use this in the web UI with a context length longer than 2048? I would like to see if I could summarize a long article.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 21, 2023

Yes, just rename the settings template and change "chat_prompt_size_max": 2048 to your context length. As long as nothing holds it up on the RWKV module end, it should be good to go, and it was in the past.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 21, 2023

I tried to benchmark the differences between 0.4.2 and 0.7.2 more methodically. There are some gains and some losses depending on the strategy being used. I finally understand what this is doing a little bit better. Of course, ideally it would be a deterministic prompt, but since that is not possible:

0.7.0

--rwkv-strategy "cuda fp16i8"
* cuda [float16, uint8], store 41 layers
Output generated in 82.33 seconds (2.43 tokens/s, 200 tokens)
Output generated in 58.73 seconds (1.14 tokens/s, 67 tokens) << First reply on load.
Output generated in 52.85 seconds (0.91 tokens/s, 48 tokens)
Output generated in 53.30 seconds (0.96 tokens/s, 51 tokens)
Output generated in 82.34 seconds (2.43 tokens/s, 200 tokens)
Output generated in 59.69 seconds (1.37 tokens/s, 82 tokens)
Output generated in 81.81 seconds (2.43 tokens/s, 199 tokens)
Output generated in 55.32 seconds (1.10 tokens/s, 61 tokens)
Output generated in 53.32 seconds (0.96 tokens/s, 51 tokens)

--rwkv-strategy "cuda fp16 *22 -> cuda fp16 *0+"
* cuda [float16, float16], store 22 layers
* cuda [float16, float16], store 0 layers, stream 19 layers
Output generated in 69.03 seconds (0.87 tokens/s, 60 tokens)

--rwkv-strategy "cuda fp16i8 *22 -> cuda fp16 *0+"
* cuda [float16, uint8], store 22 layers
* cuda [float16, float16], store 0 layers, stream 19 layers
Output generated in 76.59 seconds (0.55 tokens/s, 42 tokens)

--rwkv-strategy "cuda fp16i8 *22 -> cuda fp16 *19"
* cuda [float16, uint8], store 22 layers
* cuda [float16, float16], store 19 layers
Output generated in 46.00 seconds (1.91 tokens/s, 88 tokens)
Output generated in 36.95 seconds (1.62 tokens/s, 60 tokens)
Output generated in 42.35 seconds (1.89 tokens/s, 80 tokens)
Output generated in 59.76 seconds (3.35 tokens/s, 200 tokens) <<< Wish it like this all the time!
Output generated in 37.25 seconds (1.91 tokens/s, 71 tokens)
Output generated in 30.76 seconds (1.14 tokens/s, 35 tokens)
Output generated in 32.06 seconds (1.31 tokens/s, 42 tokens)
Output generated in 35.35 seconds (1.73 tokens/s, 61 tokens)

0.4.2

--rwkv-strategy "cuda fp16i8"
* cuda [float16, uint8], store 41 layers
Output generated in 45.99 seconds (1.72 tokens/s, 79 tokens)
Output generated in 105.55 seconds (1.89 tokens/s, 200 tokens)
Output generated in 30.12 seconds (1.69 tokens/s, 51 tokens) << Initial after load.
Output generated in 39.92 seconds (1.95 tokens/s, 78 tokens)
Output generated in 99.93 seconds (2.00 tokens/s, 200 tokens)
Output generated in 32.15 seconds (1.93 tokens/s, 62 tokens)

--rwkv-strategy "cuda fp16 *22 -> cuda fp16 *0+"
* cuda [float16, float16], store 22 layers
* cuda [float16, float16], store 0 layers, stream 19 layers
Output generated in 42.51 seconds (0.82 tokens/s, 35 tokens)
Output generated in 220.15 seconds (0.91 tokens/s, 200 tokens)

--rwkv-strategy "cuda fp16i8 *22 -> cuda fp16 *0+"
* cuda [float16, uint8], store 22 layers
* cuda [float16, float16], store 0 layers, stream 19 layers
Output generated in 76.79 seconds (0.73 tokens/s, 56 tokens)
Output generated in 90.75 seconds (0.77 tokens/s, 70 tokens)

--rwkv-strategy "cuda fp16i8 *22 -> cuda fp16 *19"
* cuda [float16, uint8], store 22 layers
* cuda [float16, float16], store 19 layers
Output generated in 72.45 seconds (2.76 tokens/s, 200 tokens)
Output generated in 69.10 seconds (2.89 tokens/s, 200 tokens)
Output generated in 69.06 seconds (2.90 tokens/s, 200 tokens)
Output generated in 74.14 seconds (2.70 tokens/s, 200 tokens) <<< Initial generation on reload.
Output generated in 25.76 seconds (2.76 tokens/s, 71 tokens)
Output generated in 25.62 seconds (2.77 tokens/s, 71 tokens)
Output generated in 69.20 seconds (2.89 tokens/s, 200 tokens)
Output generated in 19.56 seconds (2.71 tokens/s, 53 tokens) <<< Stop generation at new line again.
Output generated in 24.54 seconds (2.77 tokens/s, 68 tokens)
Output generated in 18.25 seconds (2.68 tokens/s, 49 tokens)
Output generated in 21.89 seconds (2.74 tokens/s, 60 tokens)

All runs have the same initial context: just the character and 1 message. What matters, of course, is the overall it/s and the initial time to reply, since this is used as chat. I shoot for the highest token count to try to get a stable speed. Also, the latest commits to this repo made some changes to speed; the module loads using all cores again.

As you can see, top speed is better in 0.7.0 but the average speed is much worse. The wait before generation starts also feels worse every time; I'm not sure how to measure that.

Ph0rk0z referenced this issue in Ph0rk0z/text-generation-webui-testing Apr 17, 2023
@github-actions github-actions bot added the stale label Apr 21, 2023
@github-actions

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
