Can we get RWKV model family support? #98
That would be quite interesting |
The lack of Hugging Face integration makes this challenging, but it should be possible. |
Here is client code from the author: https://github.com/BlinkDL/ChatRWKV |
Here is the API for RWKV: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py |
ChatRWKV v2: with "stream" and "split" strategies. 3 GB of VRAM is enough to run RWKV 14B :) |
@BlinkDL is it possible to construct a wrapper around the model that simply does something like this?
That would be very helpful. |
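(The snippet this comment originally referred to is not preserved in this copy. As a rough illustration of the kind of thin wrapper being discussed, here is a sketch built on the forward(tokens, state) call shown in api_demo.py and the encode/decode helpers from the later rwkv package; greedy decoding is used only to keep it short, and the variable names are illustrative.)

```python
import torch

def simple_generate(model, pipeline, prompt: str, max_new_tokens: int = 64) -> str:
    # model: a loaded RWKV model; pipeline: a tokenizer wrapper with encode()/decode()
    tokens = pipeline.encode(prompt)           # str -> list of token ids
    out, state = model.forward(tokens, None)   # process the whole prompt once
    generated = []
    for _ in range(max_new_tokens):
        token = int(torch.argmax(out))         # greedy pick; real sampling would use temperature/top-p
        generated.append(token)
        out, state = model.forward([token], state)  # feed one token at a time, reusing the recurrent state
    return prompt + pipeline.decode(generated)
```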
Sure. What about temperature, top-p, and top-k? |
Those things are nice, but not essential at first. FlexGen, for instance, has departed from the Hugging Face classes but still provided a |
Updated https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py
Note: it's slow (not optimized yet) when your prompt is long. Better to keep it as short as possible (for now). |
I added some tips to https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py |
Very nice, @BlinkDL! Thank you for this. I will try incorporating it into the web UI later. |
Now ChatRWKV can quickly preprocess the context if you set os.environ["RWKV_CUDA_ON"] = '1' before loading it :) |
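For reference, a minimal sketch of that ordering, assuming the rwkv package layout used by ChatRWKV v2 (the checkpoint path is a placeholder):

```python
import os
os.environ["RWKV_CUDA_ON"] = '1'   # build/use the custom CUDA kernel for fast context preprocessing
os.environ["RWKV_JIT_ON"] = '1'    # JIT path, as in ChatRWKV's examples

# the environment variables must be set before this import, or the kernel is not used
from rwkv.model import RWKV

# checkpoint path without the .pth extension, following the package's examples
model = RWKV(model='path/to/RWKV-4-Pile-14B', strategy='cuda fp16')
```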
We have a first successful run: #149. I am very impressed with the coherence of the model so far, even though I have used the smallest version. @BlinkDL, I have a few questions:
|
For alpha_frequency and alpha_presence, see "Frequency and presence penalties". Please clone and import until I make the pip package :) I'm too busy at the moment, but I will do it. |
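These are the OpenAI-style repetition penalties. In a sampling loop they are applied to the logits roughly like this (a sketch; variable names are illustrative, not ChatRWKV's exact code):

```python
def apply_penalties(logits, occurrence, alpha_presence, alpha_frequency):
    # occurrence maps token id -> how many times that token has already been generated
    for token_id, count in occurrence.items():
        # presence: flat penalty for any token that has appeared at all;
        # frequency: grows with how often the token has appeared
        logits[token_id] -= alpha_presence + count * alpha_frequency
    return logits
```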
The pip package is here :) |
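A minimal usage sketch of the pip package (pip install rwkv); the checkpoint and tokenizer paths below are placeholders for whatever model you download:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# checkpoint path without the .pth extension, following the package's examples
model = RWKV(model='path/to/RWKV-4-Pile-169M-20220807-8023', strategy='cpu fp32')
pipeline = PIPELINE(model, 'path/to/20B_tokenizer.json')

args = PIPELINE_ARGS(temperature=1.0, top_p=0.85,
                     alpha_frequency=0.25, alpha_presence=0.25)
print(pipeline.generate('\nIn a shocking finding,', token_count=100, args=args))
```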
Is that 3 GB of memory all GPU, or is it offloaded? Would this scale to a 60B model and fit on a single GPU? |
Much easier with the pip package. Some more tweaks and I should be able to merge the PR. |
I have merged the PR; the RWKV models now work in the web UI. Some comments:
This is the full code for the PR: https://github.com/oobabooga/text-generation-webui/pull/149/files. There is probably lots of room for improvement. |
Cool. Please use rwkv==0.0.6 (fixes a bug with temperature on CPU). You can use "strategies" to load the 14B on a 3090, for example 'cuda fp16 *30 -> cpu fp32' [try increasing 30, for better speed, until you run out of VRAM]. |
Done. I have added a new flag to let users specify their custom strategies manually:
As you suggested, |
My streaming implementation is probably really dumb and much slower than it could be:

```python
for i in range(max_new_tokens//8):
    # re-feed the full text and generate 8 more tokens each round
    reply = shared.model.generate(question, token_count=8, temperature=temperature, top_p=top_p)
    yield formatted_outputs(reply, shared.model_name)
    question = reply
```

It would be best to use the |
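(The last sentence above is cut off in this copy.) One lighter-weight approach, assuming the rwkv pipeline is used under the hood, would be its callback parameter: generate() can call a function with each newly decoded chunk of text, so the prompt is processed only once instead of being re-fed every 8 tokens. A sketch, with pipeline, args, question, and max_new_tokens as in the surrounding code:

```python
chunks = []

def on_new_text(s):
    # called by the pipeline with each newly decoded piece of text
    chunks.append(s)
    # a real implementation would push the partial reply to the UI here

pipeline.generate(question, token_count=max_new_tokens, args=args, callback=on_new_text)
reply = question + ''.join(chunks)
```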
@BlinkDL You said earlier |
See the post above you. |
If it replies faster/better than a regular 13B even with the split, it's still something. Plus the faster training time. But I guess we will not get miracles. |
Why not? RWKV is faster than a regular 13B and better on many tasks. |
@BlinkDL but it's 24 GB, not 3 GB. I really wanted to run that on a 3080 Ti, which only has 12 GB. |
"cuda fp16 *12 -> cpu fp32" [try increasing 12, for better speed, until you run out of VRAM] |
Tried the 14B. The coherence is excellent, but the reply times were in the 160-second range. Loaded it like this:
It used about 20 GB of VRAM / 20 GB of RAM. During generation it only used a single core and the GPU was mostly idle. Was that a bug? |
@oobabooga new and better RWKV 14B ctx8192: https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth (there is a 7B ctx8192 too, but it's not as good as ctx4096 when ctxlen < 3k) |
Update ChatRWKV v2 & pip rwkv package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1' |
The new one is blocked by BlinkDL/ChatRWKV#38. |
Got it working with
on a 3080 Ti with 12 GB of VRAM and 64 GB of RAM. |
Latest model: https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth |
Yes, downloading it now after I saw your Reddit post. I asked there: what's the difference between this version and the previous one? |
Update ChatRWKV v2 & pip rwkv package (0.6.0): The os.environ["RWKV_CUDA_ON"] = '1' mode is now 10% faster for f16i8 on all GPUs, and supports old GPUs. Supports "chunk_len" in PIPELINE_ARGS, which can split long inputs into chunks to save VRAM (shorter -> slower). |
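A sketch of where chunk_len goes, assuming the PIPELINE_ARGS interface from the pip package (the other values are just illustrative defaults):

```python
from rwkv.utils import PIPELINE_ARGS

args = PIPELINE_ARGS(temperature=1.0, top_p=0.85,
                     alpha_frequency=0.25, alpha_presence=0.25,
                     chunk_len=256)  # split long prompts into 256-token chunks; smaller saves VRAM but is slower
```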
fixed in latest version :) |
After I updated transformers, it's loading using a single core again, and it is abysmal. Performance is also as you see, since only one CPU core is used now.
I even tried clearing the chat :( No help. |
Update ChatRWKV v2 & pip rwkv package (0.7.0): @Ph0rk0z, what is the problematic transformers version? |
Transformers went from using multiple cores during GPTQ and model loading to a single core, but I did a test and it shows that performance is better, at least for llama. BTW, I was meaning to ask: why can't I get deterministic results from RWKV? I will test the performance of the new update in a minute. |
It's non-deterministic if you use the CUDA kernel, because of non-deterministic floating-point computation order. |
Ahh, that makes some sense. Time to check it. |
For my Pascal-based card, at least, there is no difference.
90 s is the same response time as an offloaded 30B model. I think the fix for older GPUs made it run very slow. |
Please benchmark 0.5.0, 0.6.0, 0.6.1, and 0.6.2 so that I can find the culprit :) |
I have benchmarked before. I checked and made sure that transformers only affected the model load time.
The problem started when you changed the int8 multiplication to something unsupported. The fix for that is too slow, so I have been using 0.4.2. |
I am unable to use the --rwkv-cuda-on option, as I get this error:
|
From the ChatRWKV readme: Note that RWKV_CUDA_ON will build a CUDA kernel ("pip install ninja" first).
How to build on Linux: set these and run v2/chat.py:
export PATH=/usr/local/cuda/bin:$PATH
How to build on Windows: install VS2022 build tools (https://aka.ms/vs/17/release/vs_BuildTools.exe, select Desktop C++), reinstall CUDA 11.7 (install VC++ extensions), and run v2/chat.py in an "x64 native tools command prompt". |
We will try an fp32i8 kernel soon (similar VRAM usage to fp16i8). |
Can we use this in the web UI with a context length longer than 2048? I would like to see if I could summarize a long article. |
Yes, just rename the settings template and change |
I tried to benchmark the differences between 0.4.2 and 0.7.2 more methodically. There are some gains and some losses depending on the strategy being used. I finally understand what this is doing a little bit better. Ideally it would be a deterministic prompt, but since that is not possible:
All runs have the same initial context: just the character and one message. What matters most is overall it/s and the initial time to reply, since this is used for chat. I shoot for the highest token count to try to get a stable speed. Also, the latest commits to this repo made some changes to speed; the module loads using all cores again. As you can see, top speed is better in 0.7.0 but average speed is much worse. The wait before generation starts also feels worse every time; I'm not sure how to measure that. |
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below. |
A very promising RNN-based model (instead of a transformer), claiming leaner memory consumption and faster inference.
Here is the 14B model card:
https://huggingface.co/BlinkDL/rwkv-4-pile-14b