-
Hi, although the README shows a small example, I'm still running into various issues trying RWKV7 with FLA. The models here have quite complicated architectures compared with Transformers. Could you add a subfolder to the repo that offers examples for the different models?
-
To my surprise, this is the very first discussion.
-
@lidh15 Hello, we have a unified API.
-
because we only created this org in 2025 :-)
-
@lidh15 Hi, could you benchmark generation with the following command: python benchmark_generation.py --path fla-hub/rwkv7-168M-pile
-
It's a known issue that it's slower than the official RWKV7 implementation.
The first reason was the Triton-based group norm and l2norm, which has been fixed.
The second reason is addcmul: fla fuses the six token-shift mixing tensors into a single call,
xr, xw, xk, xv, xa, xg = hidden_states.addcmul(delta, self.x_x.view(6, 1, 1, -1)).unbind(0)
which is much, much slower. However, I'm still trying to figure out how to fix this without a breaking change. You can find examples here: https://huggingface.co/fla-hub/rwkv7-1.5B-world
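For readers unfamiliar with that one-liner, here is a minimal NumPy sketch of what the fused addcmul computes, independent of PyTorch. The shapes (batch, sequence length, hidden dim) are illustrative assumptions, not the model's real sizes; the point is that a single broadcasted multiply-add over a stacked (6, B, T, D) tensor is numerically equivalent to six separate multiply-adds, one per mixing coefficient.

```python
import numpy as np

# Illustrative shapes (assumptions, not RWKV7's actual dimensions).
B, T, D = 2, 4, 8
rng = np.random.default_rng(0)
hidden = rng.random((B, T, D)).astype(np.float32)   # hidden_states
delta = rng.random((B, T, D)).astype(np.float32)    # token-shift delta
x_x = rng.random((6, D)).astype(np.float32)         # six mixing vectors

# Fused form, mirroring
#   hidden_states.addcmul(delta, self.x_x.view(6, 1, 1, -1)).unbind(0)
# addcmul(t, t1, t2) computes t + t1 * t2 with broadcasting, so the
# result broadcasts up to shape (6, B, T, D).
fused = hidden + delta * x_x.reshape(6, 1, 1, D)
xr, xw, xk, xv, xa, xg = fused

# Unfused reference: six independent multiply-adds give the same values.
for i, x in enumerate((xr, xw, xk, xv, xa, xg)):
    assert np.allclose(x, hidden + delta * x_x[i])
```

The fused form trades six small elementwise kernels for one larger broadcasted one plus an unbind; whether that wins depends on kernel launch overhead versus the cost of materializing the stacked (6, B, T, D) intermediate, which is the trade-off discussed above.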