During the prefill stage I call the official flash-attention v2 implementation to compute attention with QKV shape [batch, seq_len, num_heads, head_dim] = [1, 128*1024, 128, 128], and it is quite slow. Are there any optimization ideas for long-context (>128K) attention? Any advice would be appreciated.
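For reference, a minimal sketch of the call described above, assuming the `flash_attn` Python package's `flash_attn_func` and a CUDA device with enough memory for the 128K sequence; the shapes and dtype follow the issue, everything else is illustrative.

```python
import torch
from flash_attn import flash_attn_func

# Shapes from the issue: [batch, seq_len, num_heads, head_dim] = [1, 128*1024, 128, 128]
batch, seq_len, num_heads, head_dim = 1, 128 * 1024, 128, 128
dtype, device = torch.float16, "cuda"

q = torch.randn(batch, seq_len, num_heads, head_dim, dtype=dtype, device=device)
k = torch.randn(batch, seq_len, num_heads, head_dim, dtype=dtype, device=device)
v = torch.randn(batch, seq_len, num_heads, head_dim, dtype=dtype, device=device)

# Prefill: full self-attention over the whole prompt with a causal mask.
# FLOPs scale as O(seq_len^2 * num_heads * head_dim), so at 128K tokens this
# single call is dominated by tensor-core compute rather than memory traffic.
out = flash_attn_func(q, k, v, causal=True)  # [batch, seq_len, num_heads, head_dim]
```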
@caijixueIT This is the classic prefill compute-bound problem; off the top of my head there are not many options:

- the split-KV path in flash-attention (`run_mha_fwd_splitkv_dispatch`)
- `xformers.ops.memory_efficient_attention` (see the sketch after this list)
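A minimal sketch of the xformers alternative mentioned above, assuming `xformers.ops.memory_efficient_attention` with 4D `[batch, seq_len, num_heads, head_dim]` inputs and a causal bias; the shapes match the issue, the rest is illustrative and not the maintainer's exact code.

```python
import torch
import xformers.ops as xops

batch, seq_len, num_heads, head_dim = 1, 128 * 1024, 128, 128
dtype, device = torch.float16, "cuda"

q = torch.randn(batch, seq_len, num_heads, head_dim, dtype=dtype, device=device)
k = torch.randn(batch, seq_len, num_heads, head_dim, dtype=dtype, device=device)
v = torch.randn(batch, seq_len, num_heads, head_dim, dtype=dtype, device=device)

# xformers dispatches to its fastest available fused kernel; for causal prefill,
# pass a lower-triangular bias instead of materializing a [seq_len, seq_len] mask.
out = xops.memory_efficient_attention(
    q, k, v, attn_bias=xops.LowerTriangularMask()
)  # [batch, seq_len, num_heads, head_dim]
```

Whether this helps depends on which backend xformers selects on your hardware; since prefill at this length is compute-bound, the gain over flash-attention v2 may be limited.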