🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton

large-language-models machine-learning-systems natural-language-processing
Python · 1 open issue needs help · Last updated: Jun 30, 2025

Open Issues Need Help


AI Summary: Debug an inconsistency in chunk-wise inference for the GSA and GDN linear-attention models in the Flash Linear Attention library. When full-sequence inference is compared against segmented inference with cache transfer, GSA and GDN produce significantly different outputs, while RWKV7 does not. The task is to reproduce the bug (see the sketch below), analyze the GSA and GDN implementations, and identify the source of the inconsistency, likely in cache handling or internal state management within the Triton kernels.

Complexity: 4/5
Labels: bug, good first issue, urgent
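
A minimal reproduction sketch for the reported discrepancy, assuming the library's transformers-style API (`GSAConfig`/`GSAForCausalLM` from `fla.models`, with `use_cache`/`past_key_values` cache semantics); the exact class names and config fields should be checked against the installed version:

```python
import torch
from fla.models import GSAConfig, GSAForCausalLM  # assumed import path

torch.manual_seed(0)
device = "cuda"  # the Triton kernels require a GPU

# Small hypothetical config so the check runs quickly.
config = GSAConfig(hidden_size=256, num_hidden_layers=2, vocab_size=1024)
model = GSAForCausalLM(config).to(device).eval()

input_ids = torch.randint(0, config.vocab_size, (1, 128), device=device)

with torch.no_grad():
    # Path 1: full-sequence forward pass.
    full_logits = model(input_ids).logits

    # Path 2: split the sequence in half and carry the recurrent
    # state across the boundary via the returned cache.
    mid = input_ids.shape[1] // 2
    out1 = model(input_ids[:, :mid], use_cache=True)
    out2 = model(input_ids[:, mid:],
                 past_key_values=out1.past_key_values,
                 use_cache=True)
    seg_logits = torch.cat([out1.logits, out2.logits], dim=1)

# With correct cache handling the two paths should agree up to
# numerical noise; the issue reports large discrepancies here.
print("max abs diff:", (full_logits - seg_logits).abs().max().item())
```

Swapping the GDN and RWKV7 model classes into the same harness should, per the report, show large differences for GSA and GDN but near-zero differences for RWKV7, which helps localize the bug to those models' cache or state handling.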
