Open Issues Need Help
[Bug] [GDN] [GSA] An obvious difference between full-sequence and chunk-wise inference. — 2 months ago
AI Summary: Debug an inconsistency in the chunk-wise inference of the GSA and GDN linear attention models in the Flash Linear Attention library. The issue arises when comparing full-sequence inference against segmented inference with cache transfer: the outputs diverge significantly for GSA and GDN, but not for RWKV7. The task involves reproducing the bug, analyzing the GSA and GDN implementations, and identifying the source of the inconsistency, which likely lies in cache handling or internal state management within the Triton kernels.
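A minimal sketch of the kind of reproduction described above, assuming an fla model that follows the standard Hugging Face causal-LM interface (`use_cache` / `past_key_values`); the checkpoint name, sequence length, and split point are placeholders, not values taken from the issue.

```python
import torch
from transformers import AutoModelForCausalLM

import fla  # noqa: F401  # assumed to register fla architectures with transformers

# Hypothetical checkpoint identifier; substitute the GSA/GDN model being debugged.
model_name = "fla-hub/gsa-1.3B-100B"
model = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()

# Random token IDs stand in for a real prompt.
input_ids = torch.randint(0, model.config.vocab_size, (1, 64), device="cuda")

with torch.no_grad():
    # Full-sequence inference: one forward pass over all 64 tokens.
    full_logits = model(input_ids).logits

    # Chunk-wise inference: feed the sequence in two halves,
    # carrying the recurrent state forward via past_key_values.
    out1 = model(input_ids[:, :32], use_cache=True)
    out2 = model(input_ids[:, 32:], past_key_values=out1.past_key_values, use_cache=True)
    chunked_logits = torch.cat([out1.logits, out2.logits], dim=1)

# The issue reports a large gap here for GSA/GDN but not for RWKV7.
print((full_logits - chunked_logits).abs().max())
```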
Complexity: 4/5
Labels: bug, good first issue, urgent
🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton
Python
#large-language-models #machine-learning-systems #natural-language-processing