Open Issues Need Help
[Bug] [GDN] [GSA] An obvious difference between full-sequence and chunk-wise inference. — 2 months ago
AI Summary: Debug an inconsistency in the chunk-wise inference of the GSA and GDN linear attention models in the Flash Linear Attention library. The issue arises when comparing full-sequence inference against segmented inference with cache transfer: the outputs diverge significantly for GSA and GDN, but not for RWKV7. The task involves reproducing the bug, analyzing the GSA and GDN implementations, and identifying the source of the inconsistency, which likely lies in cache handling or internal state management within the Triton kernels.
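A minimal sketch of the kind of reproduction described above, assuming an fla model that follows the standard Hugging Face causal-LM interface (`use_cache` / `past_key_values`); the checkpoint name, sequence length, and split point are placeholders, not values taken from the issue.

```python
import torch
from transformers import AutoModelForCausalLM

import fla  # noqa: F401  # assumed to register fla architectures with transformers

# Hypothetical checkpoint identifier; substitute the GSA/GDN model being debugged.
model_name = "fla-hub/gsa-1.3B-100B"
model = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()

# Random token IDs stand in for a real prompt.
input_ids = torch.randint(0, model.config.vocab_size, (1, 64), device="cuda")

with torch.no_grad():
    # Full-sequence inference: one forward pass over all 64 tokens.
    full_logits = model(input_ids).logits

    # Chunk-wise inference: feed the sequence in two halves,
    # carrying the recurrent state forward via past_key_values.
    out1 = model(input_ids[:, :32], use_cache=True)
    out2 = model(input_ids[:, 32:], past_key_values=out1.past_key_values, use_cache=True)
    chunked_logits = torch.cat([out1.logits, out2.logits], dim=1)

# The issue reports a large gap here for GSA/GDN but not for RWKV7.
print((full_logits - chunked_logits).abs().max())
```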
Complexity: 4/5
Labels: bug, good first issue, urgent
🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton
Python
#large-language-models #machine-learning-systems #natural-language-processing