Open Issues Need Help
AI Summary: This issue proposes adding expert parallelism support to the FSDP backend, so that the experts of small-scale Mixture-of-Experts (MoE) models such as Qwen3-30B-A3B can be distributed across multiple devices during training.
AI Summary: This issue proposes implementing the Truncated Importance Sampling (TIS) algorithm within the FSDP backend. TIS corrects the probability mismatch between the rollout engine and the training backend by weighting each token's loss with a clipped importance ratio, improving training stability for off-policy updates.
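The per-token weighting can be sketched as follows; the helper name, the plain-float interface, and the default clip value of 2.0 are illustrative assumptions, not slime's actual API:

```python
import math

def tis_weight(logp_train: float, logp_rollout: float, clip: float = 2.0) -> float:
    # Importance ratio between the trainer's and the rollout engine's
    # token probabilities, truncated from above to bound variance.
    # (Hypothetical helper; names and default clip are assumptions.)
    return min(math.exp(logp_train - logp_rollout), clip)
```

When the two backends agree on a token's log-probability the weight is exactly 1, so well-aligned tokens are unaffected and only mismatched tokens are reweighted.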
AI Summary: The FSDP backend currently uses default parameters and needs optimization. This involves tuning FSDP-specific hyperparameters like sharding strategies, mixed precision, and communication overlap, and systematically surveying optimal configurations across various workloads. A potential migration from FSDP1 to FSDP2 is also being considered.
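As one illustration of the knobs such a survey would cover, here is a hedged FSDP2-style configuration sketch, assuming PyTorch >= 2.6's `fully_shard`/`MixedPrecisionPolicy` API and a `model` object exposing a `.layers` list of transformer blocks (both assumptions; this is a fragment, not slime's actual setup code, and it requires an initialized process group to run):

```python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard

# Example knobs to survey (values are illustrative, not recommendations):
mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,   # compute/communicate parameters in bf16
    reduce_dtype=torch.float32,   # keep gradient reduction in fp32
)
for block in model.layers:        # shard at transformer-block granularity
    fully_shard(block, mp_policy=mp_policy, reshard_after_forward=True)
fully_shard(model, mp_policy=mp_policy)
```

Sharding granularity, `reshard_after_forward`, and the mixed-precision dtypes are exactly the kind of parameters whose optimal values vary by workload.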
AI Summary: This issue requires verifying that the newly integrated FSDP backend for 'slime' can be successfully installed by users through standard package managers, e.g. `uv pip install` or `pip install`. The goal is to ensure a smooth and frictionless adoption process for the FSDP backend.
AI Summary: The FSDP backend needs to implement the Group Sequence Policy Optimization (GSPO) algorithm, which replaces per-token importance ratios with a length-normalized sequence-level ratio. Support for this algorithm is needed to enable stable Mixture-of-Experts (MoE) model training within the FSDP framework.
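GSPO's central quantity is the sequence-level ratio s(θ) = (π_θ(y|x) / π_old(y|x))^(1/|y|), i.e. the geometric mean of the per-token ratios. A minimal sketch on plain floats (helper name and list-based interface are illustrative):

```python
import math

def gspo_seq_ratio(logps_new, logps_old):
    # Length-normalized sequence-level importance ratio:
    #   exp( mean_t( logp_new[t] - logp_old[t] ) )
    # Equivalent to the geometric mean of per-token ratios.
    n = len(logps_new)
    return math.exp(sum(a - b for a, b in zip(logps_new, logps_old)) / n)
```

Because the ratio is computed once per sequence rather than per token, it averages out the token-level probability noise that destabilizes MoE training under per-token ratios.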
AI Summary: This issue proposes adding native sequence parallelism support to FSDP, using approaches like DeepSpeed Ulysses or Ring Attention. This enhancement aims to improve memory efficiency and scalability for very long sequence training by distributing sequence length across devices. Implementation should follow the completion of issue #293, and `zhuzilin/ring-flash-attention` is recommended for Ring Attention.
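The key data movement in the Ulysses approach is an all-to-all that switches each rank from "all heads, partial sequence" to "a subset of heads, full sequence". A pure-Python simulation of that layout swap, assuming for simplicity that the number of heads equals the sequence-parallel world size (no real collective is issued):

```python
def ulysses_all_to_all(shards_by_rank):
    # Before: shards_by_rank[r][h] is rank r's sequence shard for head h.
    # After:  result[r] is the FULL sequence for head r, so each rank can
    # run attention over the whole sequence for its assigned head.
    world = len(shards_by_rank)
    return [
        [tok for r in range(world) for tok in shards_by_rank[r][h]]
        for h in range(world)
    ]
```

Ring Attention avoids this resharding by instead circulating key/value blocks around the ranks, which is why `zhuzilin/ring-flash-attention` is the recommended starting point for that variant.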
AI Summary: The FSDP backend currently lacks support for a reference model, which is essential for accurate KL loss computation in training paradigms like RLHF or DPO. This issue proposes adding this capability to enable correct and efficient KL loss calculation under FSDP. The implementation should also ensure numerical alignment with the Megatron backend for reliability.
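Given per-token log-probabilities from the policy and the frozen reference model, the KL term is commonly estimated with the low-variance "k3" estimator r − log r − 1, where r = p_ref / p_policy. A sketch on plain floats (the list-based helper is illustrative, not slime's interface):

```python
import math

def kl_to_ref(logp_policy, logp_ref):
    # Per-token k3 KL estimator: ratio - log(ratio) - 1.
    # Non-negative by construction, zero when the two models agree.
    out = []
    for lp, lr in zip(logp_policy, logp_ref):
        log_ratio = lr - lp
        out.append(math.exp(log_ratio) - log_ratio - 1.0)
    return out
```

Numerical alignment with the Megatron backend would then amount to checking that both backends produce matching `logp_ref` values for identical rollout data.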
AI Summary: The FSDP (Fully Sharded Data Parallel) backend is missing an `UpdateWeightFromDistributed` implementation for its disaggregated mode. Without it, weights cannot be updated correctly when the training and rollout processes run on different GPUs.
AI Summary: The current 'slime' system lacks support for multimodal data, specifically for Vision-Language Models (VLMs). This issue aims to implement a basic data pipeline capable of handling paired image-text inputs to enable VLM training.
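A basic pipeline for paired image-text inputs reduces to collating heterogeneous samples into a batch; a minimal sketch, where the sample schema (`pixel_values`/`input_ids` keys) and the pad id are assumptions rather than slime's actual data format:

```python
def collate_vlm_batch(samples, pad_id=0):
    # Each sample: {"pixel_values": <image data>, "input_ids": [ints]}.
    # (Schema is an assumption.) Token ids are padded to the batch max
    # length; image data is kept as a per-sample list since shapes vary.
    max_len = max(len(s["input_ids"]) for s in samples)
    return {
        "input_ids": [
            s["input_ids"] + [pad_id] * (max_len - len(s["input_ids"]))
            for s in samples
        ],
        "pixel_values": [s["pixel_values"] for s in samples],
    }
```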
AI Summary: The FSDP data pipeline currently suffers from inefficiencies due to fixed batch sizes and extensive padding for variable-length sequences. The proposed solution is to implement data packing to tightly group sequences and reduce wasted computation. Additionally, the system should support dynamic micro batch size adjustment at runtime to further optimize training efficiency.
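The packing step can be sketched as a greedy first-fit bin-packer over sequence lengths (a common approach; the helper below is illustrative, not slime's implementation):

```python
def pack_sequences(lengths, capacity):
    # Greedy first-fit: place each sequence into the first bin whose
    # total token count stays within `capacity`; open a new bin otherwise.
    # Returns lists of sequence indices, one list per packed micro batch.
    bins, loads = [], []
    for i, n in enumerate(lengths):
        for b, load in enumerate(loads):
            if load + n <= capacity:
                bins[b].append(i)
                loads[b] += n
                break
        else:
            bins.append([i])
            loads.append(n)
    return bins
```

Because each bin is bounded by a token budget rather than a fixed sequence count, the number of micro batches per step naturally varies with the data, which is exactly the dynamic micro batch sizing the issue asks for.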
AI Summary: This issue highlights a significant GPU memory overhead in FSDP when performing parameter updates under a colocated training configuration. The current implementation loads the entire model and optimizer states into memory simultaneously, which limits scalability and efficiency. The goal is to optimize this process to reduce memory consumption.
AI Summary: This issue focuses on ensuring the newly introduced FSDP backend achieves numerical precision equivalent to the existing Megatron backend. The goal is to verify that key computations (forward pass, backward pass, optimizer updates) produce consistent results within acceptable tolerance, using specific debugging tools to fix rollout data for comparison.
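With rollout data fixed, the comparison reduces to measuring element-wise discrepancy between the two backends' outputs against a tolerance; a minimal sketch (plain-float helper, illustrative):

```python
def max_rel_err(a, b, eps=1e-12):
    # Worst-case element-wise relative error between two backends' outputs,
    # with eps guarding against division by zero near-zero values.
    return max(abs(x - y) / max(abs(x), abs(y), eps) for x, y in zip(a, b))
```

The same check, applied to logits, gradients, and post-step parameters in turn, localizes which of the forward pass, backward pass, or optimizer update first diverges between FSDP and Megatron.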