Open Issues Need Help
A framework for efficient model inference with omni-modality models
AI Summary: A bug is reported where `AttributeError: 'FastAPI' object has no attribute 'state'` is raised after the instance shuts down. The traceback indicates the error occurs during the FastAPI HTTP server shutdown process within the vLLM Omni project.
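The failure mode points to a teardown-ordering race. A minimal defensive sketch of the kind of guard that avoids it, with illustrative names only (this is not the vLLM-Omni server code):

```python
# Hypothetical shutdown guard; names are illustrative, not taken from vLLM-Omni.
from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.engine = object()  # stand-in for the inference engine
    try:
        yield
    finally:
        # During shutdown, attribute access can race with teardown,
        # so fail soft instead of raising AttributeError.
        state = getattr(app, "state", None)
        engine = getattr(state, "engine", None)
        if engine is not None:
            pass  # release engine resources here

app = FastAPI(lifespan=lifespan)
```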
AI Summary: This RFC proposes optimizations for the HunYuanImage diffusion model, focusing on improving performance by targeting the Attention and MoE modules, which account for the majority of execution time. Key areas for improvement include adding CI benchmarks, ensuring accuracy alignment with the reference implementation, and exploring parallelization techniques for CFG, attention, and VAE tiling.
AI Summary: This issue requests community help to establish a performance baseline for the `fish-speech-S2` model within `vllm-omni`. The goal is to benchmark its current performance against a reference like `sglang-omni` to identify specific bottlenecks in areas like scheduling, memory management, or operator efficiency, guiding future optimizations.
AI Summary: This RFC proposes a new stage-based serving architecture for vLLM-Omni to handle multi-stage models like Diffusion and Omni. The core idea is to treat each stage as an independent deployment unit, allowing for flexible scaling ratios (e.g., 1:N:M) to optimize throughput and latency based on individual stage characteristics.
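As a rough illustration of the stage-as-deployment-unit idea (the field names below are hypothetical, not the RFC's configuration surface):

```python
# Hypothetical illustration of per-stage scaling; not the actual vLLM-Omni API.
from dataclasses import dataclass

@dataclass
class StageSpec:
    name: str              # e.g. "thinker", "diffusion", "vocoder"
    replicas: int          # scaled independently of the other stages
    gpus_per_replica: int

# A 1:2:4 ratio: one heavy autoregressive stage feeds two diffusion
# replicas, which feed four lightweight decoder replicas.
pipeline = [
    StageSpec("autoregressive", replicas=1, gpus_per_replica=2),
    StageSpec("diffusion", replicas=2, gpus_per_replica=1),
    StageSpec("decoder", replicas=4, gpus_per_replica=1),
]
```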
AI Summary: This issue proposes adding support for LongCat-AudioDiT, a novel zero-shot Text-to-Speech (TTS) model from Meituan. Unlike existing models, it operates directly in a waveform latent space using a Wav-VAE and a DiT diffusion backbone, bypassing traditional mel-spectrograms and vocoders. The model boasts SOTA speaker similarity and supports multiple languages via its UMT5 text encoder.
AI Summary: This issue proposes adding support for the OpenMOSS-Team/MOSS-TTS model to vllm-omni. The request is minimal, lacking details on existing similar models, difficulty, or use cases, but indicates a desire to integrate this new text-to-speech model.
AI Summary: This feature request asks when `vllm-omni` will add support for the `wan2.2-fun-a14b-inp` model.
AI Summary: This feature request proposes adding the ability to trigger model-specific performance tests in vLLM-Omni using tags like PR labels, comments, or commit tags. This would allow for more flexible and on-demand benchmarking of different models without manual CI pipeline adjustments.
AI Summary: This RFC outlines the Q2 2026 roadmap for vLLM-Omni NPU development, focusing on transitioning from initial setup to production readiness. Key goals include enhancing CI/CD pipelines for NPU to match GPU efficiency and test coverage, and improving the performance and scalability of diffusion models on NPU through compile and attention backends.
AI Summary: This issue proposes adding support for the VoXtream2 Text-to-Speech model to vllm-omni. VoXtream2 is a smaller, zero-shot TTS model with dynamic speaking-rate control and a multi-stage transformer architecture. While it shares similarities with existing supported models like Qwen3-TTS, it utilizes a custom codec and a distinct 3-stage pipeline that will require specific implementation.
AI Summary: This is a Request for Comments (RFC) issue to gather ideas for the vLLM-Omni roadmap for Q2 2026. It outlines potential areas of development including new model support (Omni, World, TTS, Diffusion), feature enhancements (quantization, streaming, RL integration), large-scale deployment strategies, and hardware support (ROCm, Intel XPU, Ascend NPU). The goal is to collect feedback before finalizing the roadmap.
AI Summary: This RFC proposes to unify Rotary Position Embedding (RoPE) implementations across various models in vllm-omni. Currently, there's code duplication and inconsistent performance due to different implementation patterns, some of which are inefficient. The goal is to leverage optimized implementations already present in vLLM core and apply consistent, efficient patterns to all models.
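For orientation, the computation being duplicated is the standard pairwise rotation; a minimal PyTorch sketch of one RoPE variant (vLLM core ships optimized kernels that models should call instead of reimplementing this):

```python
# Minimal interleaved-RoPE sketch, for reference only.
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: [..., seq, head_dim]; rotate (even, odd) pairs by position-dependent angles.
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions[:, None].float() * inv_freq[None, :]  # [seq, head_dim // 2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```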
AI Summary: This RFC outlines the Q2 2026 development roadmap for Text-to-Speech (TTS) in vLLM-Omni, focusing on making TTS easy, reliable, and composable. Key themes include making streaming a core feature for both input and output, and enabling TTS to act as a composable layer that can be easily integrated with any existing vLLM text model.
AI Summary: This RFC proposes a performance optimization roadmap for the WAN2.2 image-to-video diffusion model within the vLLM-OMNI framework. The goal is to significantly reduce inference latency on both GPU and NPU hardware, making it suitable for real-world applications and improving hardware usability. The plan involves optimizing operators, distributed strategies, and NPU-specific execution models.
AI Summary: This RFC outlines the Q2 roadmap for the Qwen-Omni family of models, focusing on improving performance metrics like time-to-first-token/audio and enhancing support for long, multi-turn, and streaming sessions. Key goals include achieving production parity with upstream vLLM scheduling features such as prefix caching and chunked prefill, and expanding model and feature support.
AI Summary: This issue proposes adding support for the TADA (Hume AI) Text-to-Speech (TTS) model to vllm-omni. TADA is a novel speech-language model built on Llama 3.2 that generates speech tokens directly from text. Implementing this would likely involve a two-stage pipeline similar to existing TTS models like Qwen3 TTS.
AI Summary: This RFC proposes implementing Pipeline Parallelism (PP) and Stream Batching for real-time video generation in vLLM-Omni. The current Sequence Parallelism (SP) approach is inefficient for the low-token, memory-bound workloads of streaming, leading to high latency. PP and Stream Batching are presented as a solution to achieve sub-second latency and enable scalable world model serving.
AI Summary: This RFC outlines the Q2 2026 roadmap for the vLLM-Omni diffusion module, focusing on improving user experience and core functionality. Key priorities include developing diffusion model recipes, implementing robust parameter validation, and enhancing continuous batching for diffusion models to boost throughput and enable streaming job management.
AI Summary: This RFC proposes restructuring the vLLM-Omni test suite to improve maintainability and ergonomics. Key changes include splitting the monolithic root `conftest.py` into scope-specific files, moving support modules out of `conftest`, and reducing duplicated code. The goal is to make tests easier to reason about, reduce startup costs, and improve isolation.
AI Summary: This issue proposes adding support for Microsoft's VibeVoice TTS models, specifically VibeVoice-Realtime-0.5B for streaming and VibeVoice-TTS-1.5B for long-form multi-speaker generation. The integration leverages existing vLLM infrastructure and the models' next-token diffusion architecture.
AI Summary: This RFC proposes refactoring the DiffusionEngine to address significant performance bottlenecks. Key issues include CPU starvation due to busy-waiting, a sequential bottleneck in request processing, inaccurate telemetry reporting, and redundant computations. The proposed changes aim to improve concurrency, optimize resource utilization, and enhance code maintainability.
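The busy-waiting fix, sketched in isolation (illustrative, not the DiffusionEngine code): replace polling with an event-driven wait so the CPU is yielded until work actually arrives.

```python
# Event-driven request queue sketch; avoids spinning on an empty queue.
import asyncio

class RequestQueue:
    def __init__(self) -> None:
        self._items: list = []
        self._new_item = asyncio.Event()

    def put(self, req) -> None:
        self._items.append(req)
        self._new_item.set()

    async def get_batch(self) -> list:
        while not self._items:
            self._new_item.clear()
            await self._new_item.wait()  # yields the event loop instead of spinning
        batch, self._items = self._items, []
        return batch
```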
AI Summary: This RFC proposes adding support for "World Models" to vLLM-Omni, which are models that predict how the world evolves in response to actions, enabling real-time interactive loops for applications like robotics and interactive simulations. This requires handling new input/output modalities (action sequences, multi-view images, embodiment state) and implementing a multi-turn, stateful API that accumulates context, unlike the current stateless, unidirectional streaming.
AI Summary: This issue proposes integrating the Covo-Audio-Chat model, an end-to-end audio LLM, into vllm-omni. The non-duplex version can be handled as a two-stage pipeline, similar to existing Qwen2.5-Omni integrations, while the full-duplex variant would require additional streaming support.
AI Summary: This issue requests the addition of the Kimi-Audio-7B model to the vllm-omni library. The user has identified Qwen3-TTS as the closest currently supported model. The request is tagged as 'help wanted' and 'good first issue', suggesting it's intended to be straightforward.
AI Summary: This issue proposes the integration of PrismAudio, a novel video-to-audio generation framework, into vllm-omni. PrismAudio utilizes a diffusion-based architecture with chain-of-thought reasoning and multi-objective reinforcement learning to generate synchronized environmental sounds and sound effects from video content. The proposed integration suggests leveraging existing diffusion infrastructure within vllm-omni.
AI Summary: This RFC proposes adding support for the Wan2.2-I2V-A14B image-to-video model to the vllm-omni multimodal generation framework. The goal is to enable high-performance inference for this popular video generation model by integrating its unique architecture and optimizing memory usage within vllm-omni's existing capabilities.
AI Summary: This feature request proposes decoupling the VAE (Variational Autoencoder) from the main diffusion model (DiT/UNet) within the vllm-omni pipeline. Currently, both models reside on the same GPU, leading to high memory pressure, especially for large VAEs and high-resolution generation. By treating the VAE as a separate stage, it can be moved to a different GPU or offloaded, significantly reducing peak GPU memory usage and enabling more efficient resource utilization.
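The core of the proposal, sketched in plain PyTorch (device placement and names are illustrative):

```python
# Illustrative sketch: keep the DiT on cuda:0, host the VAE on cuda:1,
# and move only the small latent tensor between devices.
import torch

def decode_on_vae_device(vae: torch.nn.Module, latents: torch.Tensor,
                         vae_device: str = "cuda:1") -> torch.Tensor:
    latents = latents.to(vae_device, non_blocking=True)
    with torch.no_grad():
        return vae(latents)  # stand-in for the pipeline's vae.decode(...)
```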
AI Summary: This issue proposes adding support for the SoulX-Duplug-0.6B model, a plug-and-play semantic VAD module for real-time speech conversation. Key challenges include integrating a specific speech tokenizer (GLM-4-Voice), implementing streaming chunk-based inference with block-causal attention, and orchestrating a hybrid inference pipeline that uses an external ASR model.
AI Summary: This issue proposes integrating AudioX, a multimodal audio generation diffusion model, into vllm-omni. AudioX supports text, video, and audio as conditioning inputs and uses a diffusion-based architecture without an autoregressive component, making it potentially suitable for vllm-omni's existing image generation serving patterns.
AI Summary: This issue proposes integrating the Fun-CineForge model, a zero-shot movie dubbing system, into vllm-omni. Fun-CineForge utilizes a two-stage architecture involving a multimodal large language model (MLLM) for semantic token generation and flow matching with DiT for mel-spectrogram synthesis, with added support for multimodal inputs and speaker switching.
AI Summary: This RFC proposes refactoring the Qwen3-Omni and Qwen2.5-Omni thinker implementations to eliminate duplicate code by directly leveraging the main vLLM repository. The goal is to improve maintainability by reducing redundant code.
AI Summary: The user is experiencing unexpectedly high GPU VRAM usage with the Qwen-Image model in vllm-omni, even when `gpu_memory_utilization` is set very low. This suggests a significant memory overhead beyond model weights, potentially related to the diffusion stage and the `gpu_memory_utilization` parameter's effectiveness.
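Independent of vLLM-Omni, the allocated-versus-reserved split is easy to inspect and usually narrows down where the overhead beyond weights sits (allocator cache, activation workspaces, CUDA context):

```python
# Plain PyTorch memory probe for diagnosing the reported overhead.
import torch

def report(tag: str) -> None:
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

report("after model load")  # call again after the first generation
```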
AI Summary: This RFC proposes adding support for low-precision quantization techniques, specifically MXFP4 and advanced methods like FourOverSix, to vLLM-Omni for multimodal models. The primary goal is to significantly reduce memory usage and improve inference efficiency for large models like Wan2.2, making them more accessible while maintaining high accuracy.
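As a rough illustration of the block-scaling idea behind MX formats (deliberately simplified: real MXFP4 uses an FP4 element type with power-of-two shared scales over 32-element blocks):

```python
# Simplified block-scaled 4-bit fake quantization, for intuition only.
import torch

def block_fake_quant(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    assert x.numel() % block == 0
    shape = x.shape
    xb = x.reshape(-1, block)
    scale = xb.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.round(xb / scale).clamp(-8, 7)  # signed 4-bit grid
    return (q * scale).reshape(shape)         # dequantized, for accuracy checks
```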
AI Summary: The GitHub issue proposes adding support for CosyVoice 2/3, a popular LLM-based streaming Text-to-Speech (TTS) model. It leverages a Qwen2.5-0.5B backbone to generate speech tokens via Finite Scalar Quantization (FSQ), followed by causal flow matching for audio synthesis. The integration aims to benefit from `vllm-omni`'s existing transformer optimizations, offering low-latency streaming and multi-language support.
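The FSQ step itself is small enough to sketch; a minimal single-level-count version (practical FSQ tokenizers configure per-channel level counts):

```python
# Minimal Finite Scalar Quantization sketch: bound each channel, then
# round it onto a small fixed integer grid.
import torch

def fsq(z: torch.Tensor, levels: int = 8) -> torch.Tensor:
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half      # squash into [-half, half]
    quantized = torch.round(bounded)    # snap to one of `levels` values
    # straight-through estimator keeps gradients flowing during training
    return bounded + (quantized - bounded).detach()
```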
AI Summary: This RFC proposes a significant refactoring of the output processing pipeline in vllm-omni to address dead code and rigid string-based routing. The plan involves simplifying the existing processor, introducing a Modality Registry, a Router abstraction, and composable output handlers to better support multi-output models.
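A hypothetical sketch of the registry-plus-router shape (names are illustrative, not the RFC's API):

```python
# Modality registry sketch: handlers register by name, and routing
# becomes a lookup instead of string-matching if/else chains.
from typing import Any, Callable

_HANDLERS: dict[str, Callable[[Any], Any]] = {}

def register_modality(name: str):
    def wrap(handler: Callable[[Any], Any]) -> Callable[[Any], Any]:
        _HANDLERS[name] = handler
        return handler
    return wrap

@register_modality("audio")
def handle_audio(output: Any) -> Any:
    return output  # decode audio tokens to a waveform here

def route(modality: str, output: Any) -> Any:
    return _HANDLERS[modality](output)
```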
AI Summary: This RFC proposes adding generic hooks to vLLM-Omni to support multi-stage CFG inference, specifically for models like BagelPipeline. This allows Stage-1 diffusion models to receive and utilize multi-branch conditional KV caches from Stage-0 autoregressive models, improving image quality by enabling CFG without altering the core framework.
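A generic hook mechanism of the kind described might look like the following (a sketch under assumed names, not the proposed API):

```python
# Named hook points let a pipeline inject stage-to-stage data (e.g.
# multi-branch conditional KV caches) without touching framework core.
from collections import defaultdict
from typing import Any, Callable

class Hooks:
    def __init__(self) -> None:
        self._hooks: dict[str, list[Callable[[Any], Any]]] = defaultdict(list)

    def on(self, point: str, fn: Callable[[Any], Any]) -> None:
        self._hooks[point].append(fn)

    def fire(self, point: str, payload: Any) -> Any:
        for fn in self._hooks[point]:
            payload = fn(payload)  # e.g. attach Stage-0 KV caches for CFG branches
        return payload
```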
AI Summary: This RFC proposes to consolidate and rename various internal data structures related to model execution into a single concept called `model_intermediate_buffer`. The goal is to eliminate ambiguity, standardize data transfer between components, and make the transfer scope explicit in model configuration, without altering backend implementations or generation logic.
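Hypothetically, the unified structure could be as small as the following (field names and scope values are illustrative, not the RFC's):

```python
# Illustrative only; the actual consolidation may choose different fields.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ModelIntermediateBuffer:
    tensors: dict[str, Any] = field(default_factory=dict)  # hidden states, KV, latents
    scope: str = "stage"  # transfer scope made explicit in model configuration
```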
AI Summary: This issue proposes adding MVP support for the PersonaPlex (nvidia/personaplex-7b-v1) model to vLLM-Omni. The primary challenge lies in PersonaPlex's full-duplex speech-to-speech capabilities, which differ from vLLM-Omni's current staged approach. The MVP will focus on turn-based speech input/output integration, excluding full-duplex features for now.
AI Summary: This RFC proposes adding the ability to cancel in-progress and queued diffusion generation requests. The goal is to improve resource utilization and user experience by allowing users to abort long-running tasks, similar to how LLM requests can be cancelled in vLLM. The current implementation has partial support, with open issues related to interruption granularity, pipeline dependencies, endpoint coverage, and resource cleanup verification.
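Cooperative cancellation between denoising steps is the usual pattern here, which is also why interruption granularity is an open question; a sketch (not the vLLM-Omni implementation):

```python
# Check a cancellation flag between steps; the interruption granularity
# is therefore one full denoising step.
import threading
from typing import Any, Callable

def denoise(latents: Any, steps: int, step_fn: Callable[[Any, int], Any],
            cancel: threading.Event) -> Any:
    for t in range(steps):
        if cancel.is_set():
            raise RuntimeError("request cancelled")
        latents = step_fn(latents, t)
    return latents
```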
AI Summary: This issue proposes adding native support for Fully Sharded Data Parallel (FSDP) to vLLM-Omni. FSDP is crucial for efficiently training and running inference on large models that exceed single-GPU memory, especially when Tensor Parallelism (TP) is not performant for certain model architectures. Implementing FSDP would align vLLM-Omni with industry best practices and enable users to handle larger models and potentially achieve better performance.
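For orientation, standard PyTorch FSDP wrapping (plain `torch.distributed.fsdp`, not vLLM-Omni's eventual integration) looks like:

```python
# Standard FSDP usage; assumes torchrun has initialized the process group.
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard(model):
    assert dist.is_initialized()
    return FSDP(model)  # parameters are sharded across ranks, gathered per layer on use
```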
AI Summary: This RFC proposes to change vLLM's GPU memory accounting to be process-scoped, rather than relying on a utilization heuristic. This aims to improve the speed and reliability of concurrent Omni stage initialization by preventing overlapping memory allocations from interfering with each other and enabling parallel stage loading.
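PyTorch already exposes a process-scoped cap that such accounting could build on (shown as a plain PyTorch call, not the proposed vLLM change):

```python
# Cap this process at 40% of device 0's memory; allocations beyond that
# raise OOM in this process only, so concurrently initializing stages
# cannot silently overlap.
import torch

torch.cuda.set_per_process_memory_fraction(0.4, device=0)
```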
AI Summary: This issue proposes adding support for configuration arguments to the online serving functionality, which currently does not accept them. The motivation is to unlock significant performance gains that this limitation currently blocks.
AI Summary: The user is requesting documentation for online serving of Qwen-Image-Layered models. They have provided a link to the vLLM-Omni documentation but noted that the specific example for this model is missing. This is a straightforward request for content addition to existing documentation.
AI Summary: This RFC proposes enhancing vLLM-Omni's support for state-of-the-art DiT (Diffusion Transformer) models, encompassing image, video, and any-to-any generation. It aims to extend the existing Qwen-Image implementation and actively seeks community contributions for this rapidly evolving domain.
AI Summary: This issue proposes adding support for the LLADA 2.0 Series of models to vllm-omni. The user notes that none of the existing models in vllm-omni are close to LLADA 2.0 and highlights that it's a diffusion language model requiring a diffusion engine, indicating potential complexity.
AI Summary: This GitHub issue requests support for the PyTorch profiler in vLLM-Omni, as its current absence makes performance bottleneck analysis difficult for multimodal/diffusion workloads. Users have confirmed that setting `profile=True` or configuring `VLLM_TORCH_PROFILER_DIR` does not activate the profiler. Maintainers acknowledge this as a planned but unscheduled feature.
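In the meantime, wrapping a single step with the stock PyTorch profiler works against any code path (plain `torch.profiler` usage, not a vLLM-Omni flag):

```python
# Stock torch.profiler usage; replace the body with one forward or
# denoising step of the workload being analyzed.
import torch
from torch.profiler import ProfilerActivity, profile

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.randn(1024, 1024, device="cuda") @ torch.randn(1024, 1024, device="cuda")

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```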