A framework for efficient model inference with omni-modality models

Python
audio-generation diffusion image-generation inference model-serving multimodal pytorch transformer video-generation
100 Open Issues Need Help
Last updated: Mar 18, 2026

[RFC]: World Model Support about 14 hours ago

AI Summary: This RFC proposes adding support to vLLM-Omni for "World Models": models that predict how the world evolves in response to actions, enabling real-time interactive loops for applications such as robotics and interactive simulations. Supporting them requires handling new input/output modalities (action sequences, multi-view images, embodiment state) and implementing a multi-turn, stateful API that accumulates context, unlike the current stateless, unidirectional streaming.

Complexity: 4/5
help wanted good first issue new model high priority

AI Summary: The OmniVoice online-serving example client fails because it hardcodes the voice parameter to 'default', which is not supported by the OmniVoice server. The server rejects this with an HTTP 400 error. The suggested fix is to either remove the 'voice' key from the request payload or make it an optional argument.

Complexity: 1/5
bug good first issue
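
The suggested fix can be sketched as a payload builder that only sends `voice` when the caller sets it. This is a minimal illustration, not the actual example client: the function name, the `input` field, and the voice name `alice` are assumptions.

```python
from typing import Optional


def build_speech_payload(text: str, voice: Optional[str] = None) -> dict:
    """Build a request body, adding 'voice' only when the caller sets it."""
    payload = {"input": text}  # hypothetical field name
    if voice is not None:
        # No hardcoded 'default': the key is omitted unless explicitly given.
        payload["voice"] = voice
    return payload


# Without a voice the key is absent, so the server never sees 'default'.
print(build_speech_payload("Hello"))
# A caller who knows a supported voice can still pass one (hypothetical name).
print(build_speech_payload("Hello", voice="alice"))
```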

AI Summary: This RFC proposes adding a new `recipes/` directory to vLLM-Omni to host community-maintained runbooks. These recipes will guide users on how to run specific models on particular hardware for various tasks, addressing current discoverability issues and aligning with the structure of the upstream vLLM recipes repository.

Complexity: 2/5
help wanted good first issue high priority

help wanted high priority

help wanted new model

AI Summary: This issue proposes the integration of PrismAudio, a novel video-to-audio generation framework, into vllm-omni. PrismAudio utilizes a diffusion-based architecture with chain-of-thought reasoning and multi-objective reinforcement learning to generate synchronized environmental sounds and sound effects from video content. The proposed integration suggests leveraging existing diffusion infrastructure within vllm-omni.

Complexity: 3/5
help wanted good first issue new model

help wanted good first issue high priority

help wanted good first issue new model

AI Summary: This issue proposes integrating the Fun-CineForge model, a zero-shot movie dubbing system, into vllm-omni. Fun-CineForge utilizes a two-stage architecture involving a multimodal large language model (MLLM) for semantic token generation and flow matching with DiT for mel-spectrogram synthesis, with added support for multimodal inputs and speaker switching.

Complexity: 3/5
help wanted good first issue new model

AI Summary: This issue proposes the integration of the MiniMind-O model family into vLLM-Omni. MiniMind-O is an end-to-end Omni model capable of processing text, audio, and vision inputs, and generating text and streaming audio output. The integration involves adding support for the Mimi codec and a Multi-Token Prediction (MTP) audio head, while reusing existing architecture components.

Complexity: 3/5
help wanted good first issue

AI Summary: This RFC proposes optimizations for the HunYuanImage diffusion model, focusing on improving performance by targeting the Attention and MoE modules, which account for the majority of execution time. Key areas for improvement include adding CI benchmarks, ensuring accuracy alignment with the reference implementation, and exploring parallelization techniques for CFG, attention, and VAE tiling.

Complexity: 4/5
help wanted good first issue new model high priority

AI Summary: This issue proposes adding support for the SoulX-Singer model, an open-source zero-shot Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC) model. The model is trained on a large dataset and supports melody or score conditioning. A key challenge is its non-autoregressive pipeline, which differs from existing AR models in vllm-omni.

Complexity: 4/5
help wanted good first issue tts

help wanted good first issue new model

AI Summary: This RFC proposes implementing Pipeline Parallelism (PP) and Stream Batching for real-time video generation in vLLM-Omni. The current Sequence Parallelism (SP) approach is inefficient for the low-token, memory-bound workloads of streaming, leading to high latency. PP and Stream Batching are presented as a solution to achieve sub-second latency and enable scalable world model serving.

Complexity: 4/5
help wanted good first issue high priority

AI Summary: This issue proposes adding support for Stability AI's new Stable Audio 3 models, which are successors to a previously supported model. Key differences include a new autoencoder, a standalone package instead of a diffusers pipeline, and variable-length generation requiring scheduler adjustments.

Complexity: 3/5
good first issue new model

AI Summary: This RFC outlines the Q2 2026 development roadmap for Text-to-Speech (TTS) in vLLM-Omni, focusing on making TTS easy, reliable, and composable. Key themes include making streaming a core feature for both input and output, and enabling TTS to act as a composable layer that can be easily integrated with any existing vLLM text model.

Complexity: 4/5
help wanted good first issue high priority

AI Summary: This RFC proposes a new stage-based serving architecture for vLLM-Omni to handle multi-stage models like Diffusion and Omni. The core idea is to treat each stage as an independent deployment unit, allowing for flexible scaling ratios (e.g., 1:N:M) to optimize throughput and latency based on individual stage characteristics.

Complexity: 4/5
enhancement help wanted high priority roadmap
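
One way to read the 1:N:M idea is to size each stage in proportion to how long it takes per request. The sketch below is an assumption about how such a ratio might be derived, not the RFC's method; the function name and the stage timings are hypothetical.

```python
from typing import List


def balanced_replicas(stage_ms: List[int]) -> List[int]:
    """Give each stage replicas in proportion to its per-request time.

    The fastest stage gets one replica; slower stages get ceil(t / fastest),
    so every stage can sustain at least the fastest stage's request rate.
    """
    fastest = min(stage_ms)
    # -(-a // b) is integer ceiling division, which avoids float rounding.
    return [-(-t // fastest) for t in stage_ms]


# Hypothetical per-request stage times in milliseconds.
ratio = balanced_replicas([100, 400, 200])
print(":".join(str(r) for r in ratio))
```

With these made-up timings each stage ends up with the same requests-per-second capacity, which is the property a per-stage scaling ratio is meant to provide.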

AI Summary: This feature request aims to add word-level timestamps to the streaming output of Text-to-Speech (TTS) models. This is crucial for voice-agent use cases like barge-in and mid-utterance interruption, allowing the system to precisely identify where a user interrupts. The proposed solution involves a model-agnostic forced aligner utility and an extension to the streaming response schema.

Complexity: 3/5
enhancement good first issue tts
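
A rough sketch of what an extended streaming chunk and a barge-in lookup could look like. The class names, field names, and the convention that timestamps are measured in seconds from the start of the utterance are all assumptions; the issue summary does not specify the schema.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class WordTimestamp:
    word: str
    start_s: float  # seconds from the start of the utterance (assumed)
    end_s: float


@dataclass
class AudioChunk:
    audio: bytes
    # Hypothetical schema extension: words aligned to this chunk's audio.
    words: List[WordTimestamp] = field(default_factory=list)


def last_word_spoken(
    chunks: Iterable[AudioChunk], interrupt_s: float
) -> Optional[WordTimestamp]:
    """Return the last word that had started playing at the interrupt time."""
    spoken = None
    for chunk in chunks:
        for w in chunk.words:
            if w.start_s <= interrupt_s:
                spoken = w
    return spoken


stream = [
    AudioChunk(b"...", [WordTimestamp("hello", 0.0, 0.4)]),
    AudioChunk(b"...", [WordTimestamp("world", 0.5, 0.9)]),
]
hit = last_word_spoken(stream, interrupt_s=0.6)
print(hit.word if hit else None)
```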

AI Summary: This issue requests the addition of the HiDream-O1-Image model to vLLM-Omni. The user highlights it as a state-of-the-art text-to-image open-weights model, suggesting it's a valuable addition. The closest supported model is Sensenova-U1, which might offer some existing architectural similarities.

Complexity: 3/5
help wanted new model

AI Summary: This issue proposes adding support for the Sana-WM video generation model to vllm-omni. Sana-WM is a 2.6B parameter model capable of generating 720p videos up to 1 minute long, controlled by camera parameters. Its implementation requires integrating a Gemma text encoder, a DiT backbone, a dual-branch camera control module, and an LTX2 video VAE tokenizer, with new components needed for camera control and attention backends.

Complexity: 4/5
help wanted new model

AI Summary: This RFC proposes enhancing the nightly CI test coverage for vLLM-Omni core models, including functionality, performance, and accuracy tests. The goal is to improve the stability and observability of the nightly pipeline to detect regressions early, with specific models like Wan2.2, Hunyuan-Image 3.0, Qwen-Image, Qwen3-TTS, and Qwen3-Omni identified for monitoring.

Complexity: 3/5
help wanted NPU

AI Summary: This RFC proposes porting fused kernels from SGLang to vLLM-Omni to improve performance on CUDA, HIP, XPU, and MUSA platforms for Qwen3-Omni-class models. The goal is to close the performance gap by implementing fusion for diffusion DiT components like RMSNorm, LayerNorm, RoPE, and adaLN scale-shift, without altering existing behavior on non-CUDA platforms.

Complexity: 4/5
help wanted good first issue

AI Summary: This issue proposes adding support for the Anima text-to-image diffusion model to vLLM-Omni. The primary challenge lies in adapting the system to handle 'Diffusion Single File' checkpoints, as opposed to the currently supported repository-style loading. Successful integration would broaden vLLM-Omni's image generation capabilities, particularly for anime and illustration content.

Complexity: 3/5
help wanted new model

help wanted good first issue high priority diffusion

AI Summary: This issue proposes adding support for the 'Lance' model from ByteDance to the vllm-omni library. The user indicates that the WAN family of models is the closest existing supported model, suggesting a potential starting point for integration.

Complexity: 2/5
help wanted new model

AI Summary: This RFC proposes adding support for the Wan-AI/Wan2.2-S2V-14B model to vllm-omni. This model is a state-of-the-art audio-driven system capable of generating lip-synced videos from input images and audio.

Complexity: 3/5
help wanted good first issue new model

AI Summary: This issue proposes adding support for the TADA (Hume AI) Text-to-Speech (TTS) model to vllm-omni. TADA is a novel speech-language model built on Llama 3.2 that generates speech tokens directly from text. Implementing this would likely involve a two-stage pipeline similar to existing TTS models like Qwen3 TTS.

Complexity: 3/5
help wanted good first issue new model

AI Summary: This RFC outlines the Q2 2026 roadmap for vLLM-Omni NPU development, focusing on transitioning from initial setup to production readiness. Key goals include enhancing CI/CD pipelines for NPU to match GPU efficiency and test coverage, and improving the performance and scalability of diffusion models on NPU through compile and attention backends.

Complexity: 4/5
help wanted good first issue NPU high priority Hardware Plugin

AI Summary: This is a Request for Comments (RFC) issue to gather ideas for the vLLM-Omni roadmap for Q2 2026. It outlines potential areas of development including new model support (Omni, World, TTS, Diffusion), feature enhancements (quantization, streaming, RL integration), large-scale deployment strategies, and hardware support (ROCm, Intel XPU, Ascend NPU). The goal is to collect feedback before finalizing the roadmap.

Complexity: 2/5
help wanted good first issue roadmap

AI Summary: This RFC proposes a continuous tracking system for diffusion model recipe readiness within the vLLM-Omni project. It aims to maintain a matrix of supported diffusion models against available in-repo recipes, indicating their readiness status (ready, partial, or not ready) to facilitate contributions and ensure comprehensive support.

Complexity: 2/5
help wanted good first issue diffusion
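
The readiness matrix could be kept as simple data plus a classification rule. In this sketch the required recipe kinds, the model names, and the rule itself (all kinds present means ready, some means partial) are assumptions for illustration; the RFC summary only names the three statuses.

```python
from typing import Dict, Set

# Hypothetical recipe kinds a model needs before it counts as "ready".
REQUIRED_KINDS = ("offline", "online")


def readiness(available: Set[str]) -> str:
    """Classify a model by how many required recipe kinds exist in the repo."""
    have = sum(1 for kind in REQUIRED_KINDS if kind in available)
    if have == len(REQUIRED_KINDS):
        return "ready"
    return "partial" if have else "not ready"


# Placeholder model names mapped to the recipes found for them.
matrix: Dict[str, Set[str]] = {
    "model-a": {"offline", "online"},
    "model-b": {"offline"},
    "model-c": set(),
}
for model, recipes in matrix.items():
    print(f"{model}: {readiness(recipes)}")
```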

AI Summary: This RFC outlines the Q2 2026 roadmap for the vLLM-Omni diffusion module, focusing on improving user experience and core functionality. Key priorities include developing diffusion model recipes, implementing robust parameter validation, and enhancing continuous batching for diffusion models to boost throughput and enable streaming job management.

Complexity: 4/5
enhancement good first issue high priority diffusion

AI Summary: This RFC addresses significant audio gaps in Qwen3-TTS streaming at high throughput (c ≥ 64), impacting the primary business workload of Qwen3-TTS-Base + voice_clone. Current performance is far below the SLO, with analysis revealing that throughput fixes alone are insufficient, necessitating parallel scheduler improvements. The issue highlights asymmetric performance bottlenecks between voice_clone (Stage 0 bound) and CV (Stage 1 bound), and points to limitations in current batching and caching strategies.

Complexity: 4/5
help wanted high priority tts

documentation enhancement help wanted good first issue

AI Summary: This RFC proposes adding support for NVIDIA's Alpamayo 1.5, a 10B-parameter Vision-Language-Action (VLA) model for autonomous driving, to vLLM-Omni. The model's architecture is similar to existing supported models like BAGEL, and it leverages a natively supported Qwen3-VL backbone. The primary new challenge is handling the trajectory output modality, which differs from current image/video/audio outputs.

Complexity: 3/5
help wanted world model

AI Summary: This issue proposes adding support for SongGeneration 2 (LeVo 2), a hybrid LLM-Diffusion model for full-song generation from Tencent AI Lab. The model has a complex multi-stage architecture involving both language modeling and diffusion for generating high-fidelity music. Integration would require mapping its components to existing vllm-omni pipelines.

Complexity: 4/5
help wanted good first issue tts

AI Summary: This issue requests the addition of the HiDream-O1-Image model to vllm-omni. The user points to the model's GitHub repository and notes that HiDream-I1 is already supported, suggesting a potential similarity that could aid integration. However, details on the specific implementation challenges or use cases are not provided.

Complexity: 3/5
help wanted good first issue new model

help wanted good first issue new model

AI Summary: This RFC proposes a collaboration roadmap between Mooncake and vLLM-Omni to improve disaggregated inference for omni-modality models. The plan focuses on enhancing Qwen-omni disaggregation, optimizing the Transfer Engine connector, introducing high-performance communication backends, and implementing efficient multimodal embedding data sharing.

Complexity: 4/5
help wanted high priority

AI Summary: This issue proposes adding support for Vevo2, a new unified controllable model for speech and singing voice generation. Vevo2 uses a two-stage approach: an AR content-style stage for controllability and a flow-matching acoustic stage for timbre control. The architecture has similarities to existing models in vllm-omni, suggesting a potential integration path.

Complexity: 3/5
help wanted good first issue tts

AI Summary: This issue proposes adding support for SongGen, a single-stage auto-regressive transformer for text-to-song generation. It leverages a similar architecture to existing models like Qwen3-TTS and uses X-Codec for audio tokenization, making it a potentially straightforward integration.

Complexity: 2/5
help wanted good first issue tts

AI Summary: This issue addresses the ambiguity in deploy YAML field ownership and values that arose during a recent migration. The goal is to classify these fields by ownership and precedence, distinguishing between deploy-level, stage-level, user-tunable, and model-owned configurations. The recommendation is to keep the current compatibility PR focused and tackle this classification in subsequent, smaller steps.

Complexity: 3/5
good first issue frontend

AI Summary: This issue proposes adding support for the OpenMOSS-Team/MOSS-TTS model to vllm-omni. The request is minimal, lacking details on existing similar models, difficulty, or use cases, but indicates a desire to integrate this new text-to-speech model.

Complexity: 2/5
help wanted new model

help wanted good first issue new model

AI Summary: This issue proposes adding support for the Omni-Diffusion model to the vllm-omni library. The user has provided a link to the model's GitHub repository and identified Bagel/HY-Image as the closest currently supported model. The request is labeled as 'help wanted' and 'good first issue', suggesting it's intended for community contribution.

Complexity: 2/5
help wanted good first issue new model

AI Summary: This issue requests the addition of the Meta Tuna series of models to the vllm-omni library. The user has provided a link to the model's repository and identified 'bagel' as the closest currently supported model. Further details on difficulty, use case, and motivation are missing.

Complexity: 2/5
help wanted good first issue new model

help wanted good first issue new model

help wanted new model

AI Summary: This bug report indicates that the `profiler_config` field in the deploy YAML is not being passed correctly after a recent code change (#3078). The issue stems from fields not on a whitelist being set to null, preventing expected inheritance of vLLM engine arguments and the proper listing of omni-specific arguments.

Complexity: 3/5
bug help wanted high priority
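
The failure mode described above can be reproduced in a few lines. This is only an illustration of the behaviour in the summary, not the project's code: the whitelist contents, the function names, and the config values are hypothetical, and `keep_unlisted` is one possible alternative rather than the chosen fix.

```python
from typing import Any, Dict

# Hypothetical whitelist; the real field names are not given in the summary.
WHITELIST = {"max_num_seqs"}


def null_unlisted(config: Dict[str, Any]) -> Dict[str, Any]:
    """Failure mode: keys outside the whitelist are kept but set to None."""
    return {k: (v if k in WHITELIST else None) for k, v in config.items()}


def keep_unlisted(config: Dict[str, Any]) -> Dict[str, Any]:
    """One possible alternative: forward unlisted keys unchanged."""
    return dict(config)


# Made-up deploy values for illustration.
deploy = {"max_num_seqs": 4, "profiler_config": {"profiler": "torch"}}
print(null_unlisted(deploy)["profiler_config"])
print(keep_unlisted(deploy)["profiler_config"])
```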

AI Summary: This issue proposes adding support for the new DeepSeek Janus model to the vLLM-Omni library. The user indicates that 'bagel&hy-image' is the closest currently supported model, suggesting a potential starting point for integration. The issue is tagged as 'help wanted' and 'high priority'.

Complexity: 3/5
help wanted new model high priority

AI Summary: This feature request proposes decoupling the VAE (Variational Autoencoder) from the main diffusion model (DiT/UNet) within the vllm-omni pipeline. Currently, both models reside on the same GPU, leading to high memory pressure, especially for large VAEs and high-resolution generation. By treating the VAE as a separate stage, it can be moved to a different GPU or offloaded, significantly reducing peak GPU memory usage and enabling more efficient resource utilization.

Complexity: 4/5
enhancement help wanted

AI Summary: This issue reports a performance problem with Qwen-Image-2512 model deployment using vllm-omni on Huawei Ascend 910B3 NPUs. When configured with two cards using Tensor Parallelism (TP2), Data Parallelism (DP2), or Pipeline Parallelism (PP2), the single image generation time is similar to a single-card setup, failing to leverage multi-card advantages. Concurrent generation also shows no performance improvement or even a decrease in throughput.

Complexity: 4/5
help wanted good first issue NPU

AI Summary: This issue requests the addition of the VoxCPM2 model to the vLLM-Omni library. The user has provided a link to the model's repository and noted that VoxCPM #2467 is the closest supported model. The request is labeled as 'help wanted' and 'good first issue'.

Complexity: 2/5
help wanted good first issue new model

AI Summary: This RFC proposes refactoring the DiffusionEngine to address significant performance bottlenecks. Key issues include CPU starvation due to busy-waiting, a sequential bottleneck in request processing, inaccurate telemetry reporting, and redundant computations. The proposed changes aim to improve concurrency, optimize resource utilization, and enhance code maintainability.

Complexity: 4/5
help wanted high priority

AI Summary: The CUDNN_ATTN backend crashes when using torch.compile with LTX-2.0 audio attention due to symbolic head dimensions. The cuDNN SDPA backend selector requires static head dimensions, which are not met during tracing, leading to a TorchRuntimeError. Workarounds include using FLASHINFER_ATTN or TORCH_SDPA.

Complexity: 3/5
help wanted

AI Summary: This RFC outlines the roadmap for vLLM-Omni on Intel XPU hardware for Q2 2026. It aims to expand model coverage, introduce new features like Sequence Parallel and XPU Graph, improve quantization support, and enhance user experience with streaming capabilities and documentation. The plan also includes performance optimizations for specific models and advancements in memory management.

Complexity: 4/5
help wanted Hardware Plugin roadmap

help wanted new model

documentation help wanted

AI Summary: This RFC proposes improvements to vLLM-Omni's metrics user experience by providing consistent per-stage timing for all pipelines. The goal is to clearly report input preprocessing time, avoid misleading diffusion-only timings, and make online serving and offline benchmarks easier to compare, while keeping default logs concise.

Complexity: 3/5
help wanted good first issue high priority
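
Consistent per-stage timing could be collected with a small context manager, so preprocessing is reported next to the diffusion stage instead of being hidden. A minimal sketch using only the standard library; the `StageTimer` class and the stage names are hypothetical, not part of vLLM-Omni.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulate wall-clock seconds per named pipeline stage."""

    def __init__(self) -> None:
        self.seconds: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed


timer = StageTimer()
with timer.stage("preprocess"):  # hypothetical stage names
    sum(range(1000))
with timer.stage("diffusion"):
    sum(range(1000))
print({name: round(sec, 6) for name, sec in timer.seconds.items()})
```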

AI Summary: This RFC proposes consolidating the default values for engine arguments from three sources (argparse, OmniEngineArgs dataclass, StageDeployConfig dataclass) into a single source: StageDeployConfig. The goal is to simplify the precedence logic by making defaults in argparse and OmniEngineArgs sentinel values (None), relying solely on the merge layer's existing guard for precedence. This refactor aims to eliminate ambiguity in how user-provided versus default values are handled, particularly for programmatic callers.

Complexity: 3/5
help wanted high priority
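
The single-source-of-defaults idea can be sketched as follows: the dataclass owns the defaults, every other layer passes `None` to mean "not set", and the merge step only overrides on non-`None` values. `StageDefaults` is a stand-in for `StageDeployConfig`; its fields, their values, and the `resolve` helper are assumptions for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class StageDefaults:
    """Stand-in for StageDeployConfig; field names and values are made up."""
    max_batch_size: int = 8
    dtype: str = "auto"


def resolve(overrides: Dict[str, Optional[Any]]) -> StageDefaults:
    """Apply overrides on top of defaults, treating None as 'not provided'."""
    resolved = StageDefaults()
    for f in fields(StageDefaults):
        value = overrides.get(f.name)
        if value is not None:  # the merge-layer guard
            setattr(resolved, f.name, value)
    return resolved


# argparse-style input where unset options arrive as None sentinels.
print(resolve({"max_batch_size": None, "dtype": "bfloat16"}))
```

Because unset values are always `None`, a programmatic caller who passes a value equal to the default is still distinguishable from one who passes nothing, which is the ambiguity the RFC summary describes.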

AI Summary: This RFC proposes performance optimizations for the GLM-Image model, specifically targeting the AR module which currently consumes a significant portion of inference time. The plan includes analyzing performance bottlenecks, improving scheduling with streaming and asynchronous chunking, and exploring quantization for the Matmul operations within the AR module.

Complexity: 4/5
help wanted good first issue NPU high priority

AI Summary: The user is reporting a crash when using the SageAttention backend version 2.2.0 with the Hunyuan-Video model. The issue occurs specifically during text-to-video generation, and the user has provided reproduction steps and environment details. A baseline comparison without SageAttention is also included.

Complexity: 3/5
bug help wanted

AI Summary: This RFC proposes integrating NVIDIA ModelOpt quantized checkpoints (FP8 and NVFP4) into vLLM-Omni for diffusion and video generation models. Currently, vLLM-Omni cannot consume these optimized checkpoints, which leads to performance bottlenecks. The proposed solution involves auto-detecting ModelOpt metadata and leveraging existing quantization kernel infrastructure to enable efficient deployment of these compute-intensive models.

Complexity: 4/5
help wanted good first issue

AI Summary: The user is reporting a bug where the Qwen3-TTS-CustomVoice model, when used with vLLM-Omni, ignores the fixed seed setting. Despite configuring a seed (e.g., 42), the backend logs show a seed of 0, leading to non-deterministic audio outputs. This prevents the generation of consistent, reproducible audio.

Complexity: 3/5
bug help wanted good first issue tts

AI Summary: This RFC proposes a performance optimization roadmap for the WAN2.2 image-to-video diffusion model within the vLLM-Omni framework. The goal is to significantly reduce inference latency on both GPU and NPU hardware, making it suitable for real-world applications and improving hardware usability. The plan involves optimizing operators, distributed strategies, and NPU-specific execution models.

Complexity: 4/5
help wanted good first issue NPU high priority

AI Summary: This RFC proposes adding the ability to cancel in-progress and queued diffusion generation requests. The goal is to improve resource utilization and user experience by allowing users to abort long-running tasks, similar to how LLM requests can be cancelled in vLLM. The current implementation has partial support, with open issues related to interruption granularity, pipeline dependencies, endpoint coverage, and resource cleanup verification.

Complexity: 4/5
enhancement help wanted

help wanted good first issue new model

help wanted good first issue

AI Summary: This RFC proposes adding support for the Wan2.2-I2V-A14B image-to-video model to the vLLM-Omni multimodal generation framework. The goal is to enable high-performance inference for this popular video generation model by integrating its unique architecture and optimizing memory usage within vLLM-Omni's existing capabilities.

Complexity: 4/5
help wanted

AI Summary: This RFC proposes adding a new 'Reconstruction Engine' to vLLM-Omni to support 3D world model inference, starting with the VGGT model. This engine will use a chunked feed-forward approach with a sliding window KV cache to efficiently process sequential 3D data, enabling streaming input and bounding memory usage.

Complexity: 4/5
help wanted new model
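
The chunked feed-forward loop with a sliding-window cache described in the summary above can be illustrated with a toy sketch. It is not the proposed Reconstruction Engine or VGGT: `SlidingWindowCache` and `process_stream` are hypothetical names, and plain integers stand in for the per-frame key/value tensors a real engine would store.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


class SlidingWindowCache:
    """Keeps at most `window` most recent entries; older ones are evicted."""

    def __init__(self, window: int) -> None:
        # deque(maxlen=...) drops items from the left once it is full.
        self.entries: Deque[int] = deque(maxlen=window)

    def extend(self, chunk: Iterable[int]) -> None:
        self.entries.extend(chunk)


def process_stream(frames: List[int], chunk_size: int, window: int) -> List[Tuple[int, int]]:
    """Feed frames chunk by chunk; record (context size seen, cache size after)."""
    cache = SlidingWindowCache(window)
    trace = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        context_len = len(cache.entries)  # cached entries this chunk could attend to
        cache.extend(chunk)               # append new entries, evicting the oldest
        trace.append((context_len, len(cache.entries)))
    return trace


trace = process_stream(list(range(10)), chunk_size=4, window=6)
```

The cache size never exceeds `window` regardless of stream length, which is the memory bound the summary refers to; the cost is that frames older than the window are no longer visible to later chunks.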

AI Summary: This RFC outlines the Q2 roadmap for the Qwen-Omni family of models, focusing on improving performance metrics like time-to-first-token/audio and enhancing support for long, multi-turn, and streaming sessions. Key goals include achieving production parity with upstream vLLM scheduling features such as prefix caching and chunked prefill, and expanding model and feature support.

Complexity: 4/5
help wanted good first issue high priority

AI Summary: This RFC proposes a reorganization of the benchmark layout to improve discoverability, reusability, and onboarding for new contributors. The core idea is to group benchmarks by modality first, standardize the placement of scripts, configs, and documentation, and create a clear pattern for adding new model benchmarks.

Complexity: 3/5
help wanted high priority diffusion tts

help wanted good first issue

[New Model]: OmniWeaving about 1 month ago

AI Summary: This issue proposes adding support for the OmniWeaving model to vLLM-Omni. OmniWeaving is a multimodal model composed of a Qwen2.5-VL based MLLM and a HunyuanVideo 1.5 based MMDiT, with the latter already supported.

Complexity: 3/5
help wanted good first issue new model

[New Model]: Ovi 1.1 about 1 month ago

AI Summary: This issue proposes adding support for a new model called "Ovi 1.1" to the vllm-omni library. The closest existing supported model family is "WAN". The request is tagged as "help wanted" and "good first issue", indicating it's likely straightforward to implement.

Complexity: 2/5
help wanted good first issue new model

enhancement help wanted good first issue new model high priority

AI Summary: This feature request proposes adding the ability to trigger model-specific performance tests in vLLM-Omni using tags like PR labels, comments, or commit tags. This would allow for more flexible and on-demand benchmarking of different models without manual CI pipeline adjustments.

Complexity: 3/5
help wanted

AI Summary: This RFC proposes restructuring the vLLM-Omni test suite to improve maintainability and ergonomics. Key changes include splitting the monolithic root `conftest.py` into scope-specific files, moving support modules out of `conftest`, and reducing duplicated code. The goal is to make tests easier to reason about, reduce startup costs, and improve isolation.

Complexity: 4/5
help wanted high priority roadmap

AI Summary: A bug is reported where an `AttributeError: 'FastAPI' object has no attribute 'state'` occurs after the instance is shut down. The traceback indicates the error happens during the FastAPI HTTP server shutdown process within the vLLM-Omni project.

Complexity: 3/5
bug help wanted

help wanted good first issue NPU high priority diffusion

help wanted new model

help wanted new model

AI Summary: This issue requests community help to establish a performance baseline for the `fish-speech-S2` model within `vllm-omni`. The goal is to benchmark its current performance against a reference like `sglang-omni` to identify specific bottlenecks in areas like scheduling, memory management, or operator efficiency, guiding future optimizations.

Complexity: 2/5
help wanted good first issue tts

AI Summary: This issue proposes adding support for LongCat-AudioDiT, a zero-shot Text-to-Speech (TTS) model from Meituan. Unlike existing models, it operates directly in a waveform latent space using a Wav-VAE and a DiT diffusion backbone, bypassing traditional mel-spectrograms and vocoders. The model reportedly achieves state-of-the-art speaker similarity and supports multiple languages via its UMT5 text encoder.

Complexity: 4/5
help wanted good first issue new model tts

AI Summary: The user wants to know when the `vllm-omni` tool will support the `wan2.2-fun-a14b-inp` model. This is a feature request to add support for a specific new model.

Complexity: 2/5
good first issue new model

AI Summary: This issue proposes adding support for the VoXtream2 Text-to-Speech model to vllm-omni. VoXtream2 is a smaller, zero-shot TTS model with dynamic speaking-rate control and a multi-stage transformer architecture. While it shares similarities with existing supported models like Qwen3-TTS, it utilizes a custom codec and a distinct 3-stage pipeline that will require specific implementation.

Complexity: 3/5
help wanted good first issue new model

AI Summary: This RFC proposes to unify Rotary Position Embedding (RoPE) implementations across various models in vllm-omni. Currently, there's code duplication and inconsistent performance due to different implementation patterns, some of which are inefficient. The goal is to leverage optimized implementations already present in vLLM core and apply consistent, efficient patterns to all models.

Complexity: 3/5
help wanted
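
For readers unfamiliar with the operation being unified, here is a minimal pure-Python sketch of rotary position embedding on a single vector. It is a reference illustration only and not the vLLM core implementation: `rope_angles` and `apply_rope` are hypothetical names, the interleaved pairing of dimensions (0,1), (2,3), ... is one of several conventions in use, and the base of 10000 is the commonly used default rather than a value taken from the RFC.

```python
import math
from typing import List


def rope_angles(position: int, dim: int, base: float = 10000.0) -> List[float]:
    """One rotation angle per pair of dimensions at the given position."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def apply_rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by its angle."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out: List[float] = []
    for i, theta in enumerate(rope_angles(position, dim, base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# Attention scores depend only on the relative offset between positions.
score_a = dot(apply_rope(q, 3), apply_rope(k, 1))
score_b = dot(apply_rope(q, 5), apply_rope(k, 3))
```

A shared helper of this shape gives every model the same pairing convention and lets an optimized kernel be swapped in behind one call site, which is the kind of consolidation the RFC summary describes.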

enhancement help wanted good first issue high priority

[RFC]: Exit on OOM about 2 months ago
enhancement help wanted

AI Summary: This issue proposes adding support for Microsoft's VibeVoice TTS models, specifically VibeVoice-Realtime-0.5B for streaming and VibeVoice-TTS-1.5B for long-form multi-speaker generation. The integration leverages existing vLLM infrastructure and the models' next-token diffusion architecture.

Complexity: 3/5
help wanted good first issue new model tts

enhancement help wanted good first issue

AI Summary: This issue proposes integrating the Covo-Audio-Chat model, an end-to-end audio LLM, into vllm-omni. The non-duplex version can be handled as a two-stage pipeline, similar to existing Qwen2.5-Omni integrations, while the full-duplex variant would require additional streaming support.

Complexity: 3/5
help wanted good first issue new model

[New Model]: Kimi-Audio-7B about 2 months ago

AI Summary: This issue requests the addition of the Kimi-Audio-7B model to the vllm-omni library. The user has identified Qwen3-TTS as the closest currently supported model. The request is tagged as 'help wanted' and 'good first issue', suggesting it's intended to be straightforward.

Complexity: 2/5
help wanted good first issue new model

enhancement help wanted good first issue

help wanted good first issue

enhancement help wanted good first issue
