A lightweight, configurable, and real-time simulator designed to mimic the behavior of vLLM without the need for GPUs or running actual heavy models.

incubating
10 Open Issues Need Help
Last updated: Mar 17, 2026

Open Issues Need Help

AI Summary: This issue proposes a new configuration flag, `--force-dummy-tokenizer`, for the simulator. When the flag is set, the simulator uses the dummy tokenizer and skips loading the real tokenizer, even when a real model is specified.

Complexity: 2/5
good first issue
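
A minimal sketch of how such a flag could be wired up with Go's standard `flag` package. Only the flag name `--force-dummy-tokenizer` comes from the issue; the `config` struct, the `--model` flag, `parseArgs`, `tokenizerKind`, and the rule that a non-empty model name would otherwise select the real tokenizer are assumptions made for illustration, not the simulator's actual code.

```go
package main

import (
	"flag"
	"fmt"
)

// config holds the two settings relevant to tokenizer selection.
// Both field names are hypothetical.
type config struct {
	Model               string
	ForceDummyTokenizer bool
}

// parseArgs parses command-line style arguments into a config.
func parseArgs(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("simulator", flag.ContinueOnError)
	fs.StringVar(&c.Model, "model", "", "model name (hypothetical flag)")
	fs.BoolVar(&c.ForceDummyTokenizer, "force-dummy-tokenizer", false,
		"always use the dummy tokenizer, even for a real model")
	err := fs.Parse(args)
	return c, err
}

// tokenizerKind reports which tokenizer would be selected.
// Assumption: without the flag, a non-empty model name means "real".
func tokenizerKind(c config) string {
	if c.ForceDummyTokenizer || c.Model == "" {
		return "dummy"
	}
	return "real"
}

func main() {
	c, err := parseArgs([]string{"--model", "some-org/some-model", "--force-dummy-tokenizer"})
	if err != nil {
		panic(err)
	}
	fmt.Println(tokenizerKind(c))
}
```

Keeping the decision in one small function makes the override easy to test separately from tokenizer loading.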

Go
#incubating

AI Summary: This issue addresses a bug in echo mode: structured content in chat/completions requests (such as text and image URL blocks) is not serialized properly, so responses come back empty. The proposed fix adds a method that serializes all block types into a readable, newline-separated string and updates the echo mode function to use it.

Complexity: 2/5
enhancement, good first issue
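
The serialization step described in this summary might look roughly like the sketch below. The `ContentBlock` and `ImageURL` types mirror the OpenAI-style content block shape (`text` and `image_url`) and are assumptions here, as are the function name `serializeBlocks` and the bracketed placeholder for unknown block types; the simulator's real types may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// ImageURL is the payload of an image block (assumed shape).
type ImageURL struct {
	URL string `json:"url"`
}

// ContentBlock is one element of a structured message content array.
type ContentBlock struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *ImageURL `json:"image_url,omitempty"`
}

// serializeBlocks renders every block as one line and joins the
// lines with newlines, so echo mode has a non-empty string to return.
func serializeBlocks(blocks []ContentBlock) string {
	parts := make([]string, 0, len(blocks))
	for _, b := range blocks {
		switch b.Type {
		case "text":
			parts = append(parts, b.Text)
		case "image_url":
			if b.ImageURL != nil {
				parts = append(parts, b.ImageURL.URL)
			}
		default:
			// Unknown block types are kept visible rather than dropped.
			parts = append(parts, "["+b.Type+"]")
		}
	}
	return strings.Join(parts, "\n")
}

func main() {
	blocks := []ContentBlock{
		{Type: "text", Text: "What is in this image?"},
		{Type: "image_url", ImageURL: &ImageURL{URL: "https://example.com/cat.png"}},
	}
	fmt.Println(serializeBlocks(blocks))
}
```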

AI Summary: This issue requests that the `/v1/models` API endpoint populate the `root` field of the response with the LoRA model's path for LoRA models. Currently, the response omits this information.

Complexity: 1/5
good first issue
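
As a rough illustration of the requested behavior, the sketch below builds a model list in which each LoRA entry carries its path in `root`. The `ModelCard` and `LoRA` types, the `listModels` helper, the `parent` field, and the choice to set the base model's `root` to its own name are all assumptions for the example; only the `root` field and its LoRA path value come from the issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ModelCard is a trimmed-down entry of a /v1/models response (assumed shape).
type ModelCard struct {
	ID     string  `json:"id"`
	Object string  `json:"object"`
	Root   string  `json:"root"`
	Parent *string `json:"parent"`
}

// LoRA describes a loaded adapter: its served name and its path.
type LoRA struct {
	Name string
	Path string
}

// listModels returns one card for the base model and one per LoRA.
// For LoRA entries, Root is the adapter path, as the issue requests.
func listModels(base string, loras []LoRA) []ModelCard {
	cards := []ModelCard{{ID: base, Object: "model", Root: base}}
	for _, l := range loras {
		parent := base // fresh variable per entry so each pointer is distinct
		cards = append(cards, ModelCard{
			ID:     l.Name,
			Object: "model",
			Root:   l.Path,
			Parent: &parent,
		})
	}
	return cards
}

func main() {
	cards := listModels("base-model", []LoRA{{Name: "sql-lora", Path: "/adapters/sql-lora"}})
	out, err := json.MarshalIndent(cards, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```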

AI Summary: This issue proposes adding tests that verify metrics are collected and reported correctly when the simulator is used with gRPC.

Complexity: 2/5
good first issue

AI Summary: This issue proposes differentiating the response IDs generated by the `/chat/completions` and `/completions` API endpoints. Specifically, the ID prefix for `/completions` responses should change from its current value to "cmpl-" so that the two response types can be told apart.

Complexity: 2/5
good first issue
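
One way to sketch the prefix split is shown below. The "cmpl-" prefix comes from the issue; the "chatcmpl-" prefix for chat responses follows the common OpenAI-style convention and is an assumption here, as are the function name `newResponseID` and the random hex suffix.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

const (
	// chatCompletionIDPrefix is assumed; the issue does not state it.
	chatCompletionIDPrefix = "chatcmpl-"
	// textCompletionIDPrefix is the value requested in the issue.
	textCompletionIDPrefix = "cmpl-"
)

// newResponseID returns a prefixed ID with a random 32-character hex suffix.
func newResponseID(isChat bool) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	prefix := textCompletionIDPrefix
	if isChat {
		prefix = chatCompletionIDPrefix
	}
	return prefix + hex.EncodeToString(buf), nil
}

func main() {
	for _, isChat := range []bool{true, false} {
		id, err := newResponseID(isChat)
		if err != nil {
			panic(err)
		}
		fmt.Println(id)
	}
}
```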

good first issue

AI Summary: This issue proposes a new command-line parameter, `--max-model-len`, for the vLLM simulator. The parameter defines the model's maximum context window size in tokens. Requests that exceed this limit should return a 400 Bad Request error with a message indicating that the context length was exceeded.

Complexity: 4/5
enhancement, good first issue
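
The limit check could be sketched as follows. The function name `checkContextLength`, the decision to count prompt tokens plus requested completion tokens against the limit, treating a non-positive limit as "no limit", and the wording of the error message are all assumptions for illustration; the issue specifies only the 400 status and that the message should indicate the exceeded context length.

```go
package main

import (
	"fmt"
	"net/http"
)

// checkContextLength returns an HTTP status and an error message.
// It rejects requests whose prompt tokens plus requested completion
// tokens exceed maxModelLen. A non-positive maxModelLen disables the check.
func checkContextLength(promptTokens, maxTokens, maxModelLen int) (int, string) {
	total := promptTokens + maxTokens
	if maxModelLen > 0 && total > maxModelLen {
		// Hypothetical message wording, not the simulator's actual text.
		msg := fmt.Sprintf(
			"maximum context length is %d tokens, but the request needs %d tokens",
			maxModelLen, total)
		return http.StatusBadRequest, msg
	}
	return http.StatusOK, ""
}

func main() {
	status, msg := checkContextLength(900, 200, 1024)
	fmt.Println(status, msg)

	status, msg = checkContextLength(100, 50, 1024)
	fmt.Println(status, msg)
}
```

Returning the status and message from a pure function keeps the rule testable without starting an HTTP server.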
