A lightweight, configurable, and real-time simulator designed to mimic the behavior of vLLM without the need for GPUs or running actual heavy models.

Status: incubating · 6 open issues need help · Last updated: Mar 17, 2026



AI Summary: This issue proposes differentiating the response IDs generated by the `/chat/completions` and `/completions` API endpoints. Specifically, the prefix for `/completions` response IDs should be changed from its current value to "cmpl-", so responses from the two endpoints can be uniquely identified.

Complexity: 2/5
good first issue
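One way the requested change could look is sketched below. The helper name `responseID` and the random-suffix scheme are illustrative assumptions, not the simulator's actual code; the prefixes follow the OpenAI-style convention of "chatcmpl-" for chat completions and "cmpl-" for text completions.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// responseID returns a unique identifier with an endpoint-specific prefix,
// so /chat/completions and /completions responses can be told apart.
// (Hypothetical helper: "chatcmpl-" for chat, "cmpl-" for text completions.)
func responseID(prefix string) string {
	b := make([]byte, 16)
	rand.Read(b) // crypto/rand.Read does not fail on supported platforms
	return prefix + hex.EncodeToString(b)
}

func main() {
	fmt.Println(responseID("chatcmpl-")) // ID for a /chat/completions response
	fmt.Println(responseID("cmpl-"))     // ID for a /completions response
}
```

With distinct prefixes, a client or log reader can tell which endpoint produced a given response without any extra bookkeeping.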


AI Summary: Implement a new command-line parameter, `--max-model-len`, in the vLLM simulator. This parameter will define the maximum context window size (in tokens) for the model. Requests exceeding this limit should return a 400 Bad Request error with a specific error message indicating the context length exceeded.

Complexity: 4/5
enhancement good first issue
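A minimal sketch of the proposed behavior is shown below. The flag wiring and the `validateContextWindow` helper are assumptions for illustration, not the simulator's implementation; the error text is modeled on the style of vLLM's context-length error, and the fixed token counts in the handler stand in for real tokenization of the request body.

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// maxModelLen mirrors the proposed --max-model-len parameter: the maximum
// context window, in tokens, that the simulated model accepts.
var maxModelLen = flag.Int("max-model-len", 1024, "maximum model context length in tokens")

// validateContextWindow rejects requests whose prompt plus requested
// completion tokens exceed the configured context window.
// (Hypothetical helper; the message wording imitates vLLM's style.)
func validateContextWindow(promptTokens, completionTokens int) error {
	total := promptTokens + completionTokens
	if total > *maxModelLen {
		return fmt.Errorf(
			"maximum context length is %d tokens, but %d tokens were requested (%d in the prompt, %d in the completion)",
			*maxModelLen, total, promptTokens, completionTokens)
	}
	return nil
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Token counts would normally come from tokenizing the request body;
	// fixed values stand in here, and 2000+100 exceeds the default 1024.
	if err := validateContextWindow(2000, 100); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest) // 400 Bad Request
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	flag.Parse()
	// Exercise the handler in-process instead of starting a real server.
	req := httptest.NewRequest(http.MethodPost, "/v1/completions", nil)
	rec := httptest.NewRecorder()
	handler(rec, req)
	fmt.Println(rec.Code) // prints 400
}
```

Returning 400 with a descriptive message matches what clients of a real vLLM server would see when a request overflows the context window, which is the point of simulating the limit.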
