Open Issues Needing Help
Add vLLM or exllama optimization (opened about 1 month ago)
AI Summary: Integrate either the exllamav2 or vLLM inference engine into the existing Offline-Voice-LLM-assistant project to significantly reduce inference latency (currently around 5 seconds per response). This involves adding the chosen engine's dependencies, modifying the generation code to use its API, and benchmarking the result against the current baseline (see the sketch after this entry).
Complexity: 4/5
good first issue
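
The summary above amounts to swapping the assistant's current generation call for a faster engine. A minimal sketch of the vLLM route follows; the model id, sampling settings, and the `generate_reply` helper name are illustrative assumptions, not taken from the repository.

```python
# Sketch: replacing the assistant's generation step with vLLM.
# Assumptions (not from the repo): the model id and the
# generate_reply() helper name are placeholders for illustration.
from vllm import LLM, SamplingParams

# Load once at startup: vLLM keeps the weights resident and uses paged
# attention and continuous batching, which is where most of the latency
# win over a plain per-request generate() loop comes from.
llm = LLM(model="HuggingFaceTB/SmolLM3-3B")  # assumed SmolLM3-class model

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

def generate_reply(prompt: str) -> str:
    """Drop-in replacement for the existing generation call."""
    outputs = llm.generate([prompt], sampling)
    return outputs[0].outputs[0].text
```

The exllamav2 route is similar in spirit but uses its own quantized-weight loader and generator classes; either way, the issue's acceptance test is measuring end-to-end response time against the current ~5-second baseline.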
Running small but capable language models entirely offline
Python
#bitsandbytes #chatbot #edge-ai #huggingface #llm #local-ai #offline-ai #quantization #smollm3 #speech-recognition #voice-assistant #vosk