
vLLM



vLLM is an easy way to spin up Large Language Models locally. It handles a lot of things automatically, including downloading models from HuggingFace.

Embedding server

You can start an embedding server like this:

uv run --python cpython-3.12.11-linux-x86_64-gnu --with vllm -- vllm serve --task embedding Qwen/Qwen3-Embedding-0.6B
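
Once the server is up, you can request embeddings over vLLM's OpenAI-compatible HTTP API. The sketch below (using the requests library) assumes the defaults of localhost, port 8000, and the /v1/embeddings route; adjust it if you passed --host or --port.

import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "Qwen/Qwen3-Embedding-0.6B",
        "input": ["vLLM makes local embedding servers easy", "models are pulled from HuggingFace"],
    },
)
resp.raise_for_status()

# Each item carries the embedding vector and the index of the input it belongs to.
for item in resp.json()["data"]:
    print(item["index"], len(item["embedding"]), item["embedding"][:5])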

To limit memory usage, you can set a maximum context size. You can also get a decent speedup by disabling the stdout logging vLLM does by default.

VLLM_CONFIGURE_LOGGING=0 uv run --python cpython-3.12.11-linux-x86_64-gnu --with vllm -- vllm serve --task embedding Qwen/Qwen3-Embedding-0.6B --gpu-memory-utilization 0.25 --max-model-len 2048 --disable-uvicorn-access-log
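
As a quick sanity check, here is a minimal sketch that embeds two short strings and compares them with cosine similarity, again assuming the default port and route. Keep inputs under the --max-model-len you configured, since longer inputs will typically be rejected.

import math
import requests

def embed(texts):
    # Hypothetical helper: returns one embedding vector per input string.
    resp = requests.post(
        "http://localhost:8000/v1/embeddings",
        json={"model": "Qwen/Qwen3-Embedding-0.6B", "input": texts},
    )
    resp.raise_for_status()
    return [d["embedding"] for d in resp.json()["data"]]

a, b = embed(["vLLM serves embeddings", "a local embedding server"])
cos = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
print(f"cosine similarity: {cos:.3f}")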

Citation

If you find this work useful, please cite it as:
@article{yaltirakli,
  title   = "vLLM",
  author  = "Yaltirakli, Gokberk",
  journal = "gkbrk.com",
  year    = "2025",
  url     = "https://www.gkbrk.com/vllm"
}


© 2025 Gokberk Yaltirakli