vLLM is an easy way to spin up Large Language Models locally. It handles a lot of things automatically, including downloading models from HuggingFace.
Embedding server
You can start an embedding server like this:
uv run --python cpython-3.12.11-linux-x86_64-gnu --with vllm -- vllm serve --task embedding Qwen/Qwen3-Embedding-0.6B
To limit memory usage, you can cap the GPU memory utilization and set a maximum context size. You can also get a decent speedup by disabling all the stdout logging vLLM does.
VLLM_CONFIGURE_LOGGING=0 uv run --python cpython-3.12.11-linux-x86_64-gnu --with vllm -- vllm serve --task embedding Qwen/Qwen3-Embedding-0.6B --gpu-memory-utilization 0.25 --max-model-len 2048 --disable-uvicorn-access-log
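Once the server is running it exposes an OpenAI-compatible API. As a quick sanity check (assuming vLLM's default port of 8000), you can post to the /v1/embeddings endpoint:

curl http://localhost:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen3-Embedding-0.6B", "input": "hello world"}'

This should return a JSON response with the vector under data[0].embedding, following the OpenAI embeddings response format.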