
vLLM



vLLM is an easy way to spin up Large Language Models locally. It handles a lot of things automatically, including downloading models from HuggingFace.

Embedding server

You can start an embedding server like this:

uv run --python cpython-3.12.11-linux-x86_64-gnu --with vllm -- vllm serve --task embedding Qwen/Qwen3-Embedding-0.6B
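
Once the server is up, you can request embeddings over vLLM's OpenAI-compatible HTTP API. The sketch below (using the requests library) assumes the defaults of localhost, port 8000, and the /v1/embeddings route; adjust it if you passed --host or --port.

import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "Qwen/Qwen3-Embedding-0.6B",
        "input": ["vLLM makes local embedding servers easy", "models are pulled from HuggingFace"],
    },
)
resp.raise_for_status()

# Each item carries the embedding vector and the index of the input it belongs to.
for item in resp.json()["data"]:
    print(item["index"], len(item["embedding"]), item["embedding"][:5])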

To limit memory usage, you can set a maximum context size. You can also get a decent speedup by disabling the stdout logging vLLM does by default.

VLLM_CONFIGURE_LOGGING=0 uv run --python cpython-3.12.11-linux-x86_64-gnu --with vllm -- vllm serve --task embedding Qwen/Qwen3-Embedding-0.6B --gpu-memory-utilization 0.25 --max-model-len 2048 --disable-uvicorn-access-log
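
As a quick sanity check, here is a minimal sketch that embeds two short strings and compares them with cosine similarity, again assuming the default port and route. Keep inputs under the --max-model-len you configured, since longer inputs will typically be rejected.

import math
import requests

def embed(texts):
    # Hypothetical helper: returns one embedding vector per input string.
    resp = requests.post(
        "http://localhost:8000/v1/embeddings",
        json={"model": "Qwen/Qwen3-Embedding-0.6B", "input": texts},
    )
    resp.raise_for_status()
    return [d["embedding"] for d in resp.json()["data"]]

a, b = embed(["vLLM serves embeddings", "a local embedding server"])
cos = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
print(f"cosine similarity: {cos:.3f}")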

Citation

If you find this work useful, please cite it as:
@article{yaltirakli,
  title   = "vLLM",
  author  = "Yaltirakli, Gokberk",
  journal = "gkbrk.com",
  year    = "2025",
  url     = "https://www.gkbrk.com/vllm"
}


© 2025 Gokberk Yaltirakli