How to serve Kimi-K2-Instruct on Lambda with vLLM
When your model doesn’t fit on a single GPU, you suddenly need to target multiple GPUs on a single machine, configure a serving stack that actually uses all that hardware, and know whether your setup is efficient enough for production.
One such case is Kimi K2 by Moonshot AI, a one-trillion-parameter Mixture-of-Experts (MoE) language model with very strong coding, writing, and reasoning capabilities. With 959 GB of weights on disk and more than a terabyte of GPU memory needed just to hold the model, running it at home is impractical for most users. Instead of waiting for local hardware to catch up, you can run it today on an 8× NVIDIA Blackwell GPU instance on Lambda using vLLM.
In this post, you’ll learn how to deploy Kimi-K2-Instruct on Lambda using vLLM for efficient multi‑GPU inference in four steps:
- Spin up an 8× NVIDIA Blackwell GPU instance on Lambda
- Start a vLLM deployment server with a single setup block
- Run a reproducible benchmark against that server
- Share the exact model and image references so you can reproduce this setup
1. Model snapshot (on Hugging Face)
- Model name: Kimi-K2-Instruct
- Author: moonshotai
- Primary capabilities: High-capacity MoE LLM optimized for fast reasoning, long-context understanding, and robust coding and tool-use performance.
- License: MIT ⚖️
2. Stats that matter
- Context window: 128K
- Weights on disk: 959 GB
- Idle VRAM usage: 1,347 GB
- Recommended Lambda GPU configurations (see the quick memory check after this list):
  - 8× NVIDIA Blackwell GPUs (on-demand or 1-Click Cluster)
  - 16× NVIDIA H100 GPUs (1-Click Cluster)
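As a quick sanity check on memory: each B200 provides about 180 GB of HBM3e, so an 8× B200 node offers roughly 8 × 180 GB = 1,440 GB of aggregate VRAM, which covers the 1,347 GB idle footprint above. Once your instance is up, you can confirm per-GPU memory with nvidia-smi (a minimal check; exact output formatting varies by driver version):
# List each GPU and its total memory; expect roughly 180 GB per B200 across 8 GPUs.
nvidia-smi --query-gpu=index,name,memory.total --format=csv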
3. Get it running (one copy-paste block)
Device information:
- GPUs: 8× on-demand NVIDIA B200s
- Base image: Lambda Stack 22.04
pip install vllm
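# VLLM_SERVER_DEV_MODE=1 exposes development-only endpoints (including the
# /sleep and /wake_up routes used in the benchmark section below), and
# --enable-sleep-mode lets the server offload weights and free the KV cache on demand.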
VLLM_SERVER_DEV_MODE=1 vllm serve moonshotai/Kimi-K2-Instruct \
--port 8000 \
--served-model-name kimi-k2 \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser kimi_k2 \
--enable-sleep-mode
This exposes an OpenAI-compatible vLLM server on your node that you can send requests to. It is also the endpoint we benchmark to measure time to first token, throughput, and other key metrics.
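Before benchmarking, it's worth confirming the server responds. A minimal smoke test, assuming the server is running locally on port 8000 with the served model name kimi-k2 from the command above (the prompt and max_tokens are arbitrary):
# Check that the server reports healthy.
curl -s http://localhost:8000/health

# Send a small chat completion to the OpenAI-compatible endpoint.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "kimi-k2",
    "messages": [{"role": "user", "content": "Write a haiku about tensor parallelism."}],
    "max_tokens": 64
  }'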
4. Benchmark capsule
For benchmarking, we use the vllm bench serve command against the ShareGPT dataset and take measurements across five runs, with a complete spin-down between runs to avoid caching effects, after the GPUs have warmed up.
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--backend openai-chat \
--model moonshotai/Kimi-K2-Instruct \
--served-model-name kimi-k2 \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 10 \
--trust-remote-code \
--endpoint /v1/chat/completions
To "reset" the state each time, we activate sleep mode, which opens up endpoints to
/sleep?level=1and/wake_up?tags=weightsas POST endpoints on your server. E.g.,curl -X POST 'http://localhost:8000/sleep?level=1'. Level 1 will ensure that weights are offloaded to CPU RAM and the KV cache is discarded.
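Putting the pieces together, here is a minimal sketch of the measurement loop, assuming the server from step 3 is running locally on port 8000 (the five-run count mirrors the methodology above; error handling is omitted):
# Benchmark five times, fully resetting server state between runs.
for run in 1 2 3 4 5; do
  vllm bench serve \
  --backend openai-chat \
  --model moonshotai/Kimi-K2-Instruct \
  --served-model-name kimi-k2 \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10 \
  --trust-remote-code \
  --endpoint /v1/chat/completions

  # Offload weights to CPU RAM and discard the KV cache, then bring the server back.
  curl -X POST 'http://localhost:8000/sleep?level=1'
  curl -X POST 'http://localhost:8000/wake_up?tags=weights'
done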
Token throughput:
| Metric | Tokens per second |
|---|---|
| Output generation | 219.644 ± 0.497 |
| Total (Input & Output) | 332.358 ± 0.810 |
Fine-grained numbers:
| Metric | Mean (ms) | P99 (ms) |
|---|---|---|
| Time to First Token | 148.986 ± 6.619 | 173.242 ± 14.551 |
| Time per Output Token | 17.576 ± 0.070 | 19.712 ± 0.139 |
| Inter-token Latency | 16.294 ± 0.059 | 20.140 ± 0.218 |
Conclusion
You now have everything you need to deploy Kimi-K2-Instruct on a single Lambda 8× HGX B200 instance with vLLM. This same pattern applies to other large open models that no longer fit on a single GPU: get the right Lambda instance, run the vLLM server, then run a benchmark you can trust.
If you have questions or want to share your benchmark results, reach out to Zach at zach.mueller@lambda.ai.
References
- Weights: https://huggingface.co/moonshotai/Kimi-K2-Instruct
- Docker image: GPU Base 24.04 on the Lambda platform