TL;DR: token throughput
Measured on NVIDIA HGX B200, CUDA 12.8. 8192 in / 2048 out tokens, 32 concurrent requests.
vLLM
| Hardware | Gen. throughput | Per-user gen | Total throughput | TTFT (mean) | ITL (mean) |
| NVIDIA HGX B200 | 1,157.35 tok/s | 36.17 tok/s | 5,786.74 tok/s | 3,312.87 ms | 26.03 ms |
Benchmark command
(8192 in / 2048 out tokens, 32 parallel requests, 512 prompts. Measured on NVIDIA HGX B200 with CUDA 12.8.)
The benchmark uses a 4:1 input-to-output token ratio (8192 in / 2048 out per request) to simulate long-context coding and document-analysis workflows.
Benchmark configuration:
vllm bench serve \
--backend openai-chat \
--model moonshotai/Kimi-K2.7-Code \
--served-model-name kimi-k2.7-code \
--dataset-name random \
--random-input-len 8192 \
--random-output-len 2048 \
--num-prompts 512 \
--max-concurrency 32 \
--endpoint /v1/chat/completions
See Benchmarking Kimi-K2.7-Code below for the full results.
Background
Kimi-K2.7-Code is Moonshot AI's coding-focused member of the Kimi K2 family. Moonshot took the K2.6 checkpoint and specialized it for long-horizon software engineering through reinforcement learning on the read, plan, edit, run, and debug loop that real coding work demands, often stretched across hundreds of steps. They also tuned it to spend roughly 30% fewer reasoning tokens than K2.6 while holding or improving quality, which matters because output tokens dominate the cost of running a reasoning model over a long session. Thinking is always on and cannot be turned off, and the model carries its full reasoning forward across turns rather than discarding it after each one, so a number or plan it worked out earlier remains available later in the conversation.
The coding-specific tuning is evident across Moonshot's reported benchmarks, every one of which improves over K2.6. Kimi Code Bench v2 rises from 50.9 to 62.0, a 21.8% gain; Program Bench moves from 48.3 to 53.6; and MLS-Bench-Lite climbs from 26.7 to 35.1, up 31.5%.
Model specifications
Overview
- Name: Kimi-K2.7-Code
- Author: Moonshot AI
- Architecture: Mixture-of-Experts (DeepSeek-V3-style MoE backbone with Multi-head Latent Attention; multimodal
kimi_k25architecture with a MoonViT vision encoder) - License: Modified MIT
Specifications
- Total parameters: 1T total (32B active)
- Context window: 256k (262,144) tokens
Hardware requirements
- Minimal deployment:
- 1× NVIDIA HGX B200 (8× NVIDIA B200 GPU system) is required.
Deployment and benchmarking
Deploying Kimi-K2.7-Code
Kimi-K2.7-Code is served on a full NVIDIA HGX B200 with tensor parallelism across all 8 GPUs.
- Launch an instance from the Lambda Cloud Console using the GPU Base 24.04 image — NVIDIA HGX B200. These benchmarks were run with CUDA 12.8 (driver 570).
- Connect to your instance via SSH or the JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the inference server.
vLLM
CUDA version note: Kimi-K2.7-Code has no published vendor image. These benchmarks were run with the vLLM nightly
vllm/vllm-openai:nightly-e312c5cb25427e76fc3830ab14e7b6bc0963a55con a CUDA 12.8 host; substitute the current vLLM nightly if that tag is unavailable.
docker run -d --gpus all \
--ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:nightly-e312c5cb25427e76fc3830ab14e7b6bc0963a55c \
--model moonshotai/Kimi-K2.7-Code \
--served-model-name kimi-k2.7-code \
--tensor-parallel-size 8 \
--trust-remote-code \
--mm-encoder-tp-mode data \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--enable-auto-tool-choice
Verify the server
This launches an inference server with an OpenAI-compatible API on port 8000. Verify it:
curl -X GET http://localhost:8000/v1/models \
-H "Content-Type: application/json"
You should see kimi-k2.7-code listed in the response.
Benchmarking Kimi-K2.7-Code
Workload: 8192 input / 2048 output tokens, 512 prompts, 32 concurrent requests. Measured on NVIDIA HGX B200 with CUDA 12.8.
vLLM
NVIDIA HGX B200
Token throughput:
| Metric | Tokens per second |
| Output generation | 1,157.35 |
| Total (input & output) | 5,786.74 |
Latency (Mean / P99 in ms):
| Metric | Mean | P99 |
| Time to first token | 3,312.87 | 29,902.17 |
| Time per output token | 26.03 | 31.07 |
| Inter-token latency | 26.03 | 369.65 |
Next steps
Upstream
Downstream
Use as a noumena code backend
Use your self-hosted Kimi-K2.7-Code as the backend to noumena's code framework rather than their hosted models for local development. Replace <NODE_IP> with the IP of the node where the server is running.
Important note: make sure to set --served-model-name as my-kimi2.7-code to avoid hitting the reserved aliases.
git clone https://github.com/noumena-network/code.git
cd code
bun install
bun run build
OPENAI_API_KEY="dummy" \
OPENAI_BASE_URL="http://<NODE_IP>:8000/v1" \
OPENAI_MODEL="my-kimi2.7-code" \
./.tmp/packages/ncode-0.1.0-linux-x64/ncode \
--print \
--model my-kimi2.7-code \
--max-turns 1 \
"Reply exactly: ok"
Use as a Claude Code backend
Use your self-hosted Kimi-K2.7-Code instead of Anthropic's API for local development. Replace <NODE_IP> with the IP of the node where the vLLM server is running:
export ANTHROPIC_BASE_URL="http://<NODE_IP>:8000"
export ANTHROPIC_API_KEY="dummy"
export ANTHROPIC_MODEL="kimi-k2.7-code"
export ANTHROPIC_DEFAULT_SONNET_MODEL="kimi-k2.7-code"
export ANTHROPIC_DEFAULT_OPUS_MODEL="kimi-k2.7-code"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="kimi-k2.7-code"
export ANTHROPIC_SMALL_FAST_MODEL="kimi-k2.7-code"
export ANTHROPIC_FAST_MODEL="kimi-k2.7-code"
export DISABLE_TELEMETRY=1
export ENABLE_PROMPT_CACHING_1H=1
claude