TL;DR: token throughput
vLLM
| Hardware | Gen. throughput | Per-user gen | Total throughput | TTFT (mean) | ITL (mean) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA HGX B200 | 855 tok/s | 27 tok/s | 4,273 tok/s | 3,656 ms | 36 ms |
(8192 in / 2048 out tokens, 32 parallel requests, 512 prompts)
The benchmark uses a 4:1 input-to-output token ratio (8192 in / 2048 out per request) to simulate long-context coding and document-analysis workflows, in which large contexts are provided as input with substantial completions as output.
Benchmark configuration:
vllm bench serve \
--backend openai-chat \
--model deepseek-ai/DeepSeek-V4-Pro \
--served-model-name deepseek \
--dataset-name random \
--random-input-len 8192 \
--random-output-len 2048 \
--num-prompts 512 \
--max-concurrency 32 \
--endpoint /v1/chat/completions
See Benchmarking DeepSeek-V4-Pro for the full results.
Background
DeepSeek-V4-Pro is a 1.6T-parameter sparse Mixture-of-Experts (MoE) language model from DeepSeek-AI. It supports a native 1M-token context window and ships with three reasoning effort modes (Non-think, Think High, Think Max).
Several architectural changes drive the model's efficiency at long context vs. DeepSeek-V3.2:
- Hybrid attention. A combination of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). At a 1M-token context, DeepSeek-V4-Pro uses only 27% of single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2.
- Manifold-Constrained Hyper-Connections (mHC). A residual-connection refinement that improves signal propagation stability across the deep MoE stack.
- Muon optimizer. Used end-to-end during pre-training on 32T tokens for faster convergence and greater training stability.
- FP4 + FP8 mixed precision. DeepSeek-V4-Pro stores MoE expert parameters at FP4 while non-expert parameters use FP8, roughly halving the on-disk and in-VRAM footprint compared to a pure FP8 release (see the rough estimate after this list).
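As a rough back-of-the-envelope check of the halving claim, the sketch below assumes a ~1.55T / ~50B split between expert and non-expert parameters (an illustrative assumption, not a published figure) and ignores quantization scales and other overhead:
awk 'BEGIN {
  # assumed split: ~1.55T expert params at FP4 (0.5 bytes each) + ~50B other params at FP8 (1 byte each)
  mixed_gb = (1550e9 * 0.5 + 50e9 * 1.0) / 1e9
  fp8_gb   = 1600e9 * 1.0 / 1e9            # hypothetical pure-FP8 release
  printf "mixed FP4+FP8: ~%.0f GB, pure FP8: ~%.0f GB\n", mixed_gb, fp8_gb
}'
# mixed FP4+FP8: ~825 GB, pure FP8: ~1600 GB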
In Think Max mode, DeepSeek-V4-Pro leads open-weight coding benchmarks (LiveCodeBench Pass@1: 93.5, Codeforces Rating: 3,206) and is competitive with frontier closed-source models such as Claude Opus 4.6, GPT-5.4, and Gemini 3.1-Pro on reasoning and agentic tasks.
Model specifications
Overview
- Name: DeepSeek-V4-Pro
- Author: DeepSeek-AI
- Architecture: MoE with hybrid CSA + HCA attention
- License: MIT
Specifications
- Total parameters: 1.6T (49B active per forward pass)
- Context window: 1M tokens (native)
- Precision: FP4 (MoE experts) + FP8 (other parameters), mixed
- Reasoning modes: Non-think, Think High, Think Max
Hardware requirements
- Minimal deployment:
  - NVIDIA HGX B200 node
Deployment and benchmarking
Deploying DeepSeek-V4-Pro
DeepSeek-V4-Pro requires an NVIDIA HGX B200 node to load the full 1.6T-parameter model.
- Launch an instance with an NVIDIA HGX B200 from the Lambda Cloud Console using the GPU Base 24.04 image.
- Connect to your instance via SSH or the JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the vLLM server:
docker run -d --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-size 8 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True \
--max-model-len auto \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--gpu-memory-utilization .85
This launches a vLLM server with an OpenAI-compatible API on port 8000. Notable flags:
- --data-parallel-size 8 with --enable-expert-parallel runs the model under data parallelism with expert parallelism, which is the recommended topology for V4-Pro on 8× NVIDIA B200 GPUs.
- --kv-cache-dtype fp8 and --attention_config.use_fp4_indexer_cache=True keep KV-cache memory low enough to use the 1M-token context window.
- --tokenizer-mode deepseek_v4, --tool-call-parser deepseek_v4, and --reasoning-parser deepseek_v4 enable the V4 tokenizer, OpenAI-compatible tool calls, and the model's reasoning-mode output format.
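Because the 1.6T-parameter checkpoint takes a long time to download and load (hence VLLM_ENGINE_READY_TIMEOUT_S=3600), it can be useful to follow the container logs while the server starts, for example:
docker ps --filter ancestor=vllm/vllm-openai:deepseekv4-cu130 --format '{{.ID}}'
docker logs -f <CONTAINER_ID>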
- Verify the server is running:
curl -X GET http://localhost:8000/v1/models \
-H "Content-Type: application/json"
You should see deepseek-ai/DeepSeek-V4-Pro listed in the response.
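Optionally, send a minimal test request to the OpenAI-compatible chat endpoint (the prompt and max_tokens below are arbitrary):
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
        "max_tokens": 128
      }'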
Benchmarking DeepSeek-V4-Pro
Benchmarks were collected with vllm bench serve using an 8192-input / 2048-output token workload across 512 prompts at 32 concurrent requests.
vLLM
Token throughput (NVIDIA HGX B200):
| Metric | Tokens per second |
| --- | --- |
| Output generation | 855 |
| Total (input & output) | 4,273 |
Latency in ms (NVIDIA HGX B200):
| Metric | Mean | P99 |
| --- | --- | --- |
| Time to first token | 3,656 | 11,223 |
| Time per output token | 36 | 41 |
| Inter-token latency | 36 | 45 |
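As a quick consistency check, the per-user and total throughput figures follow from the concurrency and the request shape used in the benchmark:
awk 'BEGIN {
  # 855 tok/s of generation shared across 32 concurrent requests
  printf "per-user generation ~ %.0f tok/s\n", 855 / 32
  # total throughput counts prompt + output tokens: (8192 + 2048) / 2048 = 5x the output rate
  printf "expected total ~ %.0f tok/s\n", 855 * (8192 + 2048) / 2048
}'
# per-user generation ~ 27 tok/s
# expected total ~ 4275 tok/s (reported: 4,273 tok/s)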
Next steps
Use as a Claude Code backend
Use your self-hosted DeepSeek-V4-Pro instead of Anthropic's API for local development. Replace <NODE_IP> with the IP of the node where the vLLM server is running:
export ANTHROPIC_BASE_URL="http://<NODE_IP>:8000"
export ANTHROPIC_API_KEY="dummy"
export ANTHROPIC_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_SMALL_FAST_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_FAST_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export DISABLE_TELEMETRY=1
export ENABLE_PROMPT_CACHING_1H=1
claude
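If Claude Code can't reach the model, a quick connectivity check against the same models endpoint used earlier helps rule out networking issues:
curl -s http://<NODE_IP>:8000/v1/models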