TL;DR: token throughput
vLLM
| Hardware | Gen. throughput | Per-user gen | Total throughput | TTFT (mean) | ITL (mean) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA HGX B200 | 855 tok/s | 27 tok/s | 4,273 tok/s | 3,656 ms | 36 ms |
(8192 in / 2048 out tokens, 32 parallel requests, 512 prompts)
The benchmark uses a 4:1 input-to-output token ratio (8192 in / 2048 out per request) to simulate long-context coding and document-analysis workflows, in which large contexts are provided as input with substantial completions as output.
Benchmark configuration:
vllm bench serve \
--backend openai-chat \
--model deepseek-ai/DeepSeek-V4-Pro \
--served-model-name deepseek \
--dataset-name random \
--random-input-len 8192 \
--random-output-len 2048 \
--num-prompts 512 \
--max-concurrency 32 \
--endpoint /v1/chat/completions
See Benchmarking DeepSeek-V4-Pro for the full results.
Background
DeepSeek-V4-Pro is a 1.6T-parameter sparse Mixture-of-Experts (MoE) language model from DeepSeek-AI. It supports a native 1M-token context window and ships with three reasoning effort modes (Non-think, Think High, Think Max).
Several architectural changes drive the model's efficiency at long context vs. DeepSeek-V3.2:
- Hybrid attention. A combination of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). At a 1M-token context, DeepSeek-V4-Pro uses only 27% of single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2.
- Manifold-Constrained Hyper-Connections (mHC). A residual-connection refinement that improves signal propagation stability across the deep MoE stack.
- Muon optimizer. Used end-to-end during pre-training on 32T tokens for faster convergence and greater training stability.
- FP4 + FP8 mixed precision. DeepSeek-V4-Pro stores MoE expert parameters at FP4 while non-expert parameters use FP8, roughly halving the on-disk and in-VRAM footprint compared to a pure FP8 release (see the rough estimate after this list).
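As a rough back-of-the-envelope check of the halving claim, the sketch below assumes a ~1.55T / ~50B split between expert and non-expert parameters (an illustrative assumption, not a published figure) and ignores quantization scales and other overhead:
awk 'BEGIN {
  # assumed split: ~1.55T expert params at FP4 (0.5 bytes each) + ~50B other params at FP8 (1 byte each)
  mixed_gb = (1550e9 * 0.5 + 50e9 * 1.0) / 1e9
  fp8_gb   = 1600e9 * 1.0 / 1e9            # hypothetical pure-FP8 release
  printf "mixed FP4+FP8: ~%.0f GB, pure FP8: ~%.0f GB\n", mixed_gb, fp8_gb
}'
# mixed FP4+FP8: ~825 GB, pure FP8: ~1600 GB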
In Think Max mode, DeepSeek-V4-Pro leads open-weight coding benchmarks (LiveCodeBench Pass@1: 93.5, Codeforces Rating: 3,206) and is competitive with frontier closed-source models such as Claude Opus 4.6, GPT-5.4, and Gemini 3.1-Pro on reasoning and agentic tasks.
Model specifications
Overview
- Name: DeepSeek-V4-Pro
- Author: DeepSeek-AI
- Architecture: MoE with hybrid CSA + HCA attention
- License: MIT
Specifications
- Total parameters: 1.6T (49B active per forward pass)
- Context window: 1M tokens (native)
- Precision: FP4 (MoE experts) + FP8 (other parameters), mixed
- Reasoning modes: Non-think, Think High, Think Max
Hardware requirements
- Minimal deployment:
  - NVIDIA HGX B200 node
Deployment and benchmarking
Deploying DeepSeek-V4-Pro
DeepSeek-V4-Pro requires an NVIDIA HGX B200 node to load the full 1.6T-parameter model.
- Launch an instance with an NVIDIA HGX B200 from the Lambda Cloud Console using the GPU Base 24.04 image.
- Connect to your instance via SSH or the JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the vLLM server:
docker run -d --gpus all \
--privileged --ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Pro \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-size 8 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache=True \
--max-model-len auto \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--gpu-memory-utilization .85
This launches a vLLM server with an OpenAI-compatible API on port 8000. Notable flags:
- --data-parallel-size 8 with --enable-expert-parallel runs the model under data parallelism with expert parallelism, which is the recommended topology for V4-Pro on 8× NVIDIA B200 GPUs.
- --kv-cache-dtype fp8 and --attention_config.use_fp4_indexer_cache=True keep KV-cache memory low enough to use the 1M-token context window.
- --tokenizer-mode deepseek_v4, --tool-call-parser deepseek_v4, and --reasoning-parser deepseek_v4 enable the V4 tokenizer, OpenAI-compatible tool calls, and the model's reasoning-mode output format.
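Because the 1.6T-parameter checkpoint takes a long time to download and load (hence VLLM_ENGINE_READY_TIMEOUT_S=3600), it can be useful to follow the container logs while the server starts, for example:
docker ps --filter ancestor=vllm/vllm-openai:deepseekv4-cu130 --format '{{.ID}}'
docker logs -f <CONTAINER_ID>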
- Verify the server is running:
curl -X GET http://localhost:8000/v1/models \
-H "Content-Type: application/json"
You should see deepseek-ai/DeepSeek-V4-Pro listed in the response.
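Optionally, send a minimal test request to the OpenAI-compatible chat endpoint (the prompt and max_tokens below are arbitrary):
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
        "max_tokens": 128
      }'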
Benchmarking DeepSeek-V4-Pro
Benchmarks were collected with vllm bench serve using an 8192-input / 2048-output token workload across 512 prompts at 32 concurrent requests.
vLLM
Token throughput (NVIDIA HGX B200):
| Metric | Tokens per second |
| --- | --- |
| Output generation | 855 |
| Total (input & output) | 4,273 |
Latency in ms (NVIDIA HGX B200):
| Metric | Mean | P99 |
| --- | --- | --- |
| Time to first token | 3,656 | 11,223 |
| Time per output token | 36 | 41 |
| Inter-token latency | 36 | 45 |
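As a quick consistency check, the per-user and total throughput figures follow from the concurrency and the request shape used in the benchmark:
awk 'BEGIN {
  # 855 tok/s of generation shared across 32 concurrent requests
  printf "per-user generation ~ %.0f tok/s\n", 855 / 32
  # total throughput counts prompt + output tokens: (8192 + 2048) / 2048 = 5x the output rate
  printf "expected total ~ %.0f tok/s\n", 855 * (8192 + 2048) / 2048
}'
# per-user generation ~ 27 tok/s
# expected total ~ 4275 tok/s (reported: 4,273 tok/s)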
Next steps
Use as a Claude Code backend
Use your self-hosted DeepSeek-V4-Pro instead of Anthropic's API for local development. Replace <NODE_IP> with the IP of the node where the vLLM server is running:
export ANTHROPIC_BASE_URL="http://<NODE_IP>:8000"
export ANTHROPIC_API_KEY="dummy"
export ANTHROPIC_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_SMALL_FAST_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export ANTHROPIC_FAST_MODEL="deepseek-ai/DeepSeek-V4-Pro"
export DISABLE_TELEMETRY=1
export ENABLE_PROMPT_CACHING_1H=1
claude
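If Claude Code can't reach the model, a quick connectivity check against the same models endpoint used earlier helps rule out networking issues:
curl -s http://<NODE_IP>:8000/v1/models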