TL;DR: token throughput
SGLang (EAGLE speculative decoding):

| Hardware | Gen. throughput | Per-user gen | Total throughput | TTFT (mean) | ITL (mean) |
|---|---|---|---|---|---|
| NVIDIA HGX B200 (native FP4+FP8 build) | 1,222 tok/s | 38 tok/s | 11,000 tok/s | 1,701 ms | 66 ms |
| NVIDIA HGX H100 (FP8-quantized build) | 1,262 tok/s | 39 tok/s | 11,361 tok/s | 2,463 ms | 60 ms |

vLLM:

| Hardware | Gen. throughput | Per-user gen | Total throughput | TTFT (mean) | ITL (mean) |
|---|---|---|---|---|---|
| NVIDIA HGX B200 (native FP4+FP8 build) | 1,469 tok/s | 46 tok/s | 13,217 tok/s | 1,452 ms | 20 ms |
Benchmark command
(8192 in / 1024 out tokens, 32 parallel requests, 512 prompts. SGLang runs use EAGLE speculative decoding.)
The benchmark uses an 8:1 input-to-output token ratio (8192 in / 1024 out per request) to simulate long-context coding and document-analysis workflows.
Benchmark configuration:
vllm bench serve \
--backend openai-chat \
--model deepseek-ai/DeepSeek-V4-Flash \
--served-model-name deepseek \
--dataset-name random \
--random-input-len 8192 \
--random-output-len 1024 \
--num-prompts 512 \
--max-concurrency 32 \
--endpoint /v1/chat/completions
See Benchmarking DeepSeek-V4-Flash for the full results.
Background
DeepSeek-V4-Flash is a 284 billion parameter sparse Mixture-of-Experts (MoE) language model from DeepSeek-AI, with only 13 billion parameters active per forward pass. It is the smaller sibling of DeepSeek-V4-Pro (1.6T / 49B active) and ships with the same native 1 million token context window and the same three reasoning effort modes (Non-think, Think High, Think Max).
DeepSeek-V4-Flash inherits the V4 architecture stack:
- Hybrid attention. Compressed Sparse Attention (CSA) plus Heavily Compressed Attention (HCA), dramatically reducing single-token inference FLOPs and KV-cache memory at long context.
- Manifold-Constrained Hyper-Connections (mHC). Refined residual connections for stable signal propagation in deep MoE stacks.
- Muon optimizer. Used during pre-training on 32T tokens for faster convergence and stability.
- FP4 + FP8 mixed precision. MoE expert parameters are FP4 while non-expert parameters are FP8, roughly halving the on-disk and in-VRAM footprint compared to a pure FP8 release.
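As a rough footprint check (assuming the large majority of the 284B parameters sit in the MoE experts): a pure FP8 release stores about 284B × 1 byte ≈ 284 GB of weights, while keeping the expert majority in FP4 (0.5 bytes per parameter) lands near half of that, consistent with the ~146 GB and ~284 GB build sizes listed under Hardware requirements below.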
In Think Max mode, DeepSeek-V4-Flash reaches reasoning quality comparable to DeepSeek-V4-Pro on benchmarks like LiveCodeBench (91.6 Pass@1), HMMT 2026 Feb (94.8 Pass@1), and MMLU-Pro (86.2 EM), at a fraction of the deployment footprint. The Pro model retains an edge on raw knowledge tasks (Simple-QA, Chinese-SimpleQA) and the most complex agentic workflows.
Model specifications
Overview
- Name: DeepSeek-V4-Flash
- Author: DeepSeek-AI
- Architecture: MoE with hybrid CSA + HCA attention
- License: MIT
Specifications
- Total parameters: 284B (13B active per forward pass)
- Context window: 1M tokens (native)
- Precision: FP4 (MoE experts) + FP8 (other parameters), mixed
- Reasoning modes: Non-think, Think High, Think Max
Hardware requirements
DeepSeek-V4-Flash ships in two builds:
- Native FP4+FP8 mixed (`deepseek-ai/DeepSeek-V4-Flash`, ~146 GB on disk): requires NVIDIA B200 GPUs, which provide hardware support for the FP4 expert weights.
- FP8-only quantized (`sgl-project/DeepSeek-V4-Flash-FP8`, ~284 GB on disk): a quantized release for NVIDIA H100 GPUs, which lack FP4 support.
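Optionally, either build can be pre-fetched into the Hugging Face cache before the server is launched, so the first container start doesn't block on the download. A minimal sketch using the Hugging Face CLI (assuming it isn't already installed on the instance; the cache path matches the volume mounted into the containers below):

pip install -U "huggingface_hub[cli]"
# Native FP4+FP8 build (Blackwell):
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash
# FP8-quantized build (Hopper):
huggingface-cli download sgl-project/DeepSeek-V4-Flash-FP8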
Deployment and benchmarking
Deploying DeepSeek-V4-Flash
DeepSeek-V4-Flash can be served with vLLM or SGLang on Blackwell, or with SGLang on Hopper using the FP8-quantized build.
- Launch an instance from the Lambda Cloud Console using the GPU Base 24.04 image: NVIDIA HGX B200 (8× NVIDIA B200 GPUs) for the native build, or NVIDIA HGX H100 (8× NVIDIA H100 GPUs) for the FP8 build.
- Connect to your instance via SSH or the JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the inference server using one of the backends below.
SGLang on NVIDIA HGX B200 (native FP4+FP8 build):
docker run -d --gpus all \
--ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
lmsysorg/sglang:deepseek-v4-blackwell \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V4-Flash \
--served-model-name deepseek \
--tp 4 \
--host 0.0.0.0 --port 8000 \
--trust-remote-code \
--mem-fraction-static 0.9 \
--moe-runner-backend flashinfer_mxfp4 \
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--chunked-prefill-size 4096 \
--disable-flashinfer-autotune
`--moe-runner-backend flashinfer_mxfp4` enables the FlashInfer FP4 expert kernel (Blackwell-only). EAGLE speculative decoding is enabled for additional throughput.
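On first launch the container has to pull roughly 146 GB of weights before the server starts accepting requests. One way to follow progress (assuming the SGLang container is the most recently started container on the instance):

docker logs -f $(docker ps -lq)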
SGLang on NVIDIA HGX H100 (FP8-quantized build):
Important: Hopper has no FP4 hardware, so on H100 you must run the FP8-quantized release (`sgl-project/DeepSeek-V4-Flash-FP8`). Set `SGLANG_DSV4_FP4_EXPERTS=0` so SGLang doesn't attempt the FP4 kernel.
docker run -d --gpus all \
--ipc=host -p 8000:8000 \
-e SGLANG_DSV4_FP4_EXPERTS=0 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
lmsysorg/sglang:deepseek-v4-hopper \
python3 -m sglang.launch_server \
--model-path sgl-project/DeepSeek-V4-Flash-FP8 \
--served-model-name deepseek \
--tp 8 \
--host 0.0.0.0 --port 8000 \
--trust-remote-code \
--mem-fraction-static 0.85 \
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--chunked-prefill-size 4096 \
--disable-flashinfer-autotune
The FP8 build is roughly 2× the size of the native FP4+FP8 build, which is why TP=8 across all 8 NVIDIA H100 GPUs in the HGX system is the minimum on Hopper.
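Concretely, assuming 80 GB H100s: ~284 GB of FP8 weights split across 8 GPUs is roughly 35.5 GB per GPU, which fits inside the ~68 GB per-GPU budget implied by --mem-fraction-static 0.85 and leaves room for the KV cache; split across only 4 GPUs, the weights alone (~71 GB per GPU) would already exceed that budget.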
vLLM on NVIDIA HGX B200 (native FP4+FP8 build):
docker run -d --gpus all \
--ipc=host -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:deepseekv4-cu130 \
--model deepseek-ai/DeepSeek-V4-Flash \
--served-model-name deepseek \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--kv-cache-dtype fp8 \
--block-size 256 \
--max-model-len auto \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--tokenizer-mode deepseek_v4 \
--no-disable-hybrid-kv-cache-manager \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4
Notable flags:
- `--tensor-parallel-size 4` with `--enable-expert-parallel` shards attention/shared params 4-way and routes the 256 MoE experts across all 8 NVIDIA B200 GPUs.
- `--no-disable-hybrid-kv-cache-manager` keeps vLLM's hybrid KV-cache manager on, required for V4's CSA + HCA hybrid attention.
- `--kv-cache-dtype fp8` keeps KV-cache memory low enough to take advantage of the 1M-token context window.
- `--tokenizer-mode deepseek_v4`, `--tool-call-parser deepseek_v4`, and `--reasoning-parser deepseek_v4` enable the V4 tokenizer, OpenAI-compatible tool calls, and the model's reasoning-mode output format.
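To see the tool-call path in action once the server is up (verification step below), a standard OpenAI-style request with a tools array should come back with a parsed tool_calls entry rather than raw text. The function schema here is made up purely for illustration:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'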
Verify the server
Each of the commands above launches an inference server with an OpenAI-compatible API on port 8000. Verify it:
curl -X GET http://localhost:8000/v1/models \
-H "Content-Type: application/json"
You should see deepseek listed in the response.
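For an end-to-end generation check, send a small chat completion (the prompt and token limit here are arbitrary):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 128
  }'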
Benchmarking DeepSeek-V4-Flash
Workload: 8192 input / 1024 output tokens, 512 prompts, 32 concurrent requests.
SGLang on NVIDIA HGX B200 (native FP4+FP8 build, EAGLE speculative decoding)
Token throughput:
| Metric | Tokens per second |
|---|---|
| Output generation | 1,222 |
| Total (input & output) | 11,000 |
Latency (Mean / P99 in ms):
| Metric | Mean | P99 |
|---|---|---|
| Time to first token | 1,701 | 20,886 |
| Time per output token | 24.2 | 51.6 |
| Inter-token latency | 65.5 | 559.0 |
SGLang on NVIDIA HGX H100 (FP8-quantized build, EAGLE speculative decoding)
Token throughput:
| Metric | Tokens per second |
|---|---|
| Output generation | 1,262 |
| Total (input & output) | 11,361 |
Latency (Mean / P99 in ms):
| Metric | Mean | P99 |
|---|---|---|
| Time to first token | 2,463 | 32,221 |
| Time per output token | 22.7 | 54.9 |
| Inter-token latency | 60.3 | 624.4 |
vLLM on NVIDIA HGX B200 (native FP4+FP8 build)
Token throughput:
| Metric | Tokens per second |
|---|---|
| Output generation | 1,469 |
| Total (input & output) | 13,217 |
Latency (Mean / P99 in ms):
| Metric | Mean | P99 |
|---|---|---|
| Time to first token | 1,452 | 10,085 |
| Time per output token | 20.4 | 28.5 |
| Inter-token latency | 20.3 | 175.9 |
Next steps
Use as a Claude Code backend
Use your self-hosted DeepSeek-V4-Flash instead of Anthropic's API for local development. Replace <NODE_IP> with the IP of the node where the vLLM server is running:
export ANTHROPIC_BASE_URL="http://<NODE_IP>:8000"
export ANTHROPIC_API_KEY="dummy"
export ANTHROPIC_MODEL="deepseek"
export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek"
export ANTHROPIC_SMALL_FAST_MODEL="deepseek"
export ANTHROPIC_FAST_MODEL="deepseek"
export DISABLE_TELEMETRY=1
export ENABLE_PROMPT_CACHING_1H=1
claude
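With the environment variables exported, a quick non-interactive smoke test before starting a full session (assuming Claude Code is already installed and the server at <NODE_IP>:8000 is reachable):

claude -p "Reply with OK if you can read this."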