## TL;DR

| Backend | Hardware | Gen. throughput | TTFT | ITL |
|---|---|---|---|---|
| SGLang | 8× B200 | 1,269 tok/s | 1,943 ms | 23 ms |
| vLLM | 8× B200 | 1,268 tok/s | 5,024 ms | 20 ms |
## Benchmark command

Re-run the benchmark:
```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen3.5-397B-A17B \
  --served-model-name qwen \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --num-prompts 512 \
  --max-concurrency 32 \
  --endpoint /v1/chat/completions
```
(8192 in/1024 out tokens, 32 parallel requests)
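These flags fix the workload size exactly, so the total token count the benchmark pushes through the server follows from simple arithmetic on the values above:

```python
# Workload implied by the benchmark flags above.
num_prompts = 512                        # --num-prompts
input_len, output_len = 8192, 1024       # --random-input-len / --random-output-len
total_input = num_prompts * input_len    # prompt tokens
total_output = num_prompts * output_len  # generated tokens
print(total_input + total_output)        # 4718592 tokens processed in total
```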
## Background
Qwen3.5-397B-A17B is a 397 billion parameter multimodal vision-language model from Alibaba's Qwen team, featuring a hybrid Gated DeltaNet and Mixture-of-Experts (MoE) architecture. With only 17 billion parameters active per forward pass (23% fewer active parameters than the prior 235B model, despite 69% more total parameters), it achieves comparable or better performance at lower inference cost.
The model's improved performance is due to several key factors:
- Gated Delta Networks (GDN): A hybrid attention architecture alternating between linear attention (Gated DeltaNet) and full attention layers in a 3:1 ratio, reducing KV-cache memory by approximately 4×
- Scaling the MoE further: 512 experts (4× more than Qwen3's 128) with 10+1 active experts per token
- Multi-token prediction: Enables speculative decoding for 2-3× inference speedup
- Unified vision-language: Early fusion training on trillions of multimodal tokens
Qwen3.5 extends context to 256k tokens natively, making it well-suited for agentic workflows, long-context applications, and code analysis.
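The ~4× KV-cache saving follows directly from the 3:1 layer ratio: linear-attention layers keep a fixed-size recurrent state instead of a per-token KV cache, so only the full-attention layers grow with context. A back-of-the-envelope sketch (the layer count and per-layer cache unit are illustrative assumptions, not published figures):

```python
# Hybrid stack: 3 Gated DeltaNet (linear) layers per 1 full-attention layer.
layers = 48                      # illustrative total layer count
full_attn = layers // 4          # 1 in every 4 layers uses full attention
kv_unit = 1.0                    # arbitrary unit of per-layer, per-token KV memory

dense_cache = layers * kv_unit       # baseline: every layer caches KV
hybrid_cache = full_attn * kv_unit   # hybrid: only 1/4 of layers cache KV
print(dense_cache / hybrid_cache)    # 4.0, matching the ~4x reduction above
```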
## Model specifications

### Overview
- Name: Qwen3.5-397B-A17B
- Author: Alibaba Cloud
- Architecture: MoE
- License: Apache-2.0
### Specifications
- Total parameters: 397B (17B active per forward pass)
- Context window: 256k tokens
- Languages: 201 languages and dialects
### Hardware requirements

- Minimal deployment: NVIDIA HGX B200 (8× B200, 1.5 TB total GPU memory)
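A rough memory estimate shows why a full HGX B200 node is the floor. This sketch assumes BF16 weights (2 bytes/parameter); the served checkpoint's actual precision is not stated here:

```python
# Weight-memory estimate for Qwen3.5-397B-A17B (assumption: BF16 weights).
params = 397e9
bytes_per_param = 2                          # BF16 = 2 bytes per parameter
weight_gb = params * bytes_per_param / 1e9   # ~794 GB of weights alone
hgx_b200_gb = 8 * 192                        # 8x B200 at 192 GB HBM each = 1536 GB
print(weight_gb, hgx_b200_gb - weight_gb)    # ~742 GB headroom for KV cache etc.
```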
## Deployment and benchmarking

### Deploying Qwen3.5-397B-A17B

Qwen3.5 requires an NVIDIA HGX B200 to load the full 397B-parameter model.
- Launch an instance with NVIDIA HGX B200 from the Lambda Cloud Console using the GPU Base 24.04 image.
- Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the inference server, using either SGLang or vLLM.

With SGLang:

```bash
docker run \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e HF_HOME=/root/.cache/huggingface \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --served-model-name qwen \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --mamba-ssm-dtype float32 \
    --tp-size 8 \
    --trust-remote-code \
    --mem-fraction-static 0.85
```
With vLLM:

```bash
docker run \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e HF_HOME=/root/.cache/huggingface \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen3.5-397B-A17B \
  --served-model-name qwen \
  --tensor-parallel-size 8 \
  --trust-remote-code
```
This launches an inference server with an OpenAI-compatible API on port 8000.
- Verify the server is running:

```bash
curl -X GET http://localhost:8000/v1/models \
  -H "Content-Type: application/json"
```

You should see `qwen` listed in the response.
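With the server up, a quick smoke test of the chat endpoint helps too. A minimal sketch that builds an OpenAI-compatible request body (the prompt is illustrative; `qwen` matches the `--served-model-name` set above):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> str:
    """Build a JSON body for the OpenAI-compatible /v1/chat/completions endpoint."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

body = build_chat_request("qwen", "Say hello in one word.")
# POST this to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json, e.g. via: curl -d "$body" ...
print(body)
```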
### Benchmarking results: Qwen3.5-397B-A17B

#### SGLang

Token throughput:

| Metric | 8× B200 |
|---|---|
| Output gen (tok/s) | 1,269 |
| Total (tok/s) | 11,425 |

Latency (mean / P99, ms):

| Metric | 8× B200 |
|---|---|
| TTFT | 1,943 / 5,029 |
| TPOT | 23 / 25 |
| ITL | 23 / 35 |
#### vLLM

Token throughput:

| Metric | 8× B200 |
|---|---|
| Output gen (tok/s) | 1,268 |
| Total (tok/s) | 11,416 |

Latency (mean / P99, ms):

| Metric | 8× B200 |
|---|---|
| TTFT | 5,024 / 67,955 |
| TPOT | 20 / 21 |
| ITL | 20 / 166 |
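As a sanity check on these tables: the "Total" rows count prompt tokens as well as generated ones, so with 8,192-in/1,024-out requests, total throughput should be about 9× the generation throughput:

```python
# Relation between output-generation and total token throughput, using the
# first results set above (1,269 gen tok/s; 8,192-in / 1,024-out requests).
gen_tok_s = 1269
in_len, out_len = 8192, 1024
total_tok_s = gen_tok_s * (in_len + out_len) / out_len  # scale by tokens/request
print(round(total_tok_s))  # 11421, close to the measured 11,425 total tok/s
```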
## Next steps
### Verify tool-use with tau-bench

Before using the model in production, confirm it handles function calling correctly with openbench:

```bash
uv run --with "openbench[tau_bench]" bench eval tau_bench_retail \
  --model openai/qwen \
  -M base_url=http://localhost:8000/v1 \
  --limit 10
```
### Use as a Claude Code backend

Use your self-hosted model instead of Anthropic's API for local development:

```bash
export ANTHROPIC_BASE_URL=http://localhost:8000
claude
```