TL;DR: token throughput (SGLang)
| Hardware configuration | Generation throughput (tok/s) | Total throughput (tok/s) | TTFT (ms) | ITL (ms) | Prompts | Tokens in | Tokens out | Parallel requests |
|---|---|---|---|---|---|---|---|---|
| 8× NVIDIA B200 GPUs | 1,232 | 11,092 | 1,825 | 24 | 512 | 4,194,304 | 524,288 | 32 |
The benchmark we test with uses an 8:1 input-to-output token ratio (8192 in/1024 out per request) to simulate coding workflows, where large code contexts are provided as input with shorter completions as output. This differs from chat assistant workloads, which typically have more balanced or output-heavy ratios.
Benchmark configuration:
```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen3.5-397B-A17B \
  --served-model-name qwen \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --num-prompts 512 \
  --max-concurrency 32 \
  --endpoint /v1/chat/completions
```
See Benchmarking Qwen3.5 below for the full deployment and benchmarking steps.
Background
Qwen3.5-397B-A17B is a 397-billion-parameter multimodal vision-language model from Alibaba's Qwen team, built on a hybrid Gated DeltaNet and Mixture-of-Experts (MoE) architecture. Only 17 billion parameters are active per forward pass (23% fewer active parameters than the prior 235B model, despite 69% more total parameters), so it achieves comparable or better performance at lower inference cost.
The model's improved performance is due to several key factors:
- Gated Delta Networks (GDN): A hybrid attention architecture alternating between linear attention (Gated DeltaNet) and full attention layers in a 3:1 ratio, reducing KV-cache memory by approximately 4×
- Scaling the MoE further: 512 experts (4× as many as Qwen3's 128) with 10+1 (10 routed plus 1 shared) experts active per token
- Multi-token prediction: Enables speculative decoding for 2-3× inference speedup
- Unified vision-language: Early fusion training on trillions of multimodal tokens
Qwen3.5 extends context to 256k tokens natively, making it well-suited for agentic workflows, long-context applications, and code analysis.
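To make the hybrid layout concrete, here is a minimal illustrative sketch, not the actual Qwen3.5 implementation: the 48-layer count and function names are hypothetical, while the 3:1 attention ratio and the 512-expert/10+1 routing figures come from the description above.

```python
# Illustrative sketch only: a 3:1 Gated-DeltaNet-to-full-attention layer plan
# and top-k expert routing. Layer count and names are hypothetical.
def build_hybrid_layer_plan(num_layers: int = 48) -> list[str]:
    """Return a layer plan with 3 linear-attention blocks per full-attention block."""
    plan = []
    for i in range(num_layers):
        # Every 4th layer uses full (softmax) attention; the rest use Gated DeltaNet.
        plan.append("full_attention" if (i + 1) % 4 == 0 else "gated_deltanet")
    return plan

def count_active_experts(num_experts: int = 512, top_k: int = 10, shared: int = 1) -> int:
    """Each token is routed to top_k experts plus a shared expert (the '10+1' above)."""
    assert top_k + shared <= num_experts
    return top_k + shared

if __name__ == "__main__":
    plan = build_hybrid_layer_plan()
    print(plan[:8])                # ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention', ...]
    print(count_active_experts())  # 11 experts active per token
```

Because only the full-attention layers keep a conventional KV cache (the Gated DeltaNet layers carry a small fixed-size recurrent state instead), the 3:1 split is what yields the roughly 4× KV-cache reduction noted above.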
Model specifications
Overview
- Name: Qwen3.5-397B-A17B
- Author: Alibaba Cloud
- Architecture: MoE
- License: Apache-2.0
Specifications
- Total parameters: 397B (17B active per forward pass)
- Context window: 256k tokens
- Languages: 201 languages and dialects
Hardware Requirements
- Minimal deployment:
- NVIDIA HGX B200 (1.5 TB total GPU memory)
Deployment and benchmarking
Deploying Qwen3.5-397B-A17B
Qwen3.5 requires an NVIDIA HGX B200 system to load the full 397B-parameter model.
- Launch an instance with NVIDIA HGX B200 from the Lambda Cloud Console using the GPU Base 24.04 image.
- Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the SGLang server:
```bash
docker run \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e HF_HOME=/root/.cache/huggingface \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --served-model-name qwen \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --mamba-ssm-dtype float32 \
    --tp-size 8 \
    --trust-remote-code \
    --mem-fraction-static 0.85
```
This launches an SGLang server with an OpenAI-compatible API on port 8000. The --tp-size 8 flag shards the model across all eight B200 GPUs with tensor parallelism, and --mem-fraction-static 0.85 reserves 85% of GPU memory for model weights and the KV cache pool.
- Verify the server is running:
```bash
curl -X GET http://localhost:8000/v1/models \
  -H "Content-Type: application/json"
```
You should see Qwen listed in the response.
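For an end-to-end smoke test, you can also send a chat completion. The sketch below assumes the launch settings above (server on localhost:8000, served model name qwen, no API key configured) and the standard OpenAI-compatible chat completions schema:

```python
# Quick smoke test against the OpenAI-compatible endpoint started above.
# Requires the requests package (pip install requests).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen",  # matches --served-model-name from the launch command
        "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```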
Benchmarking Qwen3.5
You can benchmark Qwen using vllm bench serve. The benchmark results in this article use an 8192 input/1024 output token workload to simulate coding assistant patterns.
Here's a minimal example to run against your server (the vllm bench CLI ships with vLLM, e.g. pip install vllm, and can be run from any machine that can reach the endpoint):
```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen3.5-397B-A17B \
  --served-model-name qwen \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --num-prompts 512 \
  --max-concurrency 32 \
  --endpoint /v1/chat/completions
```
Token throughput (NVIDIA HGX B200):
| Metric | Tokens per second |
|---|---|
| Output generation | 1,232 |
| Total (input & output) | 11,092 |

Total throughput counts both prefill (input) and generated (output) tokens, so with the 8,192/1,024 token split per request it works out to roughly 9× the generation throughput.
Latency details (8× NVIDIA B200 GPUs):
| Metric | Mean (ms) | P99 (ms) |
|---|---|---|
| Time to first token | 1,825 | 5,008 |
| Time per output token | 24 | 26 |
| Inter-token latency | 24 | 47 |
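As a rough consistency check, the throughput numbers can be approximated from these latency figures by treating the 512 requests as 16 sequential waves of 32 concurrent requests. This ignores prefill/decode overlap, so it only lands within a few percent of the measured values:

```python
# Rough consistency check relating the latency and throughput tables above.
# Assumes 512 requests are served as 16 sequential waves of 32 concurrent
# requests; scheduling overlap is ignored, so results are approximate.
prompts, concurrency = 512, 32
tokens_in, tokens_out = 8192, 1024
ttft_s, itl_s = 1.825, 0.024  # mean time to first token / inter-token latency

per_request_s = ttft_s + (tokens_out - 1) * itl_s   # ~26.4 s per request
total_s = per_request_s * (prompts / concurrency)   # ~422 s for the full run

print(f"~{prompts * tokens_out / total_s:,.0f} output tok/s")               # ~1,242 (measured: 1,232)
print(f"~{prompts * (tokens_in + tokens_out) / total_s:,.0f} total tok/s")  # ~11,181 (measured: 11,092)
```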
Next steps
To get started with Qwen3.5, follow the directions above to deploy on Lambda's NVIDIA-accelerated infrastructure. Check out more information about the model below: