TL;DR: token throughput (TTFT = time to first token; ITL = inter-token latency)

SGLang:

| Hardware | Gen. throughput | TTFT | ITL |
|---|---|---|---|
| 4× B200 | 2,197 tok/s | 1,156 ms | 13 ms |
| 8× H100 | 1,585 tok/s | 2,613 ms | 18 ms |
| 8× A100 | 930 tok/s | 4,602 ms | 30 ms |

vLLM:

| Hardware | Gen. throughput | TTFT | ITL |
|---|---|---|---|
| 4× B200 | 1,817 tok/s | 4,904 ms | 13 ms |
| 8× H100 | 1,843 tok/s | 1,060 ms | 16 ms |
| 8× A100 | 744 tok/s | 7,612 ms | 35 ms |
Benchmark command

Re-run the benchmark:

```shell
vllm bench serve \
  --model Qwen/Qwen3.5-122B-A10B \
  --served-model-name qwen35-122b \
  --endpoint /v1/chat/completions \
  --random-input-len 8192 --random-output-len 1024 \
  --num-prompts 512 --max-concurrency 32
```

(8192 in / 1024 out tokens, 32 parallel requests)
Background
Qwen3.5-122B-A10B is part of the Qwen3.5 model family, released alongside the flagship Qwen3.5-397B-A17B. The family includes a range of MoE and dense models to suit different deployment constraints:
- MoE models: 397B-A17B, 122B-A10B, 35B-A3B (the "A" indicates active parameters per forward pass)
- Dense model: 27B (standard transformer, no routing overhead)
- Base variants: Available with a `-Base` suffix for fine-tuning
With 122 billion total parameters and only 10 billion active per token, Qwen3.5-122B-A10B offers a middle ground between the flagship 397B and smaller models. It uses the same hybrid Gated DeltaNet + MoE architecture, combining linear attention layers with full attention in a 3:1 ratio for efficient long-context processing.
The model supports 256k tokens natively and shares the same training innovations as its larger sibling: multi-token prediction for speculative decoding, 512 experts with sparse activation, and unified vision-language capabilities.
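The 3:1 interleaving of linear and full attention can be pictured with a short sketch. The 48-layer depth below is an arbitrary illustration, not the model's published configuration:

```python
# Illustrative sketch of a 3:1 hybrid attention schedule: three linear-attention
# (Gated DeltaNet) layers followed by one full-attention layer, repeated.
# The total layer count (48) is an assumption for illustration only.

def layer_schedule(num_layers: int, ratio: int = 3) -> list[str]:
    """Return the per-layer attention type for a (ratio):1 hybrid stack."""
    schedule = []
    for i in range(num_layers):
        # Every (ratio + 1)-th layer uses full attention; the rest use
        # linear attention, which keeps long-context cost near-linear.
        if (i + 1) % (ratio + 1) == 0:
            schedule.append("full_attention")
        else:
            schedule.append("linear_attention")
    return schedule

sched = layer_schedule(48)
print(sched[:8])  # three linear layers, then one full-attention layer, repeating
```

Only every fourth layer pays the quadratic cost of full attention, which is why long-context prefill stays tractable at 256k tokens.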
Model specifications
Overview
- Name: Qwen3.5-122B-A10B
- Author: Alibaba Cloud
- Architecture: MoE + Gated DeltaNet
- License: Apache-2.0
Specifications
- Total parameters: 122B (10B active per forward pass)
- Context window: 256k tokens
- Languages: 201 languages and dialects
Hardware requirements
- Minimal deployment:
  - 4× NVIDIA B200 GPU (`--tp-size 4`)
  - 8× NVIDIA H100 GPU (`--tp-size 8`)
  - 8× NVIDIA A100 GPU (`--tp-size 8`)
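These GPU counts line up with a simple weight-memory estimate. The sketch below assumes BF16 weights (2 bytes per parameter) and ignores KV cache, activations, and runtime overhead, so it is a lower bound on what each configuration must hold:

```python
# Back-of-the-envelope weight-memory check for the GPU counts above.
# Assumes BF16 (2 bytes/parameter) weights and ignores KV cache,
# activations, and framework overhead, so real headroom is smaller.

TOTAL_PARAMS = 122e9          # 122B total parameters
BYTES_PER_PARAM = 2           # BF16

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9   # ~244 GB

configs = {
    "4x B200 (192 GB each)": 4 * 192,
    "8x H100 (80 GB each)": 8 * 80,
    "8x A100 (80 GB each)": 8 * 80,
}

for name, capacity_gb in configs.items():
    print(f"{name}: {weights_gb:.0f} GB weights / {capacity_gb} GB total "
          f"-> fits: {weights_gb < capacity_gb}")
```

All three configurations clear the ~244 GB weight footprint, with the remaining memory going to KV cache, which is what the `--mem-fraction-static` flag below budgets for.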
Deployment and benchmarking
Deploying Qwen3.5-122B-A10B
Qwen3.5-122B-A10B requires 4× NVIDIA B200 GPU, 8× NVIDIA H100 GPU, or 8× NVIDIA A100 GPU to load the model.
- Launch an instance with 4× B200, 8× H100, or 8× A100 from the Lambda Cloud Console using the GPU Base 24.04 image.
- Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the inference server:

SGLang:

```shell
# Use --tp-size 4 for 4× B200, --tp-size 8 for 8× H100 or 8× A100
docker run \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e HF_HOME=/root/.cache/huggingface \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model-path Qwen/Qwen3.5-122B-A10B \
    --served-model-name qwen35-122b \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --tp-size 8 \
    --trust-remote-code \
    --mem-fraction-static 0.85
```
vLLM:

```shell
# Use --tensor-parallel-size 4 for 4× B200, --tensor-parallel-size 8 for 8× H100 or 8× A100
docker run \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e HF_HOME=/root/.cache/huggingface \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen3.5-122B-A10B \
  --served-model-name qwen35-122b \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 8 \
  --trust-remote-code
```
This launches an inference server with an OpenAI-compatible API on port 8000.
- Verify the server is running:

```shell
curl -X GET http://localhost:8000/v1/models \
  -H "Content-Type: application/json"
```

You should see `qwen35-122b` listed in the response.
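Once `qwen35-122b` appears, you can send a test chat completion. A minimal sketch using only the Python standard library; the prompt and `max_tokens` value are arbitrary examples, and the request only succeeds while the server above is running:

```python
# Build and send a test chat-completion request to the local server.
# Endpoint and model name match the docker commands above; the prompt is
# an arbitrary example. If the server is not running, the error branch
# prints the failure instead of raising.
import json
import urllib.request

payload = {
    "model": "qwen35-122b",  # matches --served-model-name
    "messages": [
        {"role": "user", "content": "Summarize tensor parallelism in one sentence."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
        print(body["choices"][0]["message"]["content"])
except OSError as err:
    print(f"Request failed (is the server running?): {err}")
```

Any OpenAI-compatible client works the same way against `http://localhost:8000/v1`.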
Benchmarking results: Qwen3.5-122B-A10B
SGLang:

Token throughput:
| Metric | 4× B200 | 8× H100 | 8× A100 |
|---|---|---|---|
| Output gen (tok/s) | 2,197 | 1,585 | 930 |
| Total (tok/s) | 19,770 | 14,262 | 8,372 |
Latency (Mean / P99 in ms):
| Metric | 4× B200 | 8× H100 | 8× A100 |
|---|---|---|---|
| TTFT | 1,156 / 3,226 | 2,613 / 5,533 | 4,602 / 9,964 |
| TPOT | 13 / 14 | 18 / 20 | 30 / 34 |
| ITL | 13 / 15 | 18 / 37 | 30 / 93 |
vLLM:

Token throughput:
| Metric | 4× B200 | 8× H100 | 8× A100 |
|---|---|---|---|
| Output gen (tok/s) | 1,817 | 1,843 | 744 |
| Total (tok/s) | 16,355 | 16,589 | 6,700 |
Latency (Mean / P99 in ms):
| Metric | 4× B200 | 8× H100 | 8× A100 |
|---|---|---|---|
| TTFT | 4,904 / 68,885 | 1,060 / 5,328 | 7,612 / 105,377 |
| TPOT | 13 / 13 | 16 / 17 | 36 / 41 |
| ITL | 13 / 102 | 16 / 135 | 35 / 181 |
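The output and total throughput figures are internally consistent with the 8192-in/1024-out workload: each request moves (8192 + 1024) / 1024 = 9 total tokens per generated token, so total throughput should be roughly 9× output throughput. A quick check against the first result table above:

```python
# Sanity-check the relationship between output and total token throughput.
# With 8192 input + 1024 output tokens per request, the server processes
# (8192 + 1024) / 1024 = 9 total tokens for every generated token.
# Figures are copied from the first result table above.
IN_LEN, OUT_LEN = 8192, 1024
ratio = (IN_LEN + OUT_LEN) / OUT_LEN   # 9.0

results = {
    "4x B200": (2197, 19770),   # (output tok/s, total tok/s)
    "8x H100": (1585, 14262),
    "8x A100": (930, 8372),
}

for hw, (output_tps, total_tps) in results.items():
    print(f"{hw}: total/output = {total_tps / output_tps:.2f} (expected ~{ratio:.0f})")
```

The small deviations from exactly 9× come from rounding in the reported figures.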
Next steps
Verify tool-use with tau-bench
Before using the model in production, confirm it handles function calling correctly with openbench's tau-bench retail task:
```shell
uv run --with openbench[tau_bench] bench eval tau_bench_retail \
  --model openai/qwen35-122b \
  -M base_url=http://localhost:8000/v1 \
  --limit 10
```
Use as a Claude Code backend
Use your self-hosted model instead of Anthropic's API for local development:
```shell
export ANTHROPIC_BASE_URL=http://localhost:8000
claude
```