How to deploy Qwen3.5-122B-A10B on Lambda

TL;DR: token throughput

SGLang:

Hardware   Gen. throughput   TTFT       ITL
4× B200    2,197 tok/s       1,156 ms   13 ms
8× H100    1,585 tok/s       2,613 ms   18 ms
8× A100    930 tok/s         4,602 ms   30 ms

vLLM:

Hardware   Gen. throughput   TTFT       ITL
4× B200    1,817 tok/s       4,904 ms   13 ms
8× H100    1,843 tok/s       1,060 ms   16 ms
8× A100    744 tok/s         7,612 ms   35 ms

Benchmark command

Re-run the benchmark:

vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen3.5-122B-A10B \
  --served-model-name qwen35-122b \
  --endpoint /v1/chat/completions \
  --random-input-len 8192 --random-output-len 1024 \
  --num-prompts 512 --max-concurrency 32

(8192 in/1024 out tokens, 32 parallel requests)
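Under this workload, the total token throughput reported in the results below is roughly the generation throughput scaled by the total-to-output token ratio. A quick sanity check, assuming that relationship holds (it ignores scheduling overhead, so expect small rounding differences):

```python
# Sanity check: total throughput ~= gen throughput * (input + output) / output
# for the 8192-in / 1024-out benchmark workload used in this guide.
INPUT_LEN = 8192
OUTPUT_LEN = 1024

def total_tok_s(gen_tok_s: float) -> float:
    """Estimate total (prefill + decode) throughput from generation throughput."""
    return gen_tok_s * (INPUT_LEN + OUTPUT_LEN) / OUTPUT_LEN

for hw, gen in [("4x B200", 2197), ("8x H100", 1585), ("8x A100", 930)]:
    print(f"{hw}: ~{total_tok_s(gen):,.0f} total tok/s")
```

The estimates land within a few tokens per second of the totals in the benchmark tables below.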

Background

Qwen3.5-122B-A10B is part of the Qwen3.5 model family, released alongside the flagship Qwen3.5-397B-A17B. The family includes a range of MoE and dense models to suit different deployment constraints:

  • MoE models: 397B-A17B, 122B-A10B, 35B-A3B (the "A" indicates active parameters per forward pass)
  • Dense model: 27B (standard transformer, no routing overhead)
  • Base variants: Available with -Base suffix for fine-tuning

With 122 billion total parameters and only 10 billion active per token, Qwen3.5-122B-A10B offers a middle ground between the flagship 397B and smaller models. It uses the same hybrid Gated DeltaNet + MoE architecture, combining linear attention layers with full attention in a 3:1 ratio for efficient long-context processing.

The model supports 256k tokens natively and shares the same training innovations as its larger sibling: multi-token prediction for speculative decoding, 512 experts with sparse activation, and unified vision-language capabilities.
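The 3:1 ratio means three Gated DeltaNet (linear attention) layers for every full-attention layer. A toy sketch of that interleaving pattern, where the layer count is illustrative and not the model's actual depth:

```python
# Toy sketch of the hybrid layer stack: three linear-attention (Gated DeltaNet)
# layers followed by one full-attention layer, repeated. NUM_LAYERS is an
# illustrative assumption, not Qwen3.5-122B-A10B's real depth.
NUM_LAYERS = 48

def layer_types(num_layers: int) -> list[str]:
    # Every 4th layer is full attention; the rest are linear attention.
    return ["full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
            for i in range(num_layers)]

stack = layer_types(NUM_LAYERS)
print(stack[:8])
print("linear:full =", stack.count("gated_deltanet"), ":", stack.count("full_attention"))
```

The linear layers keep per-token cost constant in sequence length, which is what makes the 256k context window practical to serve.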

Model specifications

Overview

  • Name: Qwen3.5-122B-A10B
  • Author: Alibaba Cloud
  • Architecture: MoE + Gated DeltaNet
  • License: Apache-2.0

Specifications

  • Total parameters: 122B (10B active per forward pass)
  • Context window: 256k tokens
  • Languages: 201 languages and dialects

Hardware requirements

  • Minimum hardware (any one of the following):
    • 4× NVIDIA B200 GPU (--tp-size 4)
    • 8× NVIDIA H100 GPU (--tp-size 8)
    • 8× NVIDIA A100 GPU (--tp-size 8)
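The multi-GPU requirement follows from weight memory alone: 122B parameters in BF16 occupy roughly 244 GB before any KV cache or activations. A back-of-the-envelope check (the per-GPU capacities are assumed typical SXM sizes):

```python
# Back-of-the-envelope: do BF16 weights (2 bytes/param) fit in aggregate GPU
# memory? This ignores KV cache and activation overhead, which is why each
# config needs headroom well beyond the raw weight footprint.
PARAMS = 122e9
BYTES_PER_PARAM = 2  # BF16

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~244 GB

# Assumed per-GPU capacities (common SXM parts).
configs = {"4x B200": 4 * 180, "8x H100": 8 * 80, "8x A100": 8 * 80}
for name, total_gb in configs.items():
    print(f"{name}: {total_gb} GB aggregate, weights fit: {weights_gb < total_gb}")
```

No single current GPU holds 244 GB of weights, hence the tensor-parallel settings in the launch commands below.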

Deployment and benchmarking

Deploying Qwen3.5-122B-A10B

Qwen3.5-122B-A10B requires 4× NVIDIA B200, 8× NVIDIA H100, or 8× NVIDIA A100 GPUs to load the model weights.

  1. Launch an instance with 4× B200, 8× H100, or 8× A100 from the Lambda Cloud Console using the GPU Base 24.04 image.
  2. Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
  3. Start the inference server with either SGLang or vLLM:
# Option A: SGLang. Use --tp-size 4 for 4× B200, --tp-size 8 for 8× H100 or 8× A100
docker run \
    --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -e HF_HOME=/root/.cache/huggingface \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    lmsysorg/sglang:latest \
    python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model-path Qwen/Qwen3.5-122B-A10B \
    --served-model-name qwen35-122b \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --tp-size 8 \
    --trust-remote-code \
    --mem-fraction-static 0.85
# Option B: vLLM. Use --tensor-parallel-size 4 for 4× B200, --tensor-parallel-size 8 for 8× H100 or 8× A100
docker run \
    --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -e HF_HOME=/root/.cache/huggingface \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen3.5-122B-A10B \
    --served-model-name qwen35-122b \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --tensor-parallel-size 8 \
    --trust-remote-code

This launches an inference server with an OpenAI-compatible API on port 8000.

  4. Verify the server is running:
curl -X GET http://localhost:8000/v1/models \
  -H "Content-Type: application/json"

You should see qwen35-122b listed in the response.
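With the server up, you can send requests to the OpenAI-compatible chat endpoint. A minimal stdlib-only client sketch (the prompt is just an example; run it on the instance, or swap localhost for your instance's address):

```python
import json
import urllib.error
import urllib.request

# Chat-completions request against the server launched in step 3. The model
# name must match --served-model-name from the docker command.
payload = {
    "model": "qwen35-122b",
    "messages": [{"role": "user", "content": "In one sentence, what is a MoE model?"}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
except urllib.error.URLError as err:
    # Server not reachable; check that the container is running on port 8000.
    print(f"request failed: {err}")
```

Any OpenAI-compatible SDK works the same way by pointing its base URL at http://localhost:8000/v1.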

Benchmarking results: Qwen3.5-122B-A10B

SGLang token throughput:

Metric 4× B200 8× H100 8× A100
Output gen (tok/s) 2,197 1,585 930
Total (tok/s) 19,770 14,262 8,372

Latency (Mean / P99 in ms; TTFT = time to first token, TPOT = time per output token, ITL = inter-token latency):

Metric 4× B200 8× H100 8× A100
TTFT 1,156 / 3,226 2,613 / 5,533 4,602 / 9,964
TPOT 13 / 14 18 / 20 30 / 34
ITL 13 / 15 18 / 37 30 / 93

vLLM token throughput:

Metric 4× B200 8× H100 8× A100
Output gen (tok/s) 1,817 1,843 744
Total (tok/s) 16,355 16,589 6,700

Latency (Mean / P99 in ms):

Metric 4× B200 8× H100 8× A100
TTFT 4,904 / 68,885 1,060 / 5,328 7,612 / 105,377
TPOT 13 / 13 16 / 17 36 / 41
ITL 13 / 102 16 / 135 35 / 181

Next steps


Verify tool-use with tau-bench

Before using the model in production, confirm it handles function-calling correctly by running a tau-bench suite through openbench:

uv run --with openbench[tau_bench] bench eval tau_bench_retail \
  --model openai/qwen35-122b \
  -M base_url=http://localhost:8000/v1 \
  --limit 10

Use as a Claude Code backend

Use your self-hosted model instead of Anthropic's API for local development:

export ANTHROPIC_BASE_URL=http://localhost:8000
claude

Ready to get started?

Create your Lambda Cloud account and launch NVIDIA GPU instances in minutes. Looking for long-term capacity? Talk to our team.