How to deploy Qwen3.5-397B-A17B on Lambda

TL;DR: token throughput (SGLang)

| Hardware configuration | Generation throughput (tok/s) | Total throughput (tok/s) | TTFT (ms) | ITL (ms) | Prompts | Tokens in | Tokens out | Parallel requests |
|---|---|---|---|---|---|---|---|---|
| 8× NVIDIA B200 GPUs | 1,232 | 11,092 | 1,825 | 24 | 512 | 4,194,304 | 524,288 | 32 |

The benchmark we test with uses an 8:1 input-to-output token ratio (8192 in/1024 out per request) to simulate coding workflows, where large code contexts are provided as input with shorter completions as output. This differs from chat assistant workloads, which typically have more balanced or output-heavy ratios.

Benchmark configuration:

vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen3.5-397B-A17B \
  --served-model-name qwen \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --num-prompts 512 \
  --max-concurrency 32 \
  --endpoint /v1/chat/completions

See Benchmarking Qwen3.5 below for the full command.

Background

Qwen3.5-397B-A17B is a 397 billion parameter multimodal vision-language model from Alibaba's Qwen team, featuring a hybrid Gated DeltaNet and Mixture-of-Experts (MoE) architecture. With only 17 billion parameters active per forward pass (23% fewer active parameters than the prior 235B model, despite 69% more total parameters), it achieves comparable or better performance at lower inference cost.

The model's improved performance is due to several key factors:

  • Gated Delta Networks (GDN): A hybrid attention architecture alternating between linear attention (Gated DeltaNet) and full attention layers in a 3:1 ratio, reducing KV-cache memory by approximately 4× (see the sketch after this list)
  • Scaling the MoE further: 512 experts (4× as many as Qwen3's 128) with 10+1 active experts per token
  • Multi-token prediction: Enables speculative decoding for 2-3× inference speedup
  • Unified vision-language: Early fusion training on trillions of multimodal tokens
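As a rough illustration of the first bullet: only the full-attention layers keep a conventional KV cache, so with a 3:1 GDN-to-full-attention ratio about one layer in four pays for KV storage. A minimal sketch (the layer count is a hypothetical example, not the model's actual depth):

layers=48                      # hypothetical depth, for illustration only
full_attn=$(( layers / 4 ))    # 3:1 ratio -> 1 in 4 layers uses full attention
echo "$full_attn of $layers layers keep a KV cache (~4x smaller KV cache)"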

Qwen3.5 extends context to 256k tokens natively, making it well-suited for agentic workflows, long-context applications, and code analysis.

Model specifications

Overview

  • Name: Qwen3.5-397B-A17B
  • Author: Alibaba Cloud
  • Architecture: MoE
  • License: Apache-2.0

Specifications

  • Total parameters: 397B (17B active per forward pass)
  • Context window: 256k tokens
  • Languages: 201 languages and dialects

Hardware Requirements

  • Minimal deployment:
    • NVIDIA HGX B200 (1.5 TB)
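A quick sanity check on that requirement: in BF16, the weights alone take roughly 2 bytes per parameter, so the 397B parameters need on the order of 800 GB of GPU memory before any KV cache or activations. A back-of-envelope sketch (the 1.5 TB figure is the node's aggregate HBM, as listed above):

params_b=397        # total parameters, in billions
bytes_per_param=2   # BF16
echo "~$(( params_b * bytes_per_param )) GB of weights vs ~1,500 GB of HBM on one HGX B200 node"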

Deployment and benchmarking

Deploying Qwen3.5-397B-A17B

Qwen3.5 requires NVIDIA HGX B200 to load the full 397B parameter model.

  1. Launch an instance with NVIDIA HGX B200 from the Lambda Cloud Console using the GPU Base 24.04 image.
  2. Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
  3. Start the SGLang server:
docker run \
    --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -e HF_HOME=/root/.cache/huggingface \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    lmsysorg/sglang:latest \
    python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --served-model-name qwen \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --mamba-ssm-dtype float32 \
    --tp-size 8 \
    --trust-remote-code \
    --mem-fraction-static 0.85

This launches an SGLang server with an OpenAI-compatible API on port 8000.
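On first launch, the container pulls the model weights from Hugging Face into ~/.cache/huggingface and then loads them across all eight GPUs, which can take a while. If you want to follow progress, you can watch GPU memory fill up from a second SSH session:

# Refresh GPU utilization and memory usage every 5 seconds
watch -n 5 nvidia-smi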

  4. Verify the server is running:
curl -X GET http://localhost:8000/v1/models \
  -H "Content-Type: application/json"

You should see qwen (the served model name) listed in the response.
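With the server up, you can also send a quick test request to the chat completions endpoint. A minimal example, assuming the server is reachable on localhost:8000 and uses the served model name qwen from the launch command above:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [
      {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256
  }'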

Benchmarking Qwen3.5

You can benchmark Qwen using vllm bench serve. The benchmark results in this article use an 8192 input/1024 output token workload to simulate coding assistant patterns.
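The vllm bench serve client ships with the vLLM package, so install vLLM on the machine you'll run the benchmark from (it can be the same instance that serves the model). A minimal setup sketch, assuming a recent vLLM release that includes the bench subcommand:

# Install the vLLM benchmark client in an isolated virtual environment
python3 -m venv ~/vllm-bench
source ~/vllm-bench/bin/activate
pip install --upgrade pip vllm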

Here's a minimal example to run against your server:

vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen3.5-397B-A17B \
  --served-model-name qwen \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --num-prompts 512 \
  --max-concurrency 32 \
  --endpoint /v1/chat/completions

Token throughput (NVIDIA HGX B200):

| Metric | Tokens per second |
|---|---|
| Output generation | 1,232 |
| Total (input & output) | 11,092 |

Latency details (8× NVIDIA B200 GPUs):

| Metric | Mean (ms) | P99 (ms) |
|---|---|---|
| Time to first token | 1,825 | 5,008 |
| Time per output token | 24 | 26 |
| Inter-token latency | 24 | 47 |
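The figures above were measured at 32 parallel requests. To see how throughput and inter-token latency trade off at other operating points, you can sweep the concurrency over the same workload; this is a sketch, so adjust the values to the load you care about:

for c in 8 16 32 64; do
  vllm bench serve \
    --backend openai-chat \
    --model Qwen/Qwen3.5-397B-A17B \
    --served-model-name qwen \
    --dataset-name random \
    --random-input-len 8192 \
    --random-output-len 1024 \
    --num-prompts 512 \
    --max-concurrency "$c" \
    --endpoint /v1/chat/completions
done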

Next steps

To get started with Qwen3.5, follow the directions above to deploy on Lambda's NVIDIA-accelerated infrastructure. Check out more information about the model below:

Download the Qwen3.5-397B-A17B weights on Hugging Face.

Ready to get started?

Create your Lambda Cloud account and launch NVIDIA GPU instances in minutes. Looking for long-term capacity? Talk to our team.