How to deploy Qwen3-Coder-Next on Lambda

TL;DR: token throughput

SGLang

Hardware Gen. throughput TTFT ITL
2× NVIDIA B200 GPUs 1,877 tok/s 1,330ms 16ms
4× NVIDIA H100 GPUs 1,810 tok/s 1,960ms 16ms
4× NVIDIA A100 GPUs 1,069 tok/s 3,969ms 26ms

vLLM

Hardware Gen. throughput TTFT ITL
2× NVIDIA B200 GPUs 1,721 tok/s 4,602ms 14ms
4× NVIDIA H100 GPUs 2,180 tok/s 933ms 14ms
4× NVIDIA A100 GPUs 851 tok/s 6,997ms 31ms

Benchmark command

Re-run the benchmark:

vllm bench serve \
  --model Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder-next \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --random-input-len 8192 --random-output-len 1024 \
  --num-prompts 512 --max-concurrency 32

(8192 in/1024 out tokens, 32 parallel requests)

Background

Qwen3-Coder-Next is an 80-billion-parameter Mixture of Experts (MoE) code model from Alibaba's Qwen team, activating only 3 billion parameters per forward pass. It combines the hybrid Gated DeltaNet architecture from Qwen-Next with specialized code training, achieving 70.6% on SWE-Bench Verified, the highest reported score for an open-weight model (at the time of this model's release). The model was trained on 800k agentic coding tasks using reinforcement learning with execution-based rewards.

The model's strong agentic performance comes from several key design choices:

  • Hybrid architecture: Alternating Gated DeltaNet (linear attention) and full attention layers in a 3:1 ratio, reducing KV-cache memory while maintaining reasoning quality
  • Agentic RL training: 800K tasks requiring multi-step tool use, file editing, and test execution
  • Multi-token prediction: Trained to predict multiple tokens simultaneously, enabling speculative decoding for faster inference
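The 3:1 interleaving can be sketched as a simple layer schedule. This is an illustration only; the layer count and layer names below are made up and do not reflect the model's actual configuration:

```python
# Sketch of a 3:1 hybrid layer schedule: three Gated DeltaNet (linear
# attention) layers for every full-attention layer. Layer count is
# illustrative, not the model's real depth.
def hybrid_schedule(num_layers: int, ratio: int = 3) -> list[str]:
    layers = []
    for i in range(num_layers):
        # Every (ratio + 1)-th layer is full attention; the rest are linear.
        if i % (ratio + 1) == ratio:
            layers.append("full_attention")
        else:
            layers.append("linear_deltanet")
    return layers

schedule = hybrid_schedule(8)
print(schedule)
```

Only one layer in four keeps a KV cache that grows with sequence length, which is where the memory savings at long context come from.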

Qwen3-Coder-Next supports 256k tokens natively, making it well-suited for repository-scale code understanding and long-running agentic workflows.

Model specifications

Overview

  • Name: Qwen3-Coder-Next
  • Author: Alibaba Cloud
  • Architecture: MoE + Gated DeltaNet
  • License: Apache-2.0

Specifications

  • Total parameters: 80B (3B active per forward pass)
  • Context window: 256k tokens

Hardware requirements

  • Minimal deployment:
    • 2× NVIDIA B200 GPUs (--tp-size 2)
    • 4× NVIDIA H100 GPUs (--tp-size 4)
    • 4× NVIDIA A100 GPUs (--tp-size 4)
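A back-of-the-envelope check on why these configurations fit, assuming unquantized BF16 weights (2 bytes per parameter) and the 192 GB B200 / 80 GB H100 and A100 SKUs. Real deployments also need headroom for KV cache, activations, and CUDA graphs, which is why a single 80 GB GPU is not enough:

```python
# Rough weight-memory estimate for an 80B-parameter model stored in BF16.
# All experts must be resident even though only 3B parameters are active
# per forward pass.
params = 80e9
bytes_per_param = 2  # BF16
weights_gb = params * bytes_per_param / 1e9  # 160 GB of weights

configs = {
    "2x B200 (192 GB each)": 2 * 192,
    "4x H100 (80 GB each)": 4 * 80,
    "4x A100 (80 GB each)": 4 * 80,
}
for name, total_gb in configs.items():
    headroom = total_gb - weights_gb
    print(f"{name}: {total_gb} GB total, {headroom:.0f} GB left for KV cache and activations")
```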

Deployment and benchmarking

Deploying Qwen3-Coder-Next

Qwen3-Coder-Next requires 2× NVIDIA B200 GPUs, 4× NVIDIA H100 GPUs, or 4× NVIDIA A100 GPUs.

  1. Launch an instance with one of these GPU configurations from the Lambda Cloud Console using the GPU Base 24.04 image.
  2. Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
  3. Start the inference server with either SGLang or vLLM:

Using SGLang:

# Use --tp-size 2 for 2× B200, --tp-size 4 for 4× H100 or 4× A100
docker run \
    --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -e HF_HOME=/root/.cache/huggingface \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    lmsysorg/sglang:latest \
    python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model-path Qwen/Qwen3-Coder-Next \
    --served-model-name qwen3-coder-next \
    --tool-call-parser qwen3_coder \
    --tp-size 4 \
    --trust-remote-code \
    --mem-fraction-static 0.85
Using vLLM:

# Use --tensor-parallel-size 2 for 2× B200, --tensor-parallel-size 4 for 4× H100 or 4× A100
docker run \
    --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -e HF_HOME=/root/.cache/huggingface \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen3-Coder-Next \
    --served-model-name qwen3-coder-next \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --tensor-parallel-size 4 \
    --trust-remote-code

This launches an inference server with an OpenAI-compatible API on port 8000.

  4. Verify the server is running:
curl -X GET http://localhost:8000/v1/models \
  -H "Content-Type: application/json"

You should see qwen3-coder-next listed in the response.
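Once the server responds, any OpenAI-compatible client can query it. A minimal sketch using only the standard library; the URL and model name match the launch flags above, and the local server does not require an API key:

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request for the local server.
payload = {
    "model": "qwen3-coder-next",  # must match --served-model-name
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the server up, send the request and print the completion:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```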

Benchmarking results: Qwen3-Coder-Next

SGLang

Token throughput:

Metric 2× B200 4× H100 4× A100
Output gen (tok/s) 1,876 1,810 1,069
Total (tok/s) 16,890 16,289 9,622

Latency (Mean / P99 in ms):

Metric 2× B200 4× H100 4× A100
TTFT 1,330 / 3,644 1,960 / 4,489 3,969 / 8,901
TPOT 16 / 17 16 / 18 26 / 30
ITL 16 / 28 16 / 31 26 / 46

vLLM

Token throughput:

Metric 2× B200 4× H100 4× A100
Output gen (tok/s) 1,721 2,180 851
Total (tok/s) 15,492 19,617 7,659

Latency (Mean / P99 in ms):

Metric 2× B200 4× H100 4× A100
TTFT 4,601 / 65,289 933 / 4,268 6,997 / 96,435
TPOT 14 / 14 14 / 15 31 / 33
ITL 14 / 109 14 / 117 31 / 141
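As a sanity check on these tables, total throughput should be roughly the output-generation rate scaled by (input + output) / output tokens, since the benchmark processes 8192 prompt tokens for every 1024 generated. Taking the 2× B200 output rate from the first table:

```python
# Total throughput ≈ output throughput × (input + output) / output tokens.
input_len, output_len = 8192, 1024
scale = (input_len + output_len) / output_len  # 9.0

output_tok_s = 1876  # 2× B200 output generation rate, first table
expected_total = output_tok_s * scale
print(f"expected total ≈ {expected_total:,.0f} tok/s")  # close to the reported 16,890
```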

Next steps


Verify tool-use with tau-bench

Before using the model in production, confirm it handles function calling correctly by running tau-bench via openbench:

uv run --with openbench[tau_bench] bench eval tau_bench_retail \
  --model openai/qwen3-coder-next \
  -M base_url=http://localhost:8000/v1 \
  --limit 10
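For a quicker manual check, you can send a single request with a tool definition and confirm the server returns a structured tool call rather than plain text. A sketch using the standard library; the `get_weather` tool is a made-up example:

```python
import json
import urllib.request

# One OpenAI-style request with a single tool defined. A correctly
# configured server (--tool-call-parser qwen3_coder) should answer with
# a structured tool_calls entry, not free-form text.
payload = {
    "model": "qwen3-coder-next",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for this smoke test
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the server up, inspect the tool call:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"].get("tool_calls"))
```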

Use as a Claude Code backend

Use your self-hosted model instead of Anthropic's API for local development:

export ANTHROPIC_BASE_URL=http://localhost:8000
# Optionally pin the model; the name must match --served-model-name
export ANTHROPIC_MODEL=qwen3-coder-next
claude

Ready to get started?

Create your Lambda Cloud account and launch NVIDIA GPU instances in minutes. Looking for long-term capacity? Talk to our team.