TL;DR: token throughput

SGLang:

| Hardware | Gen. throughput | TTFT | ITL |
|---|---|---|---|
| 2× NVIDIA B200 GPUs | 1,877 tok/s | 1,330 ms | 16 ms |
| 4× NVIDIA H100 GPUs | 1,810 tok/s | 1,960 ms | 16 ms |
| 4× NVIDIA A100 GPUs | 1,069 tok/s | 3,969 ms | 26 ms |

vLLM:

| Hardware | Gen. throughput | TTFT | ITL |
|---|---|---|---|
| 2× NVIDIA B200 GPUs | 1,721 tok/s | 4,602 ms | 14 ms |
| 4× NVIDIA H100 GPUs | 2,180 tok/s | 933 ms | 14 ms |
| 4× NVIDIA A100 GPUs | 851 tok/s | 6,997 ms | 31 ms |
Benchmark command

Re-run the benchmark:

```bash
vllm bench serve \
  --model Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder-next \
  --endpoint /v1/chat/completions \
  --random-input-len 8192 --random-output-len 1024 \
  --num-prompts 512 --max-concurrency 32
```

(8,192 input / 1,024 output tokens, 32 parallel requests)
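With this benchmark shape, total token throughput (prompt plus generated tokens) is roughly output-generation throughput scaled by (input + output) / output = 9, which is a useful sanity check when reading the results tables. A quick sketch:

```python
# Sanity-check the relationship between output-generation throughput and
# total throughput for a fixed benchmark shape (8,192 in / 1,024 out tokens).

INPUT_LEN = 8192
OUTPUT_LEN = 1024

def total_from_gen(gen_tok_s: float) -> float:
    """Total throughput counts prompt + generated tokens per second."""
    return gen_tok_s * (INPUT_LEN + OUTPUT_LEN) / OUTPUT_LEN

# One run reported 2,180 gen tok/s alongside 19,617 total tok/s:
print(round(total_from_gen(2180)))  # 19620, close to the reported 19,617
```

The small gap against reported totals comes from the benchmark sampling request lengths around the configured means rather than using them exactly.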
Background
Qwen3-Coder-Next is an 80-billion-parameter Mixture of Experts (MoE) code model from Alibaba's Qwen team that activates only 3 billion parameters per forward pass. It combines the hybrid Gated DeltaNet architecture from Qwen3-Next with specialized code training, achieving 70.6% on SWE-Bench Verified, the highest reported score for an open-weight model at the time of release. The model was trained on 800K agentic coding tasks using reinforcement learning with execution-based rewards.
The model's strong agentic performance comes from several key design choices:
- Hybrid architecture: Alternating Gated DeltaNet (linear attention) and full attention layers in a 3:1 ratio, reducing KV-cache memory while maintaining reasoning quality
- Agentic RL training: 800K tasks requiring multi-step tool use, file editing, and test execution
- Multi-token prediction: Trained to predict multiple tokens simultaneously, enabling speculative decoding for faster inference
Qwen3-Coder-Next supports 256k tokens natively, making it well-suited for repository-scale code understanding and long-running agentic workflows.
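The KV-cache saving from the 3:1 hybrid layout can be sketched directly: only the full-attention layers keep a KV cache that grows with context length, while the Gated DeltaNet layers carry a fixed-size recurrent state. A back-of-the-envelope comparison at the full 256K context (the layer count, head count, and head dimension below are illustrative assumptions, not the model's published config):

```python
# Rough KV-cache comparison: hybrid (3 linear : 1 full-attention layers)
# vs. a hypothetical all-full-attention model of the same depth.
# All shape numbers are illustrative assumptions.

def kv_cache_gib(full_attn_layers: int, seq_len: int,
                 kv_heads: int = 8, head_dim: int = 128,
                 bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB: 2 (K and V) x layers x seq x heads x dim x bytes."""
    return 2 * full_attn_layers * seq_len * kv_heads * head_dim * bytes_per_elem / 2**30

TOTAL_LAYERS = 48               # assumed depth
FULL_LAYERS = TOTAL_LAYERS // 4  # 3:1 linear:full ratio -> 1/4 of layers use full attention

hybrid = kv_cache_gib(FULL_LAYERS, seq_len=256_000)
dense = kv_cache_gib(TOTAL_LAYERS, seq_len=256_000)
print(f"hybrid: {hybrid:.1f} GiB, all-attention: {dense:.1f} GiB, "
      f"ratio: {dense / hybrid:.0f}x")
```

Whatever the exact shapes, the ratio falls out of the layer split alone: with full attention in 1 of every 4 layers, the growing part of the KV cache is about 4× smaller than in an equivalent all-attention model.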
Model specifications
Overview
- Name: Qwen3-Coder-Next
- Author: Alibaba Cloud
- Architecture: MoE + Gated DeltaNet
- License: Apache-2.0
Specifications
- Total parameters: 80B (3B active per forward pass)
- Context window: 256k tokens
Hardware requirements
- Minimal deployment:
    - 2× NVIDIA B200 GPUs (`--tp-size 2`)
    - 4× NVIDIA H100 GPUs (`--tp-size 4`)
    - 4× NVIDIA A100 GPUs (`--tp-size 4`)
Deployment and benchmarking
Deploying Qwen3-Coder-Next
Qwen3-Coder-Next requires 2× NVIDIA B200 GPUs, 4× NVIDIA H100 GPUs, or 4× NVIDIA A100 GPUs.
- Launch an instance with 2× NVIDIA B200 GPUs, 4× NVIDIA H100 GPUs, or 4× NVIDIA A100 GPUs from the Lambda Cloud Console using the GPU Base 24.04 image.
- Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the inference server:
SGLang:

```bash
# Use --tp-size 2 for 2× B200, --tp-size 4 for 4× H100 or 4× A100
docker run \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e HF_HOME=/root/.cache/huggingface \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model-path Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder-next \
  --tool-call-parser qwen3_coder \
  --tp-size 4 \
  --trust-remote-code \
  --mem-fraction-static 0.85
```
vLLM:

```bash
# Use --tensor-parallel-size 2 for 2× B200, --tensor-parallel-size 4 for 4× H100 or 4× A100
docker run \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e HF_HOME=/root/.cache/huggingface \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen3-Coder-Next \
  --served-model-name qwen3-coder-next \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 4 \
  --trust-remote-code
```
This launches an inference server with an OpenAI-compatible API on port 8000.
- Verify the server is running:
```bash
curl -X GET http://localhost:8000/v1/models \
  -H "Content-Type: application/json"
```
You should see qwen3-coder-next listed in the response.
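Once verified, any OpenAI-compatible client can talk to the server. A minimal sketch using only the Python standard library (the base URL and model name match the deployment above; the example prompt is arbitrary):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "qwen3-coder-next"  # matches --served-model-name above

def build_body(messages: list, max_tokens: int = 256) -> bytes:
    """JSON-encode an OpenAI-style chat completion request body."""
    return json.dumps({
        "model": MODEL,
        "messages": messages,
        "max_tokens": max_tokens,
    }).encode()

def chat(messages: list) -> dict:
    """POST a chat completion request to the local server, return the parsed reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=build_body(messages),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the server from step 3 to be running):
# reply = chat([{"role": "user", "content": "Write a Python hello world."}])
# print(reply["choices"][0]["message"]["content"])
```

The same request works through the official `openai` client by pointing its `base_url` at the server; no API key is required for a local deployment unless you configured one.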
Benchmarking results: Qwen3-Coder-Next
SGLang

Token throughput:
| Metric | 2× B200 | 4× H100 | 4× A100 |
|---|---|---|---|
| Output gen (tok/s) | 1,876 | 1,810 | 1,069 |
| Total (tok/s) | 16,890 | 16,289 | 9,622 |
Latency (Mean / P99 in ms):
| Metric | 2× B200 | 4× H100 | 4× A100 |
|---|---|---|---|
| TTFT | 1,330 / 3,644 | 1,960 / 4,489 | 3,969 / 8,901 |
| TPOT | 16 / 17 | 16 / 18 | 26 / 30 |
| ITL | 16 / 28 | 16 / 31 | 26 / 46 |
vLLM

Token throughput:
| Metric | 2× B200 | 4× H100 | 4× A100 |
|---|---|---|---|
| Output gen (tok/s) | 1,721 | 2,180 | 851 |
| Total (tok/s) | 15,492 | 19,617 | 7,659 |
Latency (Mean / P99 in ms):
| Metric | 2× B200 | 4× H100 | 4× A100 |
|---|---|---|---|
| TTFT | 4,601 / 65,289 | 933 / 4,268 | 6,997 / 96,435 |
| TPOT | 14 / 14 | 14 / 15 | 31 / 33 |
| ITL | 14 / 109 | 14 / 117 | 31 / 141 |
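Mean end-to-end request latency can be estimated from these tables as TTFT + TPOT × (output tokens − 1): the first token arrives after the TTFT, then each remaining token adds one inter-token interval. For example, using the 4× H100 row of the second configuration (TTFT 933 ms, TPOT 14 ms) and the benchmark's 1,024-token outputs:

```python
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Time to first token, plus one per-token interval for each remaining token."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

# TTFT 933 ms, TPOT 14 ms, 1,024-token outputs
print(f"{e2e_latency_ms(933, 14, 1024) / 1000:.1f} s")  # 15.3 s per request
```

This is a mean-case estimate; the P99 columns show that tail requests (especially TTFT under queueing) can be far slower.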
Next steps
Verify tool-use with tau-bench
Before using the model in production, confirm it handles function calling correctly with openbench:
```bash
uv run --with "openbench[tau_bench]" bench eval tau_bench_retail \
  --model openai/qwen3-coder-next \
  -M base_url=http://localhost:8000/v1 \
  --limit 10
```
Use as a Claude Code backend
Use your self-hosted model instead of Anthropic's API for local development:
```bash
export ANTHROPIC_BASE_URL=http://localhost:8000
claude
```