TL;DR: token throughput
| Engine | Hardware | Gen. throughput | Per-user gen. | Total throughput | TTFT | ITL |
| --- | --- | --- | --- | --- | --- | --- |
| SGLang | 1× NVIDIA HGX B200 | 1,345 tok/s | 42.0 tok/s/user | 6,727 tok/s | 1,073ms | 59ms |
| vLLM | 1× NVIDIA HGX B200 | 1,265 tok/s | 39.5 tok/s/user | 6,327 tok/s | 1,317ms | 58ms |
Benchmark command
The benchmark uses a 4:1 input-to-output token ratio (8,192 in / 2,048 out per request, 512 prompts, 32 parallel requests) to simulate a realistic Claude Code backend workload, where large tool results, file reads, and conversation history dominate the input while the model returns a moderate response plus thinking-mode reasoning. Per-user generation throughput is the aggregate generation tok/s divided by the concurrency level, i.e., the rate at which each individual user sees their response stream back.
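As a sanity check, the per-user figure follows directly from the aggregate numbers. Using the SGLang run's aggregate generation throughput from the results tables later in this post:

```python
# Per-user generation throughput = aggregate generation tok/s / concurrency.
CONCURRENCY = 32            # --max-concurrency in the benchmark command
aggregate_tok_s = 1345.44   # aggregate output generation throughput (SGLang run)

per_user = aggregate_tok_s / CONCURRENCY
print(round(per_user, 1))   # 42.0, in line with the table's 42.04 tok/s/user
```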
Re-run the benchmark:
vllm bench serve \
--model zai-org/GLM-5.1-FP8 \
--served-model-name glm-5.1-fp8 \
--endpoint /v1/chat/completions \
--random-input-len 8192 --random-output-len 2048 \
--num-prompts 512 --max-concurrency 32
(8192 in / 2048 out tokens, 32 parallel requests)
Background
GLM-5.1 is an incremental update to Z.ai's GLM-5, maintaining the same architecture as GLM-5 (40B active per token, 256 routed experts with top-8 routing plus 1 shared expert, Multi-head Latent Attention combined with DeepSeek Sparse Attention, and a Multi-Token Prediction head for speculative decoding).
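The routing scheme above can be sketched in a few lines. This is an illustrative toy, not Z.ai's implementation: for each token, the router scores all 256 routed experts, keeps the top 8, and softmax-normalizes their scores into mixing weights (the shared expert is always applied in addition).

```python
# Toy sketch of top-8-of-256 expert routing for a single token's router logits.
import math
import random

NUM_ROUTED_EXPERTS = 256
TOP_K = 8

def route(router_logits):
    """Return the indices of the top-k experts and their softmax weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i])[-TOP_K:]
    m = max(router_logits[i] for i in top)          # subtract max for stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
experts, weights = route(logits)
# A token's MoE output then combines these 8 experts plus the always-on
# shared expert: sum(w * expert(x) for ...) + shared_expert(x).
```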
Compared with GLM-5, GLM-5.1 delivers significant improvements in coding, agentic tool use, reasoning, role-play, and long-horizon agentic tasks (e.g. CUDA kernel optimization):
| Benchmark | GLM-5.1 (FP8) | GLM-5 (FP8, reference) |
| --- | --- | --- |
| Terminal-Bench 2 | 63.5 | 56.2 |
| AIME 25 | 95.3 | — |
| HLE | 31.0 | 30.5 |
| GPQA | 86.2 | — |
| SWE-Bench Pro | 58.4 | — |
The biggest behavioral change vs GLM-5 is that thinking mode is enabled by default. GLM-5.1 uses the same Interleaved and Preserved Thinking patterns as GLM-5: use Interleaved Thinking for general chat, and Interleaved + Preserved Thinking for agentic workflows such as Claude Code, Roo Code, or Kilo Code. See Z.ai's thinking-mode docs for details.
The chat template also changes: GLM-5.1 supports Claude-style deferred tool loading (tools with defer_loading=True do not appear in the system prompt; they appear in tool results instead), allows empty reasoning content in assistant messages, and accepts both List[tool] and List[tool.function] for SGLang compatibility.
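To illustrate the last point, both of the following `tools` payload shapes describe the same function. The tool name and schema here are hypothetical, used only to show the two accepted structures:

```python
# Hypothetical tool definition, for illustration only.
read_file = {
    "name": "read_file",
    "description": "Read a file from the workspace.",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

# OpenAI-style List[tool]: each entry wraps the function definition.
tools_wrapped = [{"type": "function", "function": read_file}]

# List[tool.function] (SGLang compatibility): bare function definitions.
tools_bare = [read_file]
```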
Model specifications
Overview
- Name: GLM-5.1
- Author: Z.ai (zai-org)
- Architecture: MoE with Multi-head Latent Attention (MLA) + DeepSeek Sparse Attention (DSA), Multi-Token Prediction (MTP) head
- License: MIT
Specifications
- Total parameters: 744B (40B active per token)
- Routed experts: 256 (top-8 active per token) + 1 shared expert; first 3 layers dense
- Context window: 202,752 tokens
Recommended Lambda VRAM configuration
- Minimal deployment:
  - 1× NVIDIA HGX B200 (8× NVIDIA B200 GPU system) is required to load the 744B-parameter model. Use the FP8 quantized version (`zai-org/GLM-5.1-FP8`) for the fastest throughput.
Deployment and benchmarking
Deploying GLM-5.1
GLM-5.1 requires, at minimum, 1× NVIDIA HGX B200 to load the 744B-parameter model.
- Launch an instance from the Lambda Cloud Console using the GPU Base 24.04 image (`1x NVIDIA HGX B200 (8x NVIDIA B200 GPUs)`).
- Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the inference server using either SGLang or vLLM:
Important: Use SGLang 0.5.10, not `0.5.10rc0`; the release candidate has a known flashmla bug that was fixed in the 0.5.10 release.
docker run \
--gpus all \
-p 8000:8000 \
--ipc=host \
-e HF_HOME=/root/.cache/huggingface \
-v ~/.cache/huggingface:/root/.cache/huggingface \
lmsysorg/sglang:v0.5.10-cu12 \
python -m sglang.launch_server \
--model-path zai-org/GLM-5.1-FP8 \
--host 0.0.0.0 \
--port 8000 \
--tp-size 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.85 \
--served-model-name glm-5.1-fp8
Important: Upgrade `transformers` to v5.3.0 or later inside the vLLM container before launching. The base vLLM 0.19.0 image ships an older `transformers` that does not yet support GLM-5.1's chat template.
docker run \
--gpus all \
-p 8000:8000 \
--ipc=host \
-e HF_HOME=/root/.cache/huggingface \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--entrypoint /bin/bash \
vllm/vllm-openai:v0.19.0 \
-c "pip install --upgrade 'transformers>=5.3.0' && \
vllm serve zai-org/GLM-5.1-FP8 \
--tensor-parallel-size 8 \
--max-model-len 202752 \
--max-num-seqs 64 \
--trust-remote-code \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 3 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.1-fp8 \
--port 8000 \
--host 0.0.0.0"
Verify the server
Either command launches an inference server with an OpenAI-compatible API on port 8000. Verify it:
curl -X GET http://localhost:8000/v1/models \
-H "Content-Type: application/json"
You should see glm-5.1-fp8 listed in the response.
Recommended inference parameters
Z.ai's recommended defaults for GLM-5.1:
| Scenario | Temperature | Top-p | Max new tokens | Notes |
| --- | --- | --- | --- | --- |
| Default (most tasks) | 1.0 | 0.95 | 131,072 | Interleaved Thinking (default) |
| Terminal-Bench 2 | 0.7 | 1.0 | 16,384 | Preserved Thinking ON; context 202,752 |
| τ²-Bench (tool use) | 0 | — | 16,384 | Preserved Thinking ON |
For multi-turn agentic tasks (τ²-Bench, Terminal-Bench 2), enable Preserved Thinking by sending chat_template_kwargs with enable_thinking: true and clear_thinking: false. To disable thinking entirely, send chat_template_kwargs: {"enable_thinking": false}.
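As a concrete sketch, here is a minimal request body enabling Preserved Thinking with the default sampling parameters from the table above. It assumes the server launched earlier, serving the model as `glm-5.1-fp8`; when using the OpenAI Python client, pass `chat_template_kwargs` through `extra_body`:

```python
import json

def build_request(messages, preserved_thinking=False):
    """Build a /v1/chat/completions body using Z.ai's recommended defaults."""
    body = {
        "model": "glm-5.1-fp8",
        "messages": messages,
        "temperature": 1.0,     # default-scenario settings from the table
        "top_p": 0.95,
        "max_tokens": 131072,
    }
    if preserved_thinking:
        # Keep reasoning content across turns for agentic workflows.
        body["chat_template_kwargs"] = {
            "enable_thinking": True,
            "clear_thinking": False,
        }
    return body

req = build_request([{"role": "user", "content": "Refactor this function."}],
                    preserved_thinking=True)
print(json.dumps(req, indent=2))
```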
Benchmarking results: GLM-5.1
Token throughput (SGLang):
| Metric | 1× NVIDIA HGX B200 |
| --- | --- |
| Output gen (tok/s) | 1,345.44 |
| Per-user gen (tok/s/user) | 42.04 |
| Total (tok/s) | 6,727.19 |
Latency (Mean / P99 in ms):
| Metric | 1× NVIDIA HGX B200 |
| --- | --- |
| TTFT | 1,072.61 / 14,237.85 |
| TPOT | 22.83 / 30.60 |
| ITL | 58.61 / 452.50 |
Token throughput (vLLM):
| Metric | 1× NVIDIA HGX B200 |
| --- | --- |
| Output gen (tok/s) | 1,265.45 |
| Per-user gen (tok/s/user) | 39.55 |
| Total (tok/s) | 6,327.24 |
Latency (Mean / P99 in ms):
| Metric | 1× NVIDIA HGX B200 |
| --- | --- |
| TTFT | 1,316.86 / 14,853.08 |
| TPOT | 24.34 / 30.26 |
| ITL | 57.79 / 530.82 |
Next steps
To get started with GLM-5.1, follow the directions above. Read more about the model:
- Download GLM-5.1-FP8 on Hugging Face
- Z.ai thinking-mode documentation
- SGLang GLM-5 cookbook (same architecture, applies to GLM-5.1)