TL;DR: token throughput (SGLang)
| Hardware configuration | Generation throughput (tok/s) | Total throughput (tok/s) | Mean TTFT (ms) | Mean ITL (ms) | Prompts | Tokens in | Tokens out | Parallel requests |
|---|---|---|---|---|---|---|---|---|
| 2× NVIDIA B200 GPU | 896 | 8,062 | 3,091 | 36 | 512 | 4,194,304 | 524,288 | 32 |
| 4× NVIDIA H100 Tensor Core GPU | 849 | 7,644 | 13,131 | 27 | 512 | 4,194,304 | 524,288 | 32 |
Benchmark configuration:
vllm bench serve \
--backend openai-chat \
--model MiniMaxAI/MiniMax-M2.5 \
--served-model-name minimax-m25 \
--dataset-name random \
--random-input-len 8192 \
--random-output-len 1024 \
--num-prompts 512 \
--max-concurrency 32 \
--endpoint /v1/chat/completions
See Benchmarking MiniMax-M2.5 for the full command.
Background
MiniMax-M2.5 is a 229B parameter Mixture-of-Experts (MoE) language model with only 10B parameters active per token. Positioned as "the world's first production-level model designed natively for agent scenarios," M2.5 uses 256 experts with 8 activated per token, enabling efficient inference while maintaining frontier-level capabilities.
MiniMax trained M2.5 using its novel Forge framework, an agent-native reinforcement learning (RL) training system that achieves a 40x training speedup through prefix-tree merging and windowed FIFO scheduling. M2.5 also introduces task completion time optimization, a reward-shaping approach that balances intelligence with speed, resulting in 37% faster task completion than its predecessor, M2.1.
On benchmarks, M2.5 achieves 80.2% on SWE-Bench Verified (up from M2.1's 74.0%), matching Claude Opus 4.6's speed while costing 10-20x less. The model also demonstrates strong cross-scaffold generalization, performing consistently across different agent frameworks, including Claude Code (80.2%), Droid (79.7%), and OpenCode (76.1%).
Model specifications
Overview
- Name: MiniMax-M2.5
- Author: MiniMaxAI
- Architecture: MoE
- License: MiniMax-Open (Modified MIT)
Specifications
- Total parameters: 229B (10B active per token)
- Context window: 192k tokens
- Languages: English, Spanish, French, German, Italian, Portuguese, Russian, Arabic, Hindi, Korean, Chinese, Japanese, Indonesian, Vietnamese, and Bengali
Recommended Lambda VRAM configuration
- Minimal deployment:
  - 2× B200 (--tp-size 2)
  - 4× H100 (--tp-size 4)
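As a rough back-of-envelope capacity check (this assumes the served checkpoint is about 1 byte per parameter, e.g. FP8, and uses approximate per-GPU HBM capacities; adjust for your actual weight precision):
# Approximate weight footprint vs. aggregate GPU memory (back-of-envelope only)
echo "Weights (~1 byte/param): ~229 GB"
echo "2× B200 (180 GB each):   $(( 2 * 180 )) GB"
echo "4× H100 (80 GB each):    $(( 4 * 80 )) GB"
Whatever remains of the --mem-fraction-static budget (0.85 of total GPU memory in the deployment command below) after the weights are loaded is used by SGLang for the KV cache pool.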
Deployment and benchmarking
Deploying MiniMax-M2.5
MiniMax-M2.5 requires 2× B200 or 4× H100 GPUs to load the model.
- Launch an instance with 2× B200 or 4× H100 from the Lambda Cloud Console using the GPU Base 24.04 image.
- Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
- Start the SGLang server:
# Use --tp-size 2 for 2× B200, --tp-size 4 for 4× H100
docker run \
--gpus all \
-p 8000:8000 \
--ipc=host \
-e HF_HOME=/root/.cache/huggingface \
-v ~/.cache/huggingface:/root/.cache/huggingface \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--host 0.0.0.0 \
--port 8000 \
--model-path MiniMaxAI/MiniMax-M2.5 \
--served-model-name minimax-m25 \
--tp-size 2 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--trust-remote-code \
--mem-fraction-static 0.85
This launches an SGLang server with an OpenAI-compatible API on port 8000.
- Verify the server is running:
curl -X GET http://localhost:8000/v1/models \
-H "Content-Type: application/json"
You should see minimax-m25 (the served model name) listed in the response.
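Once the model shows up in the list, you can optionally send a quick test request to the OpenAI-compatible chat completions endpoint. The prompt below is only an illustration; the model field must match the --served-model-name passed to SGLang (minimax-m25):
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "minimax-m25",
        "messages": [{"role": "user", "content": "Briefly explain what a Mixture-of-Experts model is."}],
        "max_tokens": 256
      }'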
Benchmarking MiniMax-M2.5
You can benchmark MiniMax-M2.5 using vllm bench serve. The benchmark results in this article use a workload of 8,192 input tokens and 1,024 output tokens per request.
Here's a minimal example to run against your server:
vllm bench serve \
--backend openai-chat \
--model MiniMaxAI/MiniMax-M2.5 \
--served-model-name minimax-m25 \
--dataset-name random \
--random-input-len 8192 \
--random-output-len 1024 \
--num-prompts 512 \
--max-concurrency 32 \
--endpoint /v1/chat/completions
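The aggregate token counts in the TL;DR table follow directly from these flags:
# 512 prompts × 8,192 input tokens and 512 prompts × 1,024 output tokens
echo "Input tokens:  $(( 512 * 8192 ))"   # 4,194,304
echo "Output tokens: $(( 512 * 1024 ))"   # 524,288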
Token throughput (2× NVIDIA B200):
| Metric | Tokens per second |
|---|---|
| Output generation | 896 |
| Total (input & output) | 8,062 |
Latency details (2× NVIDIA B200):
| Metric | Mean (ms) | P99 (ms) |
|---|---|---|
| Time to first token | 3,091 | 7,552 |
| Time per output token | 33 | 35 |
| Inter-token latency | 36 | 58 |
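As a rough sanity check on these numbers, generation throughput should be approximately the number of parallel requests divided by the mean inter-token latency (ignoring time spent in prefill):
# 32 concurrent requests × (1,000 ms / 36 ms per token)
echo $(( 32 * 1000 / 36 ))   # ≈ 888 tok/s, in line with the measured 896 tok/s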
Next steps
To get started with MiniMax-M2.5, follow the directions above to deploy on Lambda's infrastructure accelerated by NVIDIA. View additional resources about the model below: