How to deploy Nanbeige4.1-3B on Lambda

TL;DR: token throughput

SGLang:

Hardware             Gen. throughput   TTFT      ITL
1× NVIDIA B200 GPU   4,547 tok/s       766ms     6ms
1× NVIDIA H100 GPU   2,381 tok/s       1,619ms   12ms
1× NVIDIA A100 GPU   1,174 tok/s       3,830ms   29ms

vLLM:

Hardware             Gen. throughput   TTFT      ITL
1× NVIDIA B200 GPU   4,806 tok/s       526ms     6ms
1× NVIDIA H100 GPU   2,472 tok/s       822ms     12ms
1× NVIDIA A100 GPU   1,050 tok/s       1,480ms   29ms

Benchmark command

Re-run the benchmark:

vllm bench serve \
  --model Nanbeige/Nanbeige4.1-3B \
  --served-model-name nanbeige4.1-3b \
  --endpoint /v1/chat/completions \
  --random-input-len 8192 --random-output-len 1024 \
  --num-prompts 512 --max-concurrency 32

(8192 in / 1024 out tokens, 32 parallel requests)

Background

Nanbeige4.1-3B is a 3-billion-parameter code model designed for deep-search agentic workflows, where models must reason through 500+ tool-calling rounds to solve complex problems. Despite its small size, it achieves 76.9 on LiveCodeBench-V6 and 30.5 on SWE-Bench Verified, competitive with models 10x larger, while being efficient enough to run on a single GPU.

The model's strong performance relative to its size comes from targeted training innovations:

  • Complexity-aware code rewards: Reinforcement learning rewards scale with problem difficulty, preventing the model from gaming easy tasks.
  • Deep-search curriculum: Training on multi-step reasoning chains with hundreds of tool calls, not just single-turn completions.
  • Extended context: 262k token context window enables processing entire repositories without chunking.

Nanbeige4.1-3B is positioned as the first small model optimized for deep-search scenarios, making it well suited to edge deployment and cost-sensitive production workloads that need agentic capabilities. That said, test it thoroughly on your own benchmarks: this model has been branded a "benchmax" (accused of being trained to optimize specific benchmark scores rather than general capability).

Model specifications

Overview

  • Name: Nanbeige4.1-3B
  • Author: Nanbeige LLM Lab
  • Architecture: Dense Transformer
  • License: Apache-2.0

Specifications

  • Total parameters: 3B
  • Context window: 262k tokens
  • Languages: English, Chinese

Hardware requirements

  • Minimal deployment (either of the following):
    • 1× NVIDIA B200 GPU
    • 1× NVIDIA H100 GPU

Deployment and benchmarking

Deploying Nanbeige4.1-3B

Nanbeige4.1-3B can run on a single NVIDIA B200 GPU or a single NVIDIA H100 GPU.

  1. Launch an instance with 1× NVIDIA B200 GPU or 1× NVIDIA H100 GPU from the Lambda Cloud Console using the GPU Base 24.04 image.
  2. Connect to your instance via SSH or JupyterLab terminal. See Connecting to an instance for detailed instructions.
  3. Start the inference server. With SGLang:

docker run \
    --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -e HF_HOME=/root/.cache/huggingface \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    lmsysorg/sglang:latest \
    python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model-path Nanbeige/Nanbeige4.1-3B \
    --served-model-name nanbeige4.1-3b \
    --trust-remote-code \
    --mem-fraction-static 0.85

Or with vLLM:

docker run \
    --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -e HF_HOME=/root/.cache/huggingface \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model Nanbeige/Nanbeige4.1-3B \
    --served-model-name nanbeige4.1-3b \
    --trust-remote-code

This launches an inference server with an OpenAI-compatible API on port 8000.

  4. Verify the server is running:
curl -X GET http://localhost:8000/v1/models \
  -H "Content-Type: application/json"

You should see nanbeige4.1-3b listed in the response.
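Once the server is up, any OpenAI-compatible client can talk to it. For example, a chat completion request with curl (the prompt and sampling parameters here are illustrative, not prescriptive):

```shell
# Send a chat completion request to the local server started above.
# Assumes the server is listening on localhost:8000.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nanbeige4.1-3b",
    "messages": [
      {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
    "temperature": 0.2
  }'
```

The same endpoint works with the official OpenAI SDKs by pointing `base_url` at `http://localhost:8000/v1`.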

Benchmarking results: Nanbeige4.1-3B

SGLang

Token throughput:

Metric               1× B200    1× H100    1× A100
Output gen (tok/s)   4,547      2,381      1,174
Total (tok/s)        40,926     21,430     10,571

Latency (mean / P99 in ms):

Metric   1× B200        1× H100         1× A100
TTFT     766 / 2,214    1,619 / 3,555   3,830 / 9,784
TPOT     6 / 7          12 / 13         23 / 27
ITL      6 / 19         12 / 23         24 / 52

vLLM

Token throughput:

Metric               1× B200    1× H100    1× A100
Output gen (tok/s)   4,806      2,472      1,049
Total (tok/s)        43,256     22,249     9,444

Latency (mean / P99 in ms):

Metric   1× B200        1× H100         1× A100
TTFT     526 / 1,928    822 / 3,463     1,480 / 10,546
TPOT     6 / 7          12 / 13         29 / 30
ITL      6 / 58         12 / 113        29 / 130
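As a sanity check, the "Total (tok/s)" rows follow from the output rate and the benchmark's 8192-in / 1024-out token mix: each generated token corresponds to (8192 + 1024) / 1024 = 9 tokens processed overall. Using the B200 output rate from the first table (small differences from the reported totals come from averaging and rounding):

```shell
# Total throughput ≈ output throughput × (input_len + output_len) / output_len.
# With the 8192-in / 1024-out workload the multiplier is exactly 9.
echo $(( 4547 * (8192 + 1024) / 1024 ))   # prints 40923, vs 40,926 reported
```

The same ratio holds across the other rows, e.g. 2,381 × 9 ≈ 21,430 on the H100.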

Next steps

Ready to get started?

Create your Lambda Cloud account and launch NVIDIA GPU instances in minutes. Looking for long-term capacity? Talk to our team.