How to deploy Step 3.7 Flash on Lambda

TL;DR: token throughput

vLLM with MTP speculative decoding, workload 8192 in / 2048 out tokens at 32 concurrent requests.

Hardware Output gen Per-user gen Total throughput TTFT (mean) ITL (mean)
4× NVIDIA B200 GPUs 1,597 tok/s ~53 tok/s 7,982 tok/s 2,183 ms 52.7 ms
8× NVIDIA H100 GPUs 1,245 tok/s ~42 tok/s 6,224 tok/s 2,995 ms 67.2 ms
8× NVIDIA A100 GPUs 474 tok/s ~17 tok/s 2,370 tok/s 13,646 ms 167.5 ms

Benchmark command

(8192 in / 2048 out tokens, 32 parallel requests, 512 prompts.)

vllm bench serve \
  --backend openai-chat \
  --model stepfun-ai/Step-3.7-Flash \
  --served-model-name step3p7-flash \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 2048 \
  --num-prompts 512 \
  --max-concurrency 32 \
  --endpoint /v1/chat/completions

See Benchmarking Step 3.7 Flash for the full results.

Background

Step 3.7 Flash is StepFun AI's 198-billion-parameter vision-language model, released open-weight under Apache 2.0. The sparse Mixture-of-Experts design activates only about 11 billion parameters per token, reads both text and images, handles a 256K-token context, and lets you set reasoning effort to low, medium, or high.

The language model carries over from Step 3.5 Flash almost unchanged, including the Multi-Token Prediction heads that vLLM uses for speculative decoding and that account for much of the throughput below. The real change in 3.7 is vision. StepFun added a 1.8-billion-parameter Perception Encoder that takes images directly and trained the model to work with them interactively, so it can crop or zoom into a region and mark it up from a short Python snippet before answering.

On StepFun's own benchmarks, the model is strongest at agentic and tool-use work and at image-grounded questions, scoring 67.1 on ClawEval 1.1 and 79.2 on SimpleVQA-Search (both first in their tests), 95.3 on V*, and 56.3 on SWE-Bench Pro. These are the vendor's numbers, so treat them as a starting point until third parties reproduce them.

Model specifications

Overview

  • Name: Step 3.7 Flash
  • Author: StepFun AI
  • Architecture: Sparse MoE vision-language model (step3p7; language backbone step3p5)
  • License: Apache 2.0

Specifications

  • Total parameters: 198B, 196B language backbone + 1.8B Perception Encoder vision tower (~11B active per token)
  • Experts: 288 routed + 1 shared, top-8, sigmoid router
  • Layers / hidden size: 45 / 4096
  • Attention: S3F1 hybrid — 3× sliding-window (W=512) + 1× full GQA-8
  • Context window: 256K
  • Modality: image-text-to-text (vision-language)
  • Reasoning modes: low / medium / high

Hardware requirements

  • Minimal deployment:
    • 4× NVIDIA B200 GPUs (--tp-size 4)
    • 8× NVIDIA H100 GPUs (--tp-size 8)
    • 8× NVIDIA A100 GPUs (--tp-size 8)

Deployment and benchmarking

Deploying Step 3.7 Flash

Serve Step 3.7 Flash with vLLM, matching the tensor-parallel size to the number of GPUs (4 for B200, 8 for H100 or A100).

  1. Launch a 4× NVIDIA B200 GPUs, 8× NVIDIA H100 GPUs, or 8× NVIDIA A100 GPUs instance from the Lambda Cloud Console on the GPU Base 24.04 image.
  2. Connect over SSH or the JupyterLab terminal (see Connecting to an instance).
  3. Start the server:

vLLM

docker run -d --gpus all \
    --ipc=host -p 8000:8000 \
    -e HF_HOME=/root/.cache/huggingface \
    -e VLLM_USE_DEEP_GEMM=0 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:stepfun37 \
    --model stepfun-ai/Step-3.7-Flash \
    --served-model-name step3p7-flash \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len auto \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

A few flags are specific to this model:

  • --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' turns on its built-in Multi-Token Prediction heads for speculative decoding, which is most of where the throughput above comes from.
  • --reasoning-parser step3p5 and --tool-call-parser step3p5 (with --enable-auto-tool-choice) handle the model's reasoning output and OpenAI-style tool calls.
  • --disable-cascade-attn is required for the S3F1 hybrid attention, and --enable-expert-parallel spreads the 288 experts across the GPUs.

Verify the server

curl -X GET http://localhost:8000/v1/models \
  -H "Content-Type: application/json"

You should see step3p7-flash listed in the response.

Benchmarking Step 3.7 Flash

Workload: 8192 input / 2048 output tokens, 512 prompts, 32 concurrent requests, vLLM with MTP speculative decoding.

H100 and A100 figures were captured from a live vLLM server running in eager mode (CUDA graphs disabled). Throughput with CUDA graphs enabled is expected to be higher.

Token throughput:

Metric 4× NVIDIA B200 GPUs 8× NVIDIA H100 GPUs 8× NVIDIA A100 GPUs
Output generation (tok/s) 1,597 1,245 474
Total, input & output (tok/s) 7,982 6,224 2,370
Per-user generation (tok/s) ~53 ~42 ~17

Latency (Mean / P99 in ms):

Metric 4× NVIDIA B200 GPUs 8× NVIDIA H100 GPUs 8× NVIDIA A100 GPUs
Time to first token 2,183 / 25,629 2,995 / 35,018 13,646 / 99,982
Time per output token 18.8 / 26.6 24.0 / 33.9 60.1 / 84.2
Inter-token latency 52.7 / 182.3 67.2 / 250.3 167.5 / 215.7

Next steps

Upstream

Downstream

Use as a Claude Code backend

Use your self-hosted Step 3.7 Flash instead of Anthropic's API for local development. Replace <NODE_IP> with the IP of the node running the vLLM server:

export ANTHROPIC_BASE_URL="http://<NODE_IP>:8000"
export ANTHROPIC_API_KEY="dummy"

export ANTHROPIC_MODEL="step3p7-flash"
export ANTHROPIC_DEFAULT_SONNET_MODEL="step3p7-flash"
export ANTHROPIC_DEFAULT_OPUS_MODEL="step3p7-flash"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="step3p7-flash"

export DISABLE_TELEMETRY=1

claude

Ready to get started?

Create your Lambda Cloud account and launch NVIDIA GPU instances in minutes. Looking for long-term capacity? Talk to our team.