LLM performance up 15.4%: MLPerf v5.1 confirms NVIDIA HGX B200 is ready for enterprise inference

Inference at scale is still too slow. Large models often stall under real-world load, burning time, compute and user trust. That’s the problem we set out to solve.
Our MLPerf Inference v5.1 results show Lambda's 1-Click Clusters powered by NVIDIA HGX B200 delivered up to 15.4% performance gains over the previous round's best results. These benchmarks highlight how our 1-Click Clusters unlock best-in-class inference performance for enterprise production workloads.
Quick Insights
- We tested three models: Llama 2 70B, Llama 3.1 405B and Stable Diffusion XL, for six submissions in total.
- Each model was benchmarked in both the Offline and Server scenarios.
- Five of the six submissions delivered higher performance than the prior round's best results, with gains ranging from +1.6% to +15.4%.
- Every submission met MLCommons' MLPerf accuracy and compliance checks for their respective tasks.
- All benchmarks were run on a single NVIDIA HGX B200 system (8x B200 GPUs, 180 GB VRAM each), using NVIDIA TensorRT 10.11 on NVIDIA CUDA 12.9 and Ubuntu 24.04.
Results at a glance
The table below compares our v5.1 results against the best published v5.0 numbers for the same models and scenarios.
| Model | Scenario | v5.1 Result | v5.0 Best | Δ vs v5.0 |
|---|---|---|---|---|
| llama2-70b-99 | Offline | 102725.00 Tokens/s | 98858.00 Tokens/s | +3.9% |
| llama2-70b-99 | Server | 99993.90 Tokens/s | 98443.30 Tokens/s | +1.6% |
| llama3.1-405b | Offline | 1648.60 Tokens/s | 1538.17 Tokens/s | +7.2% |
| llama3.1-405b | Server | 1246.79 Tokens/s | 1080.31 Tokens/s | +15.4% |
| stable-diffusion-xl | Offline | 32.57 Samples/s | 30.38 Samples/s | +7.2% |
| stable-diffusion-xl | Server | 28.46 Queries/s | 28.92 Queries/s | -1.6% |
- Llama 3.1 405B showed the largest Server-side gain (+15.4%).
- On 8x B200, both Llama 3.1 405B and Llama 2 70B delivered 3-4x higher performance than the best 8x H200 results.
- Stable Diffusion XL (SDXL) improved +7.2% in Offline throughput. The Server result dipped -1.6% versus the prior best but still matched the median, and was 1.07x faster than the best 8x H200 result.
- The latest NVIDIA silicon, combined with NVIDIA's updated software stack (TensorRT 10.11 and CUDA 12.9 with improved FP4 support), produced 1.6-15.4% gains across models.
System Under Test (SUT)
MLPerf benchmarks were evaluated in two key inference scenarios:
- Offline: measures peak throughput at saturation, relevant for batch jobs and bulk generation.
- Server: enforces strict latency caps under load, simulating real-world serving conditions. Gains here directly improve user experience for real-time applications.
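To make the two scenarios concrete, here is a minimal sketch assuming the MLCommons LoadGen Python bindings (mlperf_loadgen). The callbacks and numeric targets are placeholders rather than our submission harness, and field names can differ between LoadGen releases.

```python
import mlperf_loadgen as lg

def issue_queries(query_samples):
    # Placeholder: run inference for each sample, then report completions.
    responses = [lg.QuerySampleResponse(qs.id, 0, 0) for qs in query_samples]
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

def load_samples(sample_indices):
    pass  # placeholder: stage samples into memory

def unload_samples(sample_indices):
    pass  # placeholder: release staged samples

settings = lg.TestSettings()
settings.mode = lg.TestMode.PerformanceOnly

# Offline: LoadGen issues the whole workload at once and measures peak throughput.
settings.scenario = lg.TestScenario.Offline
settings.offline_expected_qps = 100_000  # placeholder value used to size the run

# Server: LoadGen issues queries with Poisson arrivals and enforces a latency bound.
# settings.scenario = lg.TestScenario.Server
# settings.server_target_qps = 99_000
# settings.server_target_latency_ns = 130_000_000  # e.g. a 130 ms latency cap

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(24576, 1024, load_samples, unload_samples)
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```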
Percentage gains are reported as (v5.1 ÷ prior best – 1) × 100.
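As a worked example, the snippet below recomputes the Δ column directly from the table values above using that formula.

```python
# Recompute the Δ column: (v5.1 ÷ prior best − 1) × 100
results = {
    ("llama2-70b-99", "Offline"): (102725.00, 98858.00),
    ("llama2-70b-99", "Server"): (99993.90, 98443.30),
    ("llama3.1-405b", "Offline"): (1648.60, 1538.17),
    ("llama3.1-405b", "Server"): (1246.79, 1080.31),
    ("stable-diffusion-xl", "Offline"): (32.57, 30.38),
    ("stable-diffusion-xl", "Server"): (28.46, 28.92),
}
for (model, scenario), (v51, v50_best) in results.items():
    delta = (v51 / v50_best - 1) * 100
    print(f"{model:<20} {scenario:<8} {delta:+.1f}%")  # e.g. llama2-70b-99 Offline +3.9%
```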
All benchmarks were run on the same system with an identical configuration:
- System: NVIDIA HGX B200 (8xB200-180GB, TensorRT)
- Nodes: 1
- GPUs per node: 8
- GPU model: NVIDIA HGX B200-180GB
- Host CPU: Dual Intel Xeon Platinum 8750 with 56 cores
- Framework: TensorRT 10.11, CUDA 12.9
Model specifications
Llama 3.1 405B and Llama 2 70B
- Precision: FP4 weights
- Scenarios: Offline and Server
- Metrics: Tokens per second under MLPerf accuracy and Server latency constraints
Stable Diffusion XL
- Precision: UNet FP8, CLIP encoders FP32, VAE FP32
- Scenarios: Offline and Server
- Metrics: Samples per second in Offline, Queries per second in Server
MLPerf Inference v5.1: our testbed
We partnered with NVIDIA to build custom TensorRT engines for each model and tuned per-scenario runtime parameters on NVIDIA HGX B200. The main areas of focus for testing were:
- Calibrating batch size, request count and concurrency to meet Server latency targets while maximizing throughput (a simplified sweep sketch follows this list).
- Verifying streaming tokenization and scheduler settings for LLM Server runs to prevent tail-latency spikes under high QPS.
- Validating engine tactics on TensorRT 10.11 with CUDA 12.9, confirming NVIDIA NVLink bandwidth and memory configuration on HGX B200.
- For SDXL, evaluating UNet pipeline tiling and host-device staging effects at FP8, while keeping CLIP and VAE in FP32 for accuracy and safety.
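As a rough illustration of the calibration loop in the first bullet, the sketch below sweeps concurrency against a p99 latency budget. run_server_benchmark is a hypothetical stand-in for a real benchmarking harness; it is not part of TensorRT, MLPerf, or Lambda tooling.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    throughput: float      # tokens/s or queries/s measured for the run
    p99_latency_ms: float  # observed 99th-percentile latency

def run_server_benchmark(concurrency: int, batch_size: int) -> RunResult:
    """Hypothetical stand-in: launch a Server-scenario run at the given
    concurrency and batch size, then return measured throughput and p99 latency."""
    raise NotImplementedError

def calibrate(latency_budget_ms: float, batch_size: int, max_concurrency: int = 1024):
    """Find the highest concurrency whose p99 latency stays under the budget."""
    best = None
    concurrency = 8
    while concurrency <= max_concurrency:
        result = run_server_benchmark(concurrency, batch_size)
        if result.p99_latency_ms > latency_budget_ms:
            break  # latency cap violated; keep the previous passing setting
        best = (concurrency, result)
        concurrency *= 2  # double concurrency until the latency cap is hit
    return best
```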
What this means for Enterprise AI
Lambda builds supercomputers in the cloud.
Across six MLPerf submissions, Lambda placed first or second in its categories while competing among a dozen top vendors. This validates our infrastructure tuning and reinforces production-readiness for enterprise workloads.
Our 1-Click Clusters scale from 16 to 1,536 GPUs with flexible rental terms from weekly to multi-year reservations. No contracts are required for POCs, and there are clear scaling paths to production through managed Kubernetes or Slurm orchestration.
For enterprises validating AI use cases before scaling to thousands of users, these benchmarks demonstrate production-ready infrastructure.
For startups iterating on model development, they show the compute performance available for short-term bursts without long-term commitments.
Lambda GPU Cloud is ready. Is your stack?
Lambda's 1-Click Clusters are ready for enterprise inference workloads. The question is, can your current infrastructure keep up?
Try it for yourself.
Deploy a 1-Click Cluster with 16 to 1,536+ NVIDIA B200 GPUs and run your own benchmarks.