LLM Performance Benchmarks Leaderboard

This leaderboard provides a clear, data-driven comparison of today's leading large language models. We present standardized benchmark results for top contenders like Meta's Llama 4 series, Alibaba's Qwen3, and the latest from DeepSeek, focusing on critical performance metrics that measure everything from coding ability to general knowledge.

LLM Leaderboard: At-a-Glance Comparison

Frequently asked questions

What is the best open-source LLM?

If your primary goal is coding and software development, the benchmark data suggests that Qwen3-235B-A22B is a top performer, scoring an impressive 69.5% on LiveCodeBench. For tasks requiring strong general knowledge and reasoning, the Qwen3 models also lead, with Qwen3-235B-A22B achieving 80.6% on the MMLU Pro benchmark.

However, if you are looking for a more balanced or efficient model, DeepSeek-R1-Distill-Llama-70B offers very competitive performance across the board (51.8% on LiveCodeBench, 71.2% on MMLU Pro) and may be less resource-intensive than the largest models. We recommend using the table above to weigh the performance on the benchmarks that matter most to your project.

How are these benchmarks run?

To ensure a fair and accurate comparison, we run two benchmark variations:
  • Standardized: benchmarks run with the same input parameters for every model. This gives a normalized, apples-to-apples comparison across models.
  • Optimized: benchmarks run with the parameter set that yields the best results for a particular model (this varies from model to model). This allows us to reproduce the benchmark scores reported by model providers.
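The two variations above amount to a shared baseline configuration plus optional per-model overrides. The sketch below illustrates that idea; the parameter values and model entries are hypothetical examples, not the actual settings used for this leaderboard (those are in the linked GitHub README).

```python
# Illustrative sketch of "standardized" vs. "optimized" benchmark runs.
# All parameter values below are hypothetical, not the leaderboard's settings.

# Standardized: one fixed set of sampling parameters shared by every model.
STANDARDIZED_PARAMS = {
    "temperature": 0.0,   # greedy decoding for reproducibility
    "top_p": 1.0,
    "max_tokens": 4096,
}

# Optimized: per-model overrides tuned to reproduce provider-reported scores.
OPTIMIZED_OVERRIDES = {
    "Qwen3-235B-A22B": {"temperature": 0.6, "top_p": 0.95},
    "DeepSeek-R1-Distill-Llama-70B": {"temperature": 0.5},
}

def params_for(model: str, mode: str) -> dict:
    """Return the sampling parameters for a model under a given mode."""
    params = dict(STANDARDIZED_PARAMS)  # start from the shared baseline
    if mode == "optimized":
        # Apply any model-specific overrides; unknown models keep the baseline.
        params.update(OPTIMIZED_OVERRIDES.get(model, {}))
    return params
```

In the standardized mode every model sees identical inputs, so score differences reflect the models themselves; in the optimized mode each model runs at its best-known settings.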

If you are interested in the exact parameters used, see the GitHub README.

What do the different model names mean (e.g., -70B)?

The letters and numbers in model names, like -70B or -A22B, are shorthand for key architectural details. The 'B' almost always stands for billions of parameters (e.g., -70B means the model has 70 billion parameters). A higher parameter count generally leads to better performance but also requires more computational resources. The 'A' prefix, as in -A22B, denotes the number of parameters activated per token in a mixture-of-experts model: Qwen3-235B-A22B has 235 billion total parameters, of which 22 billion are active for any given token. Other suffixes, like -Instruct or -FP8, indicate instruction-tuned variants or the numeric precision of the weights, both of which affect the model's performance and efficiency.
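As a rough illustration of these conventions, the helper below (the function names and the memory heuristic are our own, not from any library) pulls the total and active parameter counts out of a model name and estimates the memory needed just to hold the weights, using the common rule of thumb of 2 bytes per parameter at FP16/BF16 precision.

```python
import re

def parse_param_counts(model_name: str):
    """Extract (total, active) parameter counts in billions from a model name.

    '-70B' -> 70 billion total parameters; '-A22B' -> 22 billion parameters
    activated per token (mixture-of-experts models). Missing values are None.
    """
    total = active = None
    # Match suffixes like -235B, -70B, -A22B, -1.5B anywhere in the name.
    for match in re.finditer(r"-(A?)(\d+(?:\.\d+)?)B\b", model_name):
        value = float(match.group(2))
        if match.group(1) == "A":
            active = value   # activated parameters (MoE)
        else:
            total = value    # total parameters
    return total, active

def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough memory (GB) to hold the weights alone; FP16/BF16 = 2 bytes/param."""
    return params_billion * bytes_per_param
```

For example, a 70B-parameter model at FP16 needs roughly 140 GB just for its weights, before accounting for activations or the KV cache, which is why parameter count is a quick proxy for hardware requirements.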