LLM Performance Benchmarks Leaderboard
LLM Leaderboard: At-a-Glance Comparison
Frequently asked questions
What is the best open-source LLM?
If your primary goal is coding and software development, the benchmark data suggests that Qwen3-235B-A22B is a top performer, scoring an impressive 69.5% on LiveCodeBench. For tasks requiring strong general knowledge and reasoning, the Qwen3 models also lead, with Qwen3-235B-A22B achieving 80.6% on the MMLU Pro benchmark.
However, if you are looking for a more balanced or efficient model, DeepSeek-R1-Distill-Llama-70B offers very competitive performance across the board (51.8% on LiveCodeBench, 71.2% on MMLU Pro) and may be less resource-intensive than the largest models. We recommend using the table above to weigh the performance on the benchmarks that matter most to your project.
How are these benchmarks run?
- Standardized: These benchmarks run every model with the same input parameters, giving a fair, normalized comparison across models.
- Optimized: These benchmarks use per-model parameters chosen to produce the best results for that particular model. This allows us to reproduce the benchmark scores reported by model providers.
If you are interested in the exact parameters used, see the GitHub README.
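The two modes above can be sketched as a simple parameter-selection scheme. This is a hypothetical illustration only: the parameter names (temperature, top_p, max_tokens), their values, and the fallback behavior are assumptions, not the leaderboard's actual configuration.

```python
# Shared settings used for every model in "standardized" runs.
# Values below are illustrative assumptions, not the real benchmark config.
STANDARDIZED_PARAMS = {"temperature": 0.0, "top_p": 1.0, "max_tokens": 4096}

# Per-model overrides used in "optimized" runs (values are made up).
OPTIMIZED_PARAMS = {
    "Qwen3-235B-A22B": {"temperature": 0.6, "top_p": 0.95, "max_tokens": 8192},
    "DeepSeek-R1-Distill-Llama-70B": {"temperature": 0.7, "top_p": 0.9, "max_tokens": 8192},
}

def params_for(model: str, mode: str) -> dict:
    """Return the generation parameters for a given model and benchmark mode."""
    if mode == "standardized":
        # Identical parameters for every model: normalized comparison.
        return STANDARDIZED_PARAMS
    if mode == "optimized":
        # Per-model best settings; fall back to the standardized set
        # if no override is defined for this model.
        return OPTIMIZED_PARAMS.get(model, STANDARDIZED_PARAMS)
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch, a standardized run ignores the model name entirely, while an optimized run looks up model-specific settings, which is why optimized scores can match provider-reported numbers while standardized scores stay directly comparable.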