MLCommons Adds Reasoning, LLM, And Speech Tests To MLPerf Inference v5.1

Open Source MLPerf Inference v5.1 Delivers Major AI Performance Gains

MLCommons has released the MLPerf Inference v5.1 benchmarks, with record participation, new AI tests, and strong gains in system performance.

MLCommons has announced the release of the MLPerf Inference v5.1 benchmark results, the latest edition of its open-source, peer-reviewed suite for evaluating AI system performance across a range of workloads. Designed to be architecture-neutral and reproducible, the suite provides a fair, transparent basis for comparing AI system performance and energy efficiency.

This round set a participation record, with 27 organisations submitting results on systems featuring five newly available processors and updated AI software frameworks. In some cases, results outperformed the best from version 5.0, released just six months ago, by up to 50 per cent. The submissions also included the first-ever heterogeneous system, which used software to balance inference workloads across different accelerators.

The Llama 2 70B test remained the most widely used, with 24 participants. Version 5.1 also introduced three new benchmarks: DeepSeek-R1, the first reasoning model in the suite; Llama 3.1 8B, a contemporary LLM with an expanded context length; and Whisper Large V3, an open-source speech recognition model with multilingual capabilities.
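
For readers curious how these workloads are exercised in practice, MLPerf Inference submissions are driven by the suite's open-source LoadGen component, which issues queries to a system under test and measures its responses. The snippet below is a minimal sketch of such a harness, assuming the mlperf_loadgen Python bindings from the MLCommons inference repository; the dataset sizes and the run_model() call are illustrative placeholders, not part of any official reference implementation.

```python
# Minimal MLPerf-style harness sketch (assumes the mlperf_loadgen Python
# bindings from the MLCommons inference repository; run_model() is a placeholder).
import array
import mlperf_loadgen as lg

SAMPLE_COUNT = 1024       # hypothetical dataset size
PERF_SAMPLE_COUNT = 256   # samples LoadGen keeps resident during the run

_keep_alive = []  # hold response buffers until LoadGen has copied them


def run_model(sample_index):
    # Placeholder for the real backend call (e.g. a Whisper or Llama server).
    return b"output"


def issue_queries(query_samples):
    # LoadGen hands us a batch of queries; answer each and report completion.
    responses = []
    for qs in query_samples:
        output = run_model(qs.index)
        buf = array.array("B", output)
        _keep_alive.append(buf)
        addr, _ = buf.buffer_info()
        responses.append(lg.QuerySampleResponse(qs.id, addr, len(buf)))
    lg.QuerySamplesComplete(responses)


def flush_queries():
    pass


def load_samples(sample_indices):
    pass  # load the requested samples into memory


def unload_samples(sample_indices):
    pass  # release them again


settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline   # throughput-oriented scenario
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(SAMPLE_COUNT, PERF_SAMPLE_COUNT, load_samples, unload_samples)
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```

In an actual submission, the same harness shape is paired with the workload's dataset, accuracy checks, and the scenario and latency settings prescribed by the benchmark rules.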

Scott Wasson, Director of Product Management at MLCommons, said, “The pace of innovation in AI is breathtaking. The MLPerf Inference working group has aggressively built new benchmarks to keep pace with this progress. As a result, Inference 5.1 features several new benchmark tests, including DeepSeek-R1 with reasoning, and interactive scenarios with tighter latency requirements for some LLM-based tests. Meanwhile, the submitters to MLPerf Inference 5.1 yet again have produced results demonstrating substantial performance gains over prior rounds.”

Miro Hodak, MLPerf Inference working group co-chair, added, “Reasoning models are an emerging and important area for AI models, with their own unique pattern of processing. It’s important to have real data to understand how reasoning models perform on existing and new systems, and MLCommons is stepping up to provide that data. And it’s equally important to thoroughly stress-test the current systems so that we learn their limits; DeepSeek R1 increases the difficulty level of the benchmark suite, giving us new and valuable information.”

The results also highlighted five new accelerators benchmarked in this round: AMD Instinct MI355X, Intel Arc Pro B60 48GB Turbo, NVIDIA GB300, NVIDIA RTX 4000 Ada-PCIe-20GB, and NVIDIA RTX Pro 6000 Blackwell Server Edition.

MLCommons' benchmarking work began with MLPerf in 2018, and the organisation now counts over 125 members across academia, industry, and civil society, reinforcing its role as the global leader in open, transparent AI benchmarking.
