Salesforce launches MCP-Universe, an open-source benchmark revealing that even top LLMs like GPT-5 fail at more than half of real-world enterprise tasks, underscoring the need for better frameworks and tooling.
Salesforce AI Research has released MCP-Universe, an open-source benchmark designed to test how large language models (LLMs) perform when interacting with Model Context Protocol (MCP) servers in real-world scenarios. The results highlight a significant gap between lab-based benchmarks and enterprise-grade orchestration tasks.
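For readers unfamiliar with the protocol, the sketch below shows the basic shape of the interaction MCP-Universe exercises: a client connects to an MCP server, discovers the tools it exposes and invokes one, with an LLM normally deciding which calls to make. It uses the official `mcp` Python SDK; the reference filesystem server and the `/tmp` path are illustrative choices, not part of the benchmark.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Illustrative server: the reference filesystem MCP server, scoped to /tmp.
server_params = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],
)

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the server's tools; in an agent loop these schemas
            # are handed to the LLM so it can choose which calls to make.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Invoke one tool, as the model would mid-task.
            result = await session.call_tool("list_directory", {"path": "/tmp"})
            print(result.content)

asyncio.run(main())
```

A benchmark task chains many such calls, and it is over those longer sequences that the problems Salesforce reports begin to surface.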
In its first round of testing, Salesforce found that even frontier models such as OpenAI’s GPT-5 failed at more than half of the enterprise tasks evaluated. GPT-5 still outperformed rivals overall, especially in financial analysis, while Grok-4 led in browser automation and Claude-4.0 Sonnet placed third. Among open-source contenders, GLM-4.5 delivered the strongest results.
The benchmark revealed two major challenges:
- Long-context handling: models lose consistency when working with extended or complex inputs.
- Unknown-tool adaptation: models struggle to use unfamiliar systems as flexibly as humans do.
MCP-Universe was built to reflect six enterprise domains—location navigation, repository management, financial analysis, 3D design, browser automation and web search—across 11 MCP servers with 231 tasks. Unlike conventional benchmarks, it uses execution-based evaluation through format, static and dynamic checks, offering a more realistic measure of performance with live data.
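As an illustration of what execution-based evaluation means in practice, here is a minimal sketch of the three check types. The function names, answer fields and tolerance below are assumptions for illustration, not MCP-Universe's actual evaluator API.

```python
import json
import re
from typing import Callable

# Hypothetical evaluators illustrating the three check types; the real
# MCP-Universe evaluators are structured differently.

def format_check(answer: str) -> bool:
    """Format check: is the final answer well-formed and does it carry
    the fields the task demands?"""
    try:
        payload = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return {"ticker", "price"} <= payload.keys()

def static_check(answer: str) -> bool:
    """Static check: compare against ground truth that never changes,
    e.g. a repository's creation year."""
    return re.search(r"\b2014\b", answer) is not None

def dynamic_check(answer: str, fetch_live_price: Callable[[str], float]) -> bool:
    """Dynamic check: recompute the expected value from live data at
    grading time, since quantities like stock prices drift between runs."""
    payload = json.loads(answer)
    live = fetch_live_price(payload["ticker"])
    return abs(payload["price"] - live) / live < 0.01  # within 1% of live
```

The dynamic variant is what separates this design from static question-answer benchmarks: the grader consults live sources at evaluation time rather than trusting a frozen answer key.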
The open-source benchmark builds on Salesforce's earlier MCPEval and rivals efforts such as MCP-Radar and MCPWorld. Its open release allows enterprises, developers and researchers to measure, compare and improve LLM reliability in environments that mirror real-world use cases.
Salesforce emphasises that current frontier models are not enterprise-ready for orchestration at scale. By making MCP-Universe openly available, it offers a transparent, extensible testbed that steers enterprises toward stronger frameworks, tooling and multi-model strategies, rather than reliance on a single model for complex tasks.
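The multi-model recommendation is straightforward to operationalise: route each task domain to whichever model benchmarks strongest there. A toy sketch using the per-domain leaders reported above; the model identifiers and domain keys are illustrative placeholders, not exact API names.

```python
# Toy per-domain router reflecting the leaders the article reports; the
# model identifiers are illustrative placeholders, not exact API names.
DOMAIN_LEADERS = {
    "financial_analysis": "gpt-5",   # strongest overall and in finance
    "browser_automation": "grok-4",  # led browser-automation tasks
}

def pick_model(domain: str, default: str = "gpt-5") -> str:
    """Route a task to the model that benchmarks strongest in its domain,
    falling back to the overall leader elsewhere."""
    return DOMAIN_LEADERS.get(domain, default)

assert pick_model("browser_automation") == "grok-4"
assert pick_model("3d_design") == "gpt-5"  # fallback to overall leader
```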