This quarter, Salesforce AI Research unveiled an enterprise simulation environment to test agents' ability to perform in realistic business scenarios, supported the launch of a new benchmarking tool to measure agents across enterprise use cases, and enhanced Data Cloud with advanced consolidation capabilities that leverage small and large language models to autonomously unify data. From improving data quality to setting new standards in measuring agentic performance, these innovations are fueling product breakthroughs that tackle today's most pressing challenges for CIOs and IT leaders, giving businesses the trust and tools they need to evolve into agentic enterprises – organizations that embrace digital labor and use AI to work alongside humans.
Simulating Enterprise Environments with CRMArena-Pro
Pilots don't learn to fly in a storm; they train in flight simulators that push them through the most extreme challenges. Similarly, surgeons practice high-risk procedures on synthetic models and cadavers before ever operating on a human, and athletes perfect their plays in drills and scrimmages ahead of a big game. In every high-stakes field, skill and consistency are honed not through live action, but through deliberate preparation in a space where failure is a learning tool, not a costly mistake.
AI agents benefit from simulation testing and training too, preparing them to handle the unpredictability of daily business scenarios ahead of deployment. Building on the original CRMArena, which focused on single-turn B2C service tasks, Salesforce AI Research has now unveiled CRMArena-Pro, which tests agent performance in complex, multi-turn, multi-agent scenarios such as sales forecasting, service case triage, and CPQ processes. By using synthetic data, enabling safe API calls to relevant systems, and enforcing strict safeguards to protect PII, CRMArena-Pro creates a rigorous, context-rich simulated enterprise environment that tests not only whether an agent works, but whether it can operate accurately, efficiently, and consistently at scale across enterprise-specific use cases.
Acting much like a digital twin or metaverse of a business, these environments go beyond simple test beds, capturing the full complexity of enterprise operations. Salesforce AI Research is advancing AI agent training with these simulations, enabling businesses to test agents in situations like customer service escalations or supply chain disruptions, before ever going live. By incorporating real-world “noise,” enterprises can better evaluate performance, strengthen resilience against edge cases, and bridge the gap between training and live operations, resulting in AI agents that are not only capable, but consistent, trustworthy, and agentic enterprise-ready.
Measuring Agent Readiness with the Agentic Benchmark for CRM
With new models and updates emerging daily, enterprises face a growing dilemma of which model, or combination of models, is best suited to help power agents in real-world business settings. The answer can’t come from hype cycles or raw size alone; it requires a rigorous way to measure how agents perform within specific business workflows.
Salesforce introduced the new Agentic Benchmark for CRM, the first benchmarking tool designed to evaluate AI agents not on generic capabilities, but in the contexts that matter most to businesses, including customer service, field service, marketing, and sales. The benchmark measures agents across five essential enterprise metrics—accuracy, cost, speed, trust and safety, and sustainability—offering a comprehensive, data-driven assessment of their readiness for real-world deployment.
Sustainability, the newest metric in the agentic measurement tool, is a key marker of an agent’s enterprise readiness. This measure highlights the relative environmental impact of AI systems, which can demand significant computational resources. By aligning model size with the specific level of intelligence required to complete an enterprise-specific task, businesses can minimize their footprint and determine their AI sustainability, all while achieving the caliber of performance they need. By cutting through model overload, the benchmark gives businesses a clear, data-driven way to pair the right models with the right agents, ensuring consistent, trustworthy, and enterprise-grade performance.
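The idea of aligning model size with the intelligence a task actually requires can be sketched as a simple selection rule: from the models that clear the task's accuracy bar, pick the one with the smallest footprint. The model names, accuracy figures, and energy costs below are entirely hypothetical placeholders, not values from the Agentic Benchmark for CRM.

```python
# Hypothetical catalog: (model name, measured task accuracy, relative energy cost per call).
# All values are illustrative, not benchmark results.
MODELS = [
    ("small-lm", 0.86, 1.0),
    ("medium-lm", 0.91, 4.0),
    ("large-lm", 0.93, 12.0),
]

def pick_model(required_accuracy: float):
    """Return the lowest-footprint model that still meets the task's accuracy bar,
    or None if no model qualifies (e.g., escalate to a human or retrain)."""
    eligible = [m for m in MODELS if m[1] >= required_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m[2])
```

Under this rule, a routine task with a modest accuracy requirement would be routed to the smallest model, reserving the largest (and most energy-intensive) model for tasks that genuinely demand it.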
MCP-Eval and MCP-Universe are two additional complementary benchmarks published by Salesforce AI Research this quarter, designed to measure agents at different levels of rigor and to track LLMs as they interact with MCP servers in real-world use cases. MCP-Eval provides scalable, automatic evaluation through synthetic tasks, making it well-suited for testing across a wide range of MCP servers. MCP-Universe, on the other hand, introduces challenging real-world tasks with execution-based evaluators that stress-test agents in complex scenarios and offers an extendable framework for building and evaluating agents. Together, they form a powerful toolkit: MCP-Eval for broad, initial assessments, and MCP-Universe for deeper diagnosis and debugging.
This dual approach is especially critical for enterprises, as the research found most state-of-the-art LLMs on the market today still face key limitations that hold them back from enterprise-grade performance — from long-context challenges, where models lose track of information in complex inputs, to unknown-tool challenges, where they fail to adapt seamlessly to unfamiliar systems. By leveraging MCP-Universe and MCP-Eval, enterprises can gain a clear view of where agents break down and refine their frameworks or tool integrations accordingly. And with a platform that layers in context, enhanced reasoning, and trust guardrails, organizations can move beyond DIY experimentation to deliver agents ready for real-world business impact.
Consolidating Data with Account Matching
At the heart of reliable, scalable AI agent performance is high-quality, unified data that enables context-aware, accurate, and compliant decision-making. Unified data allows agents to understand context, follow business rules, and make decisions that align with organizational goals. However, this has long been a challenge for businesses, as enterprise data is rarely clean or well-organized. Customer records are often duplicated across departments, fields are incomplete, and inconsistent formatting and naming conventions make it difficult to reconcile data across systems.
To tackle this, the Salesforce AI Research and product teams partnered to fine-tune the large and small language models that power Account Matching, a capability that autonomously identifies and unifies accounts across scattered, inconsistent datasets. Instead of treating "The Example Company, Inc." and "Example Co." as separate entities, the system can now use AI to recognize them as the same and consolidate them into a single, authoritative record. Unlike static, rule-based systems that require heavy manual setup, Account Matching reconciles millions of records in real time with measurable accuracy improvements.
These are the kinds of breakthroughs driving real ROI for customers today. In just the first month, one customer’s proprietary tool that utilizes Account Matching unified more than a million accounts with a 95% match success rate, reducing average handling time by 30 minutes. Powered by fine-tuned small and large language models and identity resolution rules (account name plus website, address, or phone number), the tool automatically matches accounts across business units, surfaces a flow in each org for sellers to connect, and routes only the top 5% of complex cases to humans. By helping sellers quickly find counterparts covering the same or similar accounts, the solution eliminates duplicative work, accelerates sales cycles, and prevents missed opportunities. Best of all, the entire solution was implemented without the need for hard coding, lowering costs and dramatically improving efficiency.
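The identity resolution rules described above (account name plus a corroborating field such as website or phone) can be illustrated with a minimal matching sketch. The normalization steps, suffix list, and record fields here are assumptions for illustration; the production system relies on fine-tuned language models rather than these simple heuristics.

```python
import re

# Illustrative list of legal suffixes and articles to strip during name normalization.
SUFFIXES = {"the", "inc", "incorporated", "co", "company", "corp", "corporation", "llc", "ltd"}

def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and remove common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def normalize_domain(url: str) -> str:
    """Reduce a website field to a bare domain for comparison."""
    return re.sub(r"^(https?://)?(www\.)?", "", url.lower()).rstrip("/")

def is_match(a: dict, b: dict) -> bool:
    """Match when normalized names agree AND a corroborating field (website) agrees."""
    if normalize_name(a["name"]) != normalize_name(b["name"]):
        return False
    site_a, site_b = a.get("website", ""), b.get("website", "")
    return bool(site_a and normalize_domain(site_a) == normalize_domain(site_b))

rec1 = {"name": "The Example Company, Inc.", "website": "https://www.example.com"}
rec2 = {"name": "Example Co.", "website": "example.com"}
```

Here both names normalize to "example" and both websites reduce to "example.com", so the two records would be consolidated into one; a record with a different name or domain would not match, and ambiguous cases would be the kind routed to humans.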
With Account Matching, businesses have access to clean, unified data that powers AI agents with confidence, enabling smarter automation, richer personalization, and faster decisions at scale.