LLM evaluation platform Arena launches Agent Mode to benchmark GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro on multi-step tasks · Digg

The platform measures task success, steerability, and tool hallucination.

Original post

Anastasios Nikolas Angelopoulos#1147

Arena.ai@arena · #370 in Tech

Introducing Agent Mode: Agentic AI is now measured in the Arena.

Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more.

It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions.

Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

9:00 AM · Jun 4, 2026 · 122.5K Views

Sentiment

Users are reacting to Arena's Agent Arena leaderboard, praising its credible AI evaluations while criticizing Grok's low ranking and associated bugs.

Positive: 93.2% · Negative: 6.8% (48 comments with sentiment)

Posts from X

Most Activity

Views 313.6K · Bookmarks 300 · Likes 1.1K · Retweets 135 · Replies 62

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

8d · 313.6K views · 1.1K likes · 300 bookmarks

Lisan al Gaib@scaling01

this is brutal

8d101.8K529134

Anjney Midha@AnjneyMidha

Interesting

One of the hardest unsolved problems in frontier systems today is scalable evaluation of agent capabilities

This approach is SOTA, afaict

8d25.5K11475

Agentic AI is now evaluated in the Arena with Agent Mode and measured with Agent Arena.

Founding Engineer Matt and Product Lead Ted show you Agent Mode in action: deep research, complex bash operations, whatever you throw at it. Every session contributes to the Agent Arena leaderboard.

00:00 What is Agent Mode
00:16 The task: explain a research paper PDF
00:38 Watching the agent work
01:47 The workspace panel
02:13 Exploring the generated site
03:18 Voting on agent tasks
03:54 Follow-up: explain like I'm five
04:58 How voting feeds the leaderboard

7d8.4K8116

Anastasios Nikolas Angelopoulos@ml_angelopoulos

Agent Arena gives every model access to a Claude-Code-like harness and a computer. Our users went nuts, generating millions of real traces per week. We used this data to build the first large-scale benchmark of agent usefulness in the wild.

We analyze agents by collecting many axes of feedback, explicit and implicit, including:
- Confirmed success: user marks task as success or failure.
- Praise vs complaint: user praises or complains about agent output.
- Steerability: agent responds correctly to user requests.
- Bash recovery: time taken to recover from making an error in bash.
- Tool hallucination: agent hallucinates tool that does not exist.
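As a rough illustration of how two of these signals might be mined from a session trace, here is a minimal sketch. The `ToolCall` record, the `AVAILABLE_TOOLS` set, and both function names are hypothetical; the posts do not describe Arena's trace schema or its exact definitions, so treat this as one plausible reading rather than the actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str   # tool name the agent requested (hypothetical field)
    ok: bool    # whether the call succeeded (hypothetical field)
    ts: float   # seconds since session start (hypothetical field)

# Assumed tool inventory for the sandbox; the real list is not given in the posts.
AVAILABLE_TOOLS = {"web_search", "bash", "write_file", "generate_image"}

def tool_hallucination_rate(calls):
    """Fraction of tool calls that name a tool outside AVAILABLE_TOOLS."""
    if not calls:
        return 0.0
    bad = sum(1 for c in calls if c.tool not in AVAILABLE_TOOLS)
    return bad / len(calls)

def bash_recovery_times(calls):
    """Seconds from the first failed bash call in a failure streak
    to the next successful bash call."""
    times = []
    failed_at = None
    for c in calls:
        if c.tool != "bash":
            continue
        if not c.ok and failed_at is None:
            failed_at = c.ts            # start of a failure streak
        elif c.ok and failed_at is not None:
            times.append(c.ts - failed_at)
            failed_at = None            # streak resolved
    return times

trace = [
    ToolCall("bash", True, 0.0),
    ToolCall("bash", False, 5.0),
    ToolCall("read_mind", False, 6.0),  # not a real tool: counts as hallucinated
    ToolCall("bash", False, 8.0),
    ToolCall("bash", True, 12.0),
]
rate = tool_hallucination_rate(trace)
recoveries = bash_recovery_times(trace)
```

In this toy trace, one of five calls names a nonexistent tool, and the single bash failure streak lasts from 5.0s to 12.0s.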

The longest tasks take multiple days and hundreds of turns, with nearly a thousand tool calls in a session (!), and give us a huge firehose of real-world agent traces to compute these signals. Our users are doing things like:
- Building full-stack applications with backends and databases
- Financial models involving market research pulled from the internet and .xlsx artifacts
- Workflow automation, e.g. scraping all real-estate listings in an area and doing detailed data analysis on price as a function of parcel size and sqft
- Deep research and scientific documents, pulling together .ppt presentations from careful research both from websites and academic publications

By meeting our users where they work, Agent Arena can speak to the boundary between the possible and impossible with different agents. The leaderboards we calculate are based on a novel causal inference approach that looks at each subcomponent of the agent (orchestrator and harness) as a treatment, and calculates treatment effects for each. Soon we will release more on the harness side, sharing what effect different harnesses have on agent capabilities.

@arena has gone far beyond a human preference benchmark and the voting mechanism. We are building signals of real post-deployment user value, and pushing the limits of evaluation.

If you are interested in shaping the future of evaluation as a collaborator or colleague, please reach out. We’d love to hear from you!

8d4.8K6312

Ion Stoica@istoica05

Super excited to launch Agent Mode on Arena. This is a huge milestone. Real agentic work has been hard to benchmark… until now. See how top frontier models handle multi-step workflows with search, bash, and file writing. Come break things, run deep research, and see who takes the crown.

8d7K906

Anastasios Nikolas Angelopoulos@ml_angelopoulos

In case you didn’t notice: Agent Arena doesn’t have a voting mechanism. So how do we calculate the scores?

The answer is causal inference. Agents are multi-stage systems where the orchestrator and harness work together to produce the end result. We developed a method called causal tracing that looks at each possible orchestrator and harness component as a treatment, and evaluates the treatment effect with respect to a randomized baseline on all the signals mined from traces. This allows us to independently evaluate each subcomponent, track how the effects change as new options are added, and combine many signals into one coherent leaderboard.

The leaderboard you see is the net effect of the orchestrator as a treatment when looking across a basket of implicit and explicit success signals, including:
- Confirmed success: user marks task as success or failure.
- User affirmation: user praises or complains about agent output.
- Steerability: agent responds correctly to user requests.
- Bash recovery: time taken to recover from making an error in bash.
- Tool hallucination: agent hallucinates tool that does not exist.
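The idea of treating the orchestrator as a treatment can be sketched with a plain difference-in-means estimator. This is not Arena's causal tracing method, whose details are not given in these posts. It assumes sessions are randomly assigned to models, uses the pooled mean over all sessions as a stand-in for the "randomized baseline", and combines signals with an equal-weight average. The signal names, their sign conventions, and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# +1 if higher is better, -1 if lower is better (assumed orientation).
SIGNAL_SIGN = {"success": 1, "steerable": 1, "hallucination": -1}

def treatment_effects(sessions):
    """sessions: list of (model, {signal: value}) under random assignment.
    Returns, per model and signal, the oriented gap to the pooled mean."""
    by_model = defaultdict(lambda: defaultdict(list))
    pooled = defaultdict(list)
    for model, signals in sessions:
        for name, value in signals.items():
            by_model[model][name].append(value)
            pooled[name].append(value)
    baseline = {name: mean(vals) for name, vals in pooled.items()}
    return {
        model: {
            name: SIGNAL_SIGN[name] * (mean(vals) - baseline[name])
            for name, vals in sigs.items()
        }
        for model, sigs in by_model.items()
    }

def rank(effects):
    """Order models by the equal-weight average of their per-signal effects."""
    score = {m: mean(list(e.values())) for m, e in effects.items()}
    return sorted(score, key=score.get, reverse=True)

# Invented toy data, for illustration only.
sessions = [
    ("A", {"success": 1, "hallucination": 0}),
    ("A", {"success": 1, "hallucination": 0}),
    ("B", {"success": 0, "hallucination": 1}),
    ("B", {"success": 1, "hallucination": 1}),
]
effects = treatment_effects(sessions)
order = rank(effects)
```

Random assignment is what lets a simple gap in means be read as a treatment effect. Without it, differences between models could reflect which users or tasks each model happened to receive.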

Human preference is now only one of the many signals that Arena can measure. All signals are based on real-world usage by a huge population of tens of millions of users.

4d7.3K4512

As we launch Agent Mode on Arena today, we want to celebrate the community that brought us here.

Battle Mode - where it all started - just passed 50 million votes.

Thank you.

8d7.8K776

Agent performance is not one-dimensional.

The aggregate ranking combines multiple signals: task success, user praise vs. complaints, steerability, bash recovery, and tool hallucination.

Top models win in different ways: some complete tasks more reliably, some recover better from errors, and some are easier for users to steer.

8d3.2K487

What are people actually using agents for?

We analyzed the task distribution in Agent Arena across a 7-day window: 160K real user tasks spanning coding, debugging, research, document creation, frontend development, file analysis, and long multi-step workflows.

The largest categories were:
- Code writing (17.5%)
- Research and lookup (10.8%)
- Planning and brainstorming (10.6%)
- Multimodal image/video work (10.2%)
- Document creation (9.1%)
- Code debugging (8.9%)
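For a sense of scale, the percentages above can be turned into approximate task counts. This sketch assumes the "160K" figure is exactly 160,000, which is a rounding assumption, so the counts are estimates. Shares are stored in tenths of a percent to keep the arithmetic in integers.

```python
TOTAL_TASKS = 160_000  # assumed exact value of the quoted "160K"

# Shares in tenths of a percent (17.5% -> 175), copied from the list above.
shares_permille = {
    "Code writing": 175,
    "Research and lookup": 108,
    "Planning and brainstorming": 106,
    "Multimodal image/video work": 102,
    "Document creation": 91,
    "Code debugging": 89,
}

# Approximate number of tasks in each listed category.
counts = {name: TOTAL_TASKS * p // 1000 for name, p in shares_permille.items()}

# Share left over for all categories not listed, in tenths of a percent.
listed_permille = sum(shares_permille.values())
other_permille = 1000 - listed_permille

for name, n in counts.items():
    print(f"{name}: ~{n:,} tasks")
print(f"Other categories: {other_permille / 10:.1f}%")
```

The six listed categories cover 67.1% of tasks, so roughly a third of usage falls outside them.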

Agent usage is broad: it’s not just coding, but research, planning, content creation, file work, and complex workflows that combine multiple tools over many turns.

8d5.2K546

benahorowitz.eth@bhorowitz

Congrats to @arena on Agent Mode!!

8d17.8K405

Vasek Mlejnsky@mlejva

The Arena team has been cooking

8d5.4K266

MTS@MTSlive

"Reality is the only benchmark that can't be gamed."

We asked @ml_angelopoulos, CEO of @arena, what three years of running the world's largest AI evaluation platform has taught him.

"People were using MMLU, and the models were good on multiple-choice questions, but you put them in the hands of people, and one of them sh-ts the bed completely. It's not the one that did well on the test."

"It's just like people. You put them out in reality and they'll fall on their face, plenty of people who did well in school that will not do that well in real life."

"Lots of models are that way. You put it in front of users, you measure how it does, you look at the post-deployment performance."

8d4K313

The full Agent Arena Leaderboard is here: http://arena.ai/leaderboard/agent

8d2.9K173

Models on the cost-performance Pareto frontier:
- GPT-5.5 (High)
- Claude-Opus-4.7 (Thinking)
- GPT-5.4 (High)
- GPT-5.5
- Claude-Sonnet-4.6
- GLM-5.1
- Qwen-3.6-Plus
- DeepSeek-V4-Flash

Higher-cost models generally deliver stronger agentic performance, but not always. Agent Arena helps measure the trade-off: which models are strongest, which are most efficient, and how the frontier is moving.
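A model sits on the cost-performance Pareto frontier when no other model is both at least as cheap and at least as strong, and strictly better on one of the two. A minimal sketch of that check follows. The model names, costs, and scores are invented placeholders, not figures from the leaderboard, and `pareto_frontier` is a hypothetical helper.

```python
def pareto_frontier(models):
    """models: list of (name, cost, score); lower cost and higher score are better.
    Returns the names of non-dominated models, cheapest first."""
    # Sort by cost ascending; among equal costs, put the highest score first.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier = []
    best_score = float("-inf")
    for name, cost, score in ordered:
        # A model survives only if it beats every cheaper-or-equal model seen so far.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier

# Placeholder data for illustration only.
models = [
    ("cheap", 1.0, 50),
    ("mid", 2.0, 60),
    ("pricey_weak", 3.0, 55),  # dominated by "mid": costs more, scores less
    ("top", 5.0, 80),
]
frontier = pareto_frontier(models)
```

With this sweep, exact ties in both cost and score keep only the first model encountered.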

8d2.1K283

prealpha@inprealpha

The madmen did it. Holy grail of agent eval!

8d1.7K84

MTS@MTSlive

.@ml_angelopoulos of @arena says the intense competition between frontier models has been good for everyone.

"The competition has been very healthy, this space is a flagship example of how the competitive dynamics of capitalism are really good for consumers."

"It's all out war between OpenAI, Anthropic, Google, xAI. But it's really good because they can't move slow or they die. They can't underpay their employees 'cause they leave."

"You get all-star people from all walks of life congregating at these companies and creating amazing models... small decisions magnify."

"For some types of people, hiring them gets escalated to a business decision at the top levels of the company, 'We hire this person, we get 1,000 great researchers because of it.'"

"Those people are getting hundreds of millions in comp, and deservedly so, their BATNA is, 'I'm gonna start my own company and it's gonna be a unicorn from day one.'"

8d3.5K182

Check out our technical blog for the Agent Arena methodology + a deep dive into how people delegate, correct, and steer agents: https://arena.ai/blog/agent-arena-methodology

8d4K172

Che-Ping Tsai@chepingt

Belated career update: I graduated from CMU MLD and have been working at @Arena on AI evaluation.

Fair and principled measurement has long been at the heart of my PhD research, and I’m excited to continue pursuing this direction in the context of agentic AI. Grateful to be part of the team at Arena.

8d841180

Aryan Vichare@aryanvichare10

We're excited to release Agent Mode today!

Agent Mode measures agents on real user tasks in the real world – deep research, slideshow generation, code generation, and many more.

From millions of real-world user traces, we're able to construct the world's first online and large-scale leaderboard for how useful agents actually are in the wild (link in thread).

The team at @arena cooked with this release

8d1.6K160