AI safety researcher Dawn Song releases Agents' Last Exam, an agent benchmark where frontier models score at most 2.6% on the hardest tier

Views: 188K · Bookmarks: 374 · Likes: 749 · Retweets: 157 · Replies: 48

Dawn Song@dawnsongtweets

Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is that really the case?

Over the past many months, my group and collaborators have been building Agents' Last Exam (ALE), a benchmark designed to test exactly that claim on real digital labor-market work.

My group and collaborators have previously created many of the benchmarks the field runs on, including MMLU, MATH, CyberGym, and ExploitGym. Today, I'm excited to share Agents' Last Exam (ALE): a rolling benchmark that measures whether AI agents can actually perform economically valuable work across a broad range of real-world domains.

With ALE, we evaluated Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems across more than 1,500 expert-sourced tasks spanning 55 occupations. The result is both impressive and sobering.

Today's agents can solve a meaningful fraction of professional tasks. But when we look at the hardest tasks, the ones requiring sustained reasoning, deep domain expertise, and reliable execution over long horizons, they are still far from human-level performance.

On ALE's hardest tier, every frontier agent we tested, including Fable 5, achieved a 0% success rate.

The age of useful agents is here. The age of truly job-ready agents is not.

We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵

Noam Brown@polynoamial

I'm happy GPT-5.5 tops this eval

I'm even happier it's still doing the best when measured vs tokens, cost, or wall-clock time!

Philipp Schmid@_philschmid

The last benchmark for agents? Agents' Last Exam (ALE) evaluates agents on 1,000+ real-world professional tasks across 55 industries, all sourced from actual expert work. Not synthetic. Not multiple choice. Real deliverables, graded deterministically.

Key findings:
- Best agents score <50% on the easiest tier, <10% on the hardest
- 82% on Terminal-Bench drops to 23% on ALE-CLI eval with the same setup
- Hardest tier: most frontier agents hit 0% pass rate
- Spending more tokens doesn't improve results
- Each run tracks harness, model, pass rate, token usage, and cost

Harness vs. model:
- Best harness scores 24.0%, worst scores 19.1% (same model). That's a 4.9pp gap.
- Model choice drives more performance variation than the harness.
- Most efficient setup used 160M tokens for 39.6%. Least efficient burned 1,373M tokens for 40.5%.

Where agents break (agents often say "Done. All checks pass." while the output is wrong):
- 47% of failures: wrong strategy or gave up early
- 31%: missing domain knowledge
- 22%: execution bugs and format errors
- 34% of tasks need GUI software; agents avoid it and hack CLI workarounds

Very excited to see a benchmark like this. Big kudos to everyone who contributed.
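As a quick sanity check on the harness and token figures quoted in this post, the arithmetic can be reproduced in a few lines. This is only an illustrative sketch: the variable and function names are made up here, and the numbers are copied from the post rather than from the benchmark's own data.

```python
# Figures as quoted in the post; names are illustrative, not from ALE.
best_harness, worst_harness = 24.0, 19.1  # pass rate (%), same model

# Gap between best and worst harness, in percentage points.
gap_pp = round(best_harness - worst_harness, 1)

# (tokens in millions, pass rate in %) for the two setups mentioned.
efficient = (160, 39.6)
inefficient = (1373, 40.5)


def points_per_100m_tokens(tokens_m, pass_rate):
    """Pass-rate points obtained per 100M tokens spent."""
    return pass_rate / tokens_m * 100


# How many times more tokens the least efficient setup used,
# and how many extra pass-rate points that bought.
token_ratio = inefficient[0] / efficient[0]
extra_points = round(inefficient[1] - efficient[1], 1)

print(f"harness gap: {gap_pp} pp")
print(f"token ratio: {token_ratio:.1f}x for +{extra_points} pp")
```

On these numbers, the least efficient setup spends roughly 8.6 times the tokens for under one additional percentage point, which is consistent with the claim that spending more tokens doesn't improve results.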

Yiyou Sun@YiyouSun

“AI agents will outperform humans at almost all jobs by 2026–2027.” The forecast is everywhere. So we built the exam to test that claim on real, labor-market-aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇

Ramez Naam@ramez

We are a long, long way from superintelligence.

Zengyi Qin@qinzytech

Claude Fable 5 is still on the floor on Agents’ Last Exam (ALE)

Our hardest tier remains unsolved. Claude Fable 5 scores 0%, same as GPT-5.5

Dawn Song@dawnsongtweets

ALE is built from real work, not synthetic tasks. Every task is derived from a real project that a human expert previously completed, and converted into a verifiable evaluation with objective grading.

No vibes. No human judges. Fully reproducible.

ALE spans 55 non-physical occupations, grounded in O*NET / SOC 2018, the U.S. federal occupational taxonomy.

Built with 300+ experts from 100+ institutions across science, engineering, medicine, law, finance, education, and many other fields.

AK@_akhaliq

Agents' Last Exam

Dawn Song@dawnsongtweets

In ALE, Fable 5 joins GPT-5.5 and Composer 2.5 in the same overall performance cluster.

But performance is only half the story.

Cost per task:
→ Fable 5: ~$15.70
→ GPT-5.5: ~$3.80
→ Composer 2.5: ~$1.33

At current pricing, Fable 5 delivers similar performance while costing roughly 4–12× more per completed task.
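The "roughly 4–12×" figure follows directly from the per-task costs listed in this post. A minimal sketch of that arithmetic, assuming the quoted dollar amounts are comparable per-task averages (the dictionary and variable names are illustrative):

```python
# Approximate cost per task in USD, as quoted in the post.
cost_per_task = {"Fable 5": 15.70, "GPT-5.5": 3.80, "Composer 2.5": 1.33}

baseline = cost_per_task["Fable 5"]

# How many times more Fable 5 costs than each of the other systems.
ratios = {
    name: baseline / cost
    for name, cost in cost_per_task.items()
    if name != "Fable 5"
}

for name, ratio in ratios.items():
    print(f"Fable 5 vs {name}: {ratio:.1f}x")
```

The two ratios come out at about 4.1× (against GPT-5.5) and about 11.8× (against Composer 2.5), matching the range stated above.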

Dawn Song@dawnsongtweets

ALE-CLI is a CLI-only subset of ALE. Compared to Terminal-Bench and SWE-bench-Pro, it is broader, longer-horizon, and substantially more challenging:

• Broader. Tasks span 40 of ALE's 55 industry subdomains, compared to just 6 in Terminal-Bench and 5 in SWE-bench-Pro.

• Longer-horizon. Human completion times range from hours to weeks, rather than minutes to days.

• Harder. The best-performing agent achieves only a 25.2% pass rate, compared to 82.0% on Terminal-Bench and 59.1% on SWE-bench-Pro.

There's still a long way to go, and plenty of headroom left to climb. 📊👇

finbarr@finbarrtimbers

Very impressive showing by Cursor here

Dawn Song@dawnsongtweets

The most common failure mode remains a familiar one: Agents declare success before they've truly verified their work.

A typical completion reads: "Done. All checks pass." Yet the output may be missing required files, contain incorrect counts, omit key fields, or violate explicit constraints in the task specification.

These failures occur far more often than many people expect. You can explore concrete examples in https://agents-last-exam.org/blogs/agent-showdown.
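To make this failure mode concrete, here is a hypothetical sketch of the kind of self-check an agent could run before reporting success. It is not ALE's grader: the function name, the JSON report layout (a list of records), and the parameters are all assumptions chosen for illustration. It covers three of the problems listed above: missing files, incorrect counts, and omitted fields.

```python
import json
from pathlib import Path


def verify_deliverable(out_dir, required_files, report_name,
                       required_fields, expected_rows):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical checker: assumes the report is a JSON list of objects.
    """
    out = Path(out_dir)

    # 1. Every required file must exist.
    problems = [f"missing file: {name}" for name in required_files
                if not (out / name).is_file()]

    report = out / report_name
    if not report.is_file():
        problems.append(f"missing report: {report_name}")
        return problems

    rows = json.loads(report.read_text())

    # 2. The record count must match what the task specified.
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    # 3. Every record must carry all required fields.
    for i, row in enumerate(rows):
        missing = [field for field in required_fields if field not in row]
        if missing:
            problems.append(f"row {i} missing fields: {missing}")

    return problems
```

An agent following this pattern would only say "Done" when the returned list is empty, and would otherwise report the specific problems instead of a blanket success message.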

Dawn Song@dawnsongtweets

Why do ALE's results look different from some other benchmarks, especially for Fable 5?

Because there is no universally best agent.

Every frontier model, including Fable 5, has domains where it shines and domains where it struggles.

Aggregate scores average over 55 occupations and 1,500+ tasks, causing many models to cluster together. But the average is not the story.

The real signal lies in where agents succeed, where they fail, and how those patterns differ across domains. On identical tasks, different models often fail for very different reasons.

Explore the interactive breakdown in our blog → 👉 https://agents-last-exam.org/blogs/agent-showdown

Dawn Song@dawnsongtweets

How does ALE compare to existing agent benchmarks? Many of today's agent benchmarks are rapidly saturating as frontier systems improve.

ALE is designed to measure a different capability frontier: sustained, economically valuable work in real-world professional domains.

• 55 industry domains
• 1,500+ expert-sourced tasks
• Full GUI + CLI environments
• Outcome-based, verifiable evaluation

If your agent only operates in the terminal, we've also released ALE-CLI: a CLI-only subset of the benchmark.

AK@_akhaliq

paper: https://huggingface.co/papers/2606.05405

Yiyou Sun@YiyouSun

6/ Come test your agents on ALE →
Website: http://agents-last-exam.org
Task Samples: http://agents-last-exam.org/demo
Paper: https://arxiv.org/abs/2606.05405
HuggingFace: https://huggingface.co/datasets/agents-last-exam/agents-last-exam
Code: http://github.com/rdi-berkeley/agents-last-exam

Dawn Song@dawnsongtweets

Why "Last Exam"? The name has two meanings:

"Last" as the bar to clear: passing these exams means an agent can actually do the job and continue to deliver economically valuable work in that profession.

"Last" as the frontier of difficulty: tasks are real, complex, long-horizon, and require professional expertise to execute. ALE sits right at the edge of what today's agents can reliably accomplish.

Come test your agent on ALE →
Website: https://agents-last-exam.org
Tasks: https://agents-last-exam.org/demo
Leaderboard: https://agents-last-exam.org/leaderboard
Paper: https://arxiv.org/abs/2606.05405
Dataset: https://huggingface.co/datasets/agents-last-exam/agents-last-exam
Code: https://github.com/rdi-berkeley/agents-last-exam

Peter Welinder@npew

Two takeaways for me: (1) GPT-5.5 is a beast, and (2) lots of work left.

Yiyou Sun@YiyouSun

1/ Where do the tasks come from?

Every task is a real project that a human expert has already shipped, turned into a code-graded test.

No vibes, no human judge, fully reproducible. Spanning 55 non-physical industries, grounded in O*NET / SOC 2018 (the U.S. federal occupational taxonomy).

Built by 300+ experts across 100+ institutions.

Dawn Song@dawnsongtweets

ALE is truly a community effort.

Huge thanks to a distinguished advisory committee guiding our industry landscape and task collection:

• @gallantlab, @thg_lab, Tarek Zohdi, Carl Boettiger & @ksteinfe (@UCBerkeley)
• Laure Zanna, @kaanozbay (@nyuniversity)
• George Em Karniadakis (@BrownUniversity)
• Tapio Schneider (@Caltech)
• @Idasim (@UCSF)
• Arvind Rao (@UMich)
• @yannakakis (@UMmalta)
• Patrick Bryant (@scilifelab)
• @yaminirangan (@HubSpot)
• @brad_rothenberg (@nTopology)

We are also deeply grateful to @BerkeleyRDI, RDI Foundation, @ChenInstitute, @UniPat_AI, and @SnorkelAI (Open Benchmarks Grants program) for their support.

A huge thank you as well to our incredible organizing and execution team, and to all of the experts and contributors who donated their time, expertise, and real-world projects to make ALE possible.

This simply would not have happened without you.

Daniel Jeffries@Dan_Jeffries1

TLDR:

Are agents "job ready" soon? Or how about "recursively self-improving"?

No.

Long version: In the novels of Iain M. Banks and in Neuromancer, yes.

In reality, refer to TLDR answer and do not pass go.
