AI safety researcher Dawn Song releases Agents' Last Exam, an agent benchmark where frontier models score under 2.6% · Digg