    English
    6
    11 hours ago

    Wow. 30% accuracy was the high score!
    From the article:

    Testing agents at the office

    For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

    They call it TheAgentCompany. It’s a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

    The CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

    ⚫ Gemini-2.5-Pro (30.3 percent)
    ⚫ Claude-3.7-Sonnet (26.3 percent)
    ⚫ Claude-3.5-Sonnet (24 percent)
    ⚫ Gemini-2.0-Flash (11.4 percent)
    ⚫ GPT-4o (8.6 percent)
    ⚫ o3-mini (4.0 percent)
    ⚫ Gemini-1.5-Pro (3.4 percent)
    ⚫ Amazon-Nova-Pro-v1 (1.7 percent)
    ⚫ Llama-3.1-405b (7.4 percent)
    ⚫ Llama-3.3-70b (6.9 percent)
    ⚫ Qwen-2.5-72b (5.7 percent)
    ⚫ Llama-3.1-70b (1.7 percent)
    ⚫ Qwen-2-72b (1.1 percent)

    “We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks,” the authors state in their paper.

    • @[email protected]
      English
      -15 hours ago

      Sounds like the fault of the researchers for not building better tests, or for not understanding the limits of the software well enough to use it right.

      • @[email protected]
        English
        14 hours ago

        Are you arguing they should have built a test that makes AI perform better? How are you offended on behalf of AI?