CMU Researchers Fully Staffed a Virtual Company With Current AI Agents - It Didn't Go Well.
- Jamie Ryding
- May 4

As an experiment, Carnegie Mellon researchers staffed a virtual software company entirely with AI agents built on current models (including those from Google, OpenAI, Anthropic, and Meta) to see how well they collectively handled business tasks in finance, software development, project management, and HR. Although accurate at short tasks, the agents struggled to complete complex ones.
In one example, an AI agent couldn't find the right individual in the company chat, so it simply renamed another agent to the name of the person it was looking for. Claude 3.5 Sonnet seemed to perform the best, and Amazon's Nova Pro v1 the worst.
The short article from Futurism is here:
The complementary viewpoint, though, is the pace at which models are improving at handling complex tasks. The organization METR has developed a metric that uses the length of time a human takes to complete a task (as a measure of complexity) combined with a 50% success rate for an AI agent. METR estimates that the task length an agent can handle at that 50% success rate has been doubling every seven months for generalist AIs. So, AI agents will inevitably get better at business operations.
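To make that trend concrete, here is a minimal sketch of the extrapolation in Python. The seven-month doubling period is METR's estimate; the 60-minute starting horizon is a hypothetical figure chosen purely for illustration, not a number from the article.

```python
# A rough sketch of the METR doubling trend. The starting horizon
# (60 minutes) is an illustrative assumption, not a figure from METR.

DOUBLING_MONTHS = 7.0  # METR's estimated doubling time for generalist AIs


def projected_horizon(horizon_minutes: float, months_ahead: float) -> float:
    """Extrapolate the task length (in human-equivalent minutes) an agent
    can complete at a 50% success rate, assuming exponential growth."""
    return horizon_minutes * 2 ** (months_ahead / DOUBLING_MONTHS)


# If an agent handles 60-minute tasks at 50% success today, then in two years:
print(f"{projected_horizon(60, 24):.0f} min")  # ~646 min, roughly 10.8 hours
```

Under those assumptions, an agent limited to hour-long tasks today would be handling day-long tasks within a couple of years, which is why the doubling rate matters more than any single benchmark snapshot.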
My recent experience is that LLMs really do help with the legwork of researching topics in the scientific literature, but they still require curation effort before and after the result. We can already see that specialized systems sitting on top of that basic capability will make the interaction easier and more streamlined, letting us spend more time innovating and making connections between existing information. Interesting times for sure.