AI dun goofed
businessinsider.com
AI agent did only 24% of the work it was told to do.
It was supposed to analyze a database, go through the contents of a discussion between humans, and take notes on it.
The 24% AI was the best! Others were even worse.
Contestants are: Claude (24%), Gemini Flash and GPT-4o (OpenAI)
Researchers now say this does not fully show the capabilities of the different AIs. The AIs did not fully understand what they were supposed to do; if they had, they would have done it. Claude is the best at understanding human language, so that's why it did best. The others are just as good if you manage to get them to understand the exact situation.