Two years ago, in a project called the Beyond the Imitation Game benchmark, or BIG-bench, 450 researchers compiled a list of 204 tasks designed to test the capabilities of large language models, which power chatbots like ChatGPT. On most tasks, performance improved predictably and smoothly as the models scaled up—the larger the model, the better it got. But on other tasks, the improvement wasn’t smooth: performance remained near zero for a while, then jumped abruptly.
But the Stanford researchers point out that the LLMs were judged only on accuracy: Either they could do it perfectly, or they couldn’t. So even if an LLM predicted most of the digits correctly, it failed. That didn’t seem right. If you’re calculating 100 plus 278, then 376 seems like a much more accurate answer than, say, −9.34. So instead, Koyejo and his collaborators tested the same task using a metric that awards partial credit.
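The idea of partial credit can be made concrete. One simple way to score an arithmetic answer, rather than the all-or-nothing accuracy described above, is to count how many digit positions match the correct answer. This is a minimal sketch of that idea, not the researchers' actual metric; the function name and scoring rule are illustrative assumptions.

```python
def digit_partial_credit(prediction: str, target: str) -> float:
    """Award partial credit: the fraction of character positions
    (aligned from the left) where the prediction matches the target.
    Exact answers score 1.0; near-misses score proportionally
    instead of dropping straight to 0. Hypothetical metric for
    illustration, not the Stanford team's exact formula."""
    if not target:
        return 0.0
    matches = sum(p == t for p, t in zip(prediction, target))
    # Divide by the longer string so extra or missing digits are penalized.
    return matches / max(len(prediction), len(target))

# The correct answer to 100 + 278 is 378.
print(digit_partial_credit("378", "378"))   # exact match: full credit
print(digit_partial_credit("376", "378"))   # two of three digits right
print(digit_partial_credit("-9.34", "378")) # nothing right: no credit
```

Under a graded metric like this, an answer of 376 scores far better than −9.34, which is exactly the intuition in the passage above; with partial credit, model performance tends to improve gradually with scale rather than jumping suddenly.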