A test so difficult that no AI system can pass it – yet

If you’re looking for a new reason to worry about artificial intelligence, try this: some of the smartest people in the world are trying to create tests that AI systems can’t pass.

For years, AI systems were measured by giving new models a series of standardized benchmark tests. Many of these tests consisted of challenging, SAT-caliber problems in areas such as math, science, and logic. Comparing the results of the models over time served as a rough measure of AI progress.

But the AI systems eventually got too good at those tests, so new, harder tests were created — often with the kinds of questions graduate students might encounter on their exams.

Even these tests haven’t held up. New models from companies like OpenAI, Google, and Anthropic have scored high on many Ph.D.-level challenges, limiting the usefulness of those tests and raising a nagging question: Are AI systems getting too smart for us to measure?

This week, researchers at the Center for AI Safety and Scale AI are offering a possible answer to that question: a new evaluation, called “Humanity’s Last Exam,” which they claim is the hardest test ever administered to AI systems.

Humanity’s Last Exam is the brainchild of Dan Hendrycks, a noted AI safety researcher and director of the Center for AI Safety. (The test’s original name, “Humanity’s Last Stand,” was rejected for being too dramatic.)

Mr. Hendrycks worked with Scale AI, an AI company where he is a consultant, to develop the test, which consists of approximately 3,000 multiple-choice and short-answer questions designed to test the capabilities of AI systems in fields ranging from analytic philosophy to rocket engineering.

The questions were submitted by experts in these fields, including college professors and prize-winning mathematicians, who were asked to come up with extremely difficult questions to which they knew the answers.

Here, try your hand at a hummingbird anatomy question from the exam:

Hummingbirds within the Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate insertion aponeurosis of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Or, if physics is more your thing, try this:

A block is placed on a horizontal rail, along which it can slide without friction. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both of these quantities can be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?

(I’d print the answers here, but that would spoil the test for any AI systems being trained on this column. Also, I’m not smart enough to verify the answers myself.)

Questions for Humanity’s Last Exam went through a two-step filtering process. First, the submitted questions were given to leading AI models to solve.

If the models couldn’t answer them (or if, in the case of multiple-choice questions, the models did worse than random guessing), the questions were given to a group of human reviewers, who refined them and verified the correct answers. The experts who wrote the highest-rated questions were paid between $500 and $5,000 per question and received credit for contributing to the exam.

Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a series of questions to the test. Three of his questions were selected, which he told me were “at the high end of what might be seen in a graduate examination.”

Mr. Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create harder AI tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety adviser to Mr. Musk’s AI company, xAI.) Mr. Musk, he said, raised concerns about the existing tests given to AI models, which he thought were too easy.

“Elon looked at the MMLU questions and said, ‘These are graduate level. I want things that a world-class expert can do,’” said Mr. Hendrycks.

There are other tests that attempt to measure advanced AI skills in certain fields, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by AI researcher François Chollet.

But Humanity’s Last Exam aims to determine how good AI systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score.

“We’re trying to assess the degree to which AI can automate a lot of really hard intellectual work,” said Mr. Hendrycks.

After the list of questions was compiled, the researchers gave Humanity’s Last Exam to six leading AI models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system scored the highest of the bunch, at 8.3 percent.

(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to AI systems. OpenAI and Microsoft have denied the claims.)

Mr. Hendrycks said he expected those scores to rise quickly, possibly surpassing 50 percent by the end of the year. At that point, he said, AI systems might be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure AI’s impact, such as examining economic data or judging whether it can make new discoveries in fields like math and science.

“You can imagine a better version of this where we can give questions that we don’t know the answers to yet, and we’re able to verify whether the model is able to help solve them for us,” said Summer Yue, Scale AI’s director of research and an organizer of the exam.

Part of what’s so confusing about AI progress these days is how uneven it is. We have AI models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Mathematical Olympiad and beating top human programmers in competitive coding challenges.

But those same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry. This has given them a reputation for being astoundingly good at some things and completely useless at others, and it has created wildly different impressions of how fast AI is improving, depending on whether you look at the best results or the worst.

This unevenness has also made these models difficult to measure. I wrote last year that we need better evaluations for AI systems. I still believe that. But I also believe we need more creative methods of tracking AI progress that don’t rely on standardized tests, because most of what humans do, and what we fear AI will do better than us, can’t be captured on a written exam.

Mr. Zhou, the theoretical particle physics researcher who submitted questions to Humanity’s Last Exam, told me that while AI models were often impressive at answering complex questions, he did not consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers.

“There’s a big gap between what it means to pass an exam and what it means to be a practicing physicist and researcher,” he said. “Even an AI that can answer these questions may not be ready to help with research, which is inherently less structured.”
