Abstract
The processes underlying human cognition are often divided into System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the o1 model series, designed to handle System 2-like reasoning. While OpenAI’s benchmarks are promising, independent validation is still needed. In this study, we tested the o1-preview model twice on the Dutch ‘Mathematics B’ final exam. It achieved near-perfect scores of 76 and 74 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76, well above the Dutch students’ average of 40.63 points. Neither model had access to the exam figures. Since there was a risk of model contamination (i.e., the knowledge cutoff for o1-preview and GPT-4o was after the exam was published online), we repeated the procedure with a new Mathematics B exam that was published after the cutoff date. The results again indicated that o1-preview performed strongly (97.8th percentile), which suggests that contamination was not a factor. We also show that there is some variability in the output of o1-preview: sometimes there is ‘luck’ (the answer is correct) and sometimes ‘bad luck’ (the output diverges into something incorrect). We demonstrate that the self-consistency approach, where repeated prompts are given and the most common answer is selected, is a useful strategy for identifying the correct answer. It is concluded that while OpenAI’s new model series holds great potential, certain risks must be considered.
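The self-consistency approach described above can be sketched in a few lines: sample the model several times on the same prompt and take a majority vote over the answers. The sketch below is a minimal illustration, not the authors' actual pipeline; `ask_model` is a hypothetical stand-in for any function that queries a language model and returns its final answer as a string.

```python
from collections import Counter

def self_consistency(ask_model, prompt, n_samples=5):
    """Query the model n_samples times on the same prompt and
    return the most common answer plus its agreement rate."""
    answers = [ask_model(prompt) for _ in range(n_samples)]
    most_common_answer, count = Counter(answers).most_common(1)[0]
    return most_common_answer, count / n_samples

# Toy stand-in for a model call: repeated prompts yield varying
# outputs, mimicking the 'luck'/'bad luck' variability noted above.
def toy_model(prompt, _replies=iter(["42", "42", "41", "42", "43"])):
    return next(_replies)

answer, agreement = self_consistency(toy_model, "What is 6 * 7?")
print(answer, agreement)  # majority answer "42", agreement 0.6
```

Here the occasional incorrect outputs ("41", "43") are outvoted by the three correct ones, which is exactly why repeated prompting helps identify the correct answer when individual runs vary.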
| Original language | English |
|---|---|
| Article number | 278 |
| Number of pages | 13 |
| Journal | Computers |
| Volume | 13 |
| Issue number | 11 |
| DOIs | |
| Publication status | Published - 2024 |
Keywords
- large language models
- reasoning
- mathematics
- chain of thought
Datasets
- Supplementary data for the paper 'System 2 thinking in OpenAI’s o1-preview model: Near-perfect performance on a mathematics exam'
de Winter, J. C. F. (Creator), Dodou, D. (Creator) & Eisma, Y. B. (Creator), TU Delft - 4TU.ResearchData, 23 Sept 2024
DOI: 10.4121/2e663686-f656-4ff2-bb21-567ba4d4f03e
Dataset/Software: Dataset