Discover OpenAI’s New O1 Inference Model: Dive into the Technical Report Now!

Explore OpenAI’s new o1-preview reasoning models, surpassing GPT-4o in complex problem-solving and expert-level reasoning

Brain Titan
10 min readSep 13, 2024

OpenAI has released a new series of “o1-preview” reasoning models, a series of AI designed to solve complex problems and capable of complex reasoning.

Compared to previous models, these new models spend more time thinking before responding, and have outstanding performance in fields such as science, coding, and mathematics. According to official reports, its reasoning performance far exceeds that of GPT-4o, and it can surpass the level of human experts in many benchmark tests.

The new reasoning models learn to spend more time reasoning about problems, try different strategies, and correct mistakes, just like humans do. They learn through training to analyze problems more effectively, try multiple strategies, and be able to identify and correct mistakes. In this way, the models are able to perform well on more complex tasks.

o1-preview

Technical Principle of OpenAI o1-preview

Large-scale reinforcement learning algorithms

OpenAI uses a large-scale reinforcement learning algorithm to train the o1-preview model. Through efficient data training, the algorithm allows the model to learn how to use the “Chain of Thought” to think about problems productively. During the training process, the model will continuously optimize its chain of thought through reinforcement learning, ultimately improving its problem-solving ability.

OpenAI found that the performance of the o1 model will significantly improve with the increase of reinforcement learning time (computation during training) and inference time (computation during testing). This inference-based training method is different from the traditional large-scale language model (LLM) pre-training method and has unique scalability advantages.

o1 performance improves steadily in both training-time and test-time calculations

Chain of Thought

The o1-preview model significantly enhances its ability in complex reasoning tasks through thought chaining . The basic concept of thought chaining is similar to the process of humans thinking about difficult problems: breaking down the problem step by step, trying different strategies, and correcting mistakes. Through reinforcement learning training, o1-preview is able to think deeply before answering questions and gradually refine the steps.

This reasoning method significantly improves o1-preview’s performance in complex tasks. For example, o1-preview can identify the key steps in a problem through thought chains and solve them step by step. This reasoning mode is particularly suitable for tasks that require multi-step reasoning, such as complex math problems or difficult programming tasks.

For example:

  • On some complex problems, o1-preview can gradually break down the difficulty of the problem and eventually find the correct answer. This is very similar to the way humans think step by step when facing challenging problems.

Evaluation and Benchmarking of o1-preview

Evaluation and Benchmarking of o1-preview

In OpenAI’s internal tests, the next-generation models performed at nearly PhD-level levels in solving complex problems, particularly in tasks in subjects like physics, chemistry, and biology.

AIME (American Invitational Mathematics Examination): In the qualifying exam for the International Mathematical Olympiad (IMO), the GPT-4o model only solved 13% of the problems correctly, while the new reasoning model solved 83% of the problems correctly.

  • GPT-4o only solved 12% of the questions (an average of 1.8 questions answered for every 15 questions).
  • o1-preview solves 74% of the problems on average (11.1/15), far surpassing GPT-4o.
  • When using the consensus evaluation method (64 sample consensus), the solution rate of o1-preview increased to 83%.
  • After rescoring 1,000 examples, the model’s final score reached 93% (13.9/15), which is enough to put it among the top 500 high school students in the United States and surpass the qualifying score for the United States Mathematical Olympiad.

GPQA (Expert-Level Test in Physics, Chemistry, and Biology): On the GPQA-diamond benchmark, o1-preview surpassed the performance of PhD-level experts, becoming the first AI model to outperform a PhD on this benchmark. This does not mean that o1 is stronger than PhDs in all tasks, but that it demonstrates a level of ability to solve problems that exceeds PhDs.

  • To make a fair comparison, OpenAI recruited experts with PhDs to answer questions on the GPQA-diamond benchmark. o1-preview successfully surpassed these human experts, becoming the first AI model to surpass PhD-level performance on this benchmark.
  • It should be noted that this does not mean that the o1-preview model is stronger than PhD experts on all tasks, but it shows that it has the ability to surpass experts on certain specific problems.

MMLU (Multi-Task Language Understanding): o1-preview surpasses GPT-4o in 54 out of 57 subcategories. In particular, when visual perception is enabled, the o1 model achieves 78.2% performance on the MMLU benchmark, competing with human experts for the first time.

  • GPT-4o surpasses o1-preview in only 3 out of 57 subcategories.
  • o1-preview outperforms GPT-4o in 54 subcategories, demonstrating its broader reasoning capabilities.
  • Especially when the visual perception function is enabled, o1-preview scores 78.2% in MMLU , which is the first AI model performance that can compete with human experts.
comparison between gpt40 and o1

Coding ability: The new model also performs very well in coding ability. In the Codeforces programming competition, the o1 model also performed well, surpassing 93% of competitors. Especially its programming ability, after reinforcement learning, o1 can efficiently solve complex algorithmic problems.

  • In the 2024 International Olympiad in Informatics (IOI), OpenAI trained a model based on o1-preview to participate in the competition and compete with human players under the same conditions.
  • The model scored 213 points in the competition, ranking in the 49th percentile, outperforming most of the competitors.
  • The model solved six complex algorithmic problems in 10 hours, and each problem allowed 50 submissions. The model’s performance was significantly improved through multiple sample submissions.
  • In the Codeforces programming competition, the o1-preview model achieved an Elo score of 1807, which puts it ahead of 93% of human competitors.
  • In contrast, GPT-4o’s Elo score is only 808 , which is in the 11th percentile of human contestants .
  • Through these evaluations, o1-preview demonstrates its significant advantages in programming tasks, especially in solving complex algorithmic and logical problems.
percentile

Human Preference Assessment: In addition to academic benchmarks, OpenAI also conducted a human preference evaluation by showing o1-preview and GPT-4o anonymous answers to the same questions to human reviewers, who then selected the answer they preferred based on the quality of the answer.

  • In domains involving reasoning tasks (such as data analysis, coding, mathematics, etc.), human reviewers clearly prefer the answers of the o1-preview model.
  • However, GPT-4o outperforms o1-preview in some natural language processing tasks, which shows that o1-preview is not suitable for all application scenarios, especially in language generation and natural language understanding.
human preference by domain o1-preview vs gpt-4o

You can read more detailed data in OpenAI’s technical research post.

Applicable Users

The new reasoning model will be particularly suitable for handling complex problems in science, programming, mathematics and other fields. Here are some possible application scenarios:

  • Medical field: Researchers can use the o1-preview model to annotate complex cell sequencing data.
  • Physics: Physicists can use this model to generate complex mathematical formulas, especially calculations in the field of quantum optics.
  • Developers: In the development field, o1-preview can help developers build and execute multi-step workflows, simplifying the processing of complex tasks.

OpenAI o1-mini

To meet the needs of developers, OpenAI also released OpenAI o1-mini , a smaller and faster inference model focused on code generation and debugging. The o1-mini model is cheaper than o1-preview, with a cost reduction of 80% , and is suitable for application scenarios that require reasoning capabilities but do not require extensive world knowledge.

Advantages of o1-mini

  • The model is particularly well suited for coding tasks and can accurately generate and debug complex code.
  • The o1-mini requires fewer computing resources, so it excels in applications that require efficiency, speed, and cost control.
  • o1-mini80% is a smaller but efficient inference model that costs less than OpenAI’s o1-preview and o1, but has nearly the same inference capabilities in STEM fields as o1.
  • Today, o1-mini is officially released to API level 5 users, with a more competitive price than o1-preview.
  • ChatGPT Plus, Teams, Enterprise, and Education users can also use o1-mini as an alternative to o1-preview with higher rate limits and lower latency.
math performance vs inference cost

Optimizing STEM Reasoning

Compared to large language models such as o1, o1-mini is optimized for STEM reasoning tasks. While large models such as o1 have extensive world knowledge, they can be expensive and slow to run in real applications. In contrast, o1-mini is optimized to focus on reasoning tasks and excels in areas such as mathematics and coding.

o1-mini uses the same computationally expensive reinforcement learning (RL) pipeline as o1 during pre-training, resulting in similar performance on many reasoning tasks, but at a significantly lower cost. Although o1-mini performs worse on tasks requiring non-STEM knowledge, its performance is very close to o1-preview and o1 in the field of STEM reasoning.

Optimizing STEM Reasoning

Mathematical performance vs reasoning cost

The o1-mini performed well on several STEM benchmarks, especially on math and programming tasks, showing strong reasoning capabilities.

  1. Mathematical performance: In the AIME (American Invitational Mathematics Exam) high school math competition, o1-mini scored 70.0%, close to o1’s 74.4% and significantly higher than o1-preview’s 44.6%. o1-mini’s performance (solving about 11/15 problems) puts him in the top 500 high school students in the United States.
  2. Programming performance:
  • On the Codeforces programming competition website, o1-mini reached 1650 Elo, close to o1’s 1673 Elo and higher than o1-preview’s 1258 Elo. This Elo score puts o1-mini among 86% of programmers on the Codeforces platform.
  • o1-mini performed well on the HumanEval programming benchmark and the high school-level cybersecurity Capture the Flag challenge (CTF).
o1-mini performed well on the HumanEval programming benchmark

3. Academic Reasoning: On some academic reasoning benchmarks, such as GPQA (science) and MATH-500, o1-mini performs better than GPT-4o, but due to the lack of extensive world knowledge, o1-mini performs worse than GPT-4o on tasks such as MMLU (multi-task language understanding) and lags behind o1-preview.

4. Human preference evaluation: In the test comparing o1-mini and GPT-4o in various domains by human reviewers, the same method as the comparison of o1-preview and GPT-4o is used . In the domains heavy on reasoning, o1-mini is more popular than GPT-4o, but in the domains focused on language, o1-mini is less popular than GPT-4o.

human preference evaluation vs chatgpt-4o-latest

Performance comparison

  • AIME Mathematics Competition: o1-mini scored 70.0% , close to o1’s 74.4% , and significantly higher than o1-preview’s 44.6% .
  • Codeforces Programming: o1-mini’s Elo score is 1650 , close to o1’s 1673 and better than o1-preview’s 1258 .
  • HumanEval Programming Benchmark: o1-mini’s accuracy is 92.4% , which is the same as o1-preview and higher than GPT-4o’s 90.2% .
  • Cybersecurity CTF: o1-mini performed at 43.0% , higher than o1-preview’s 28.7% and GPT-4o’s 20.0%.

Model speed

As a specific example, we compared the responses of GPT-4o, o1-mini, and o1-preview on a word inference question. While GPT-4o did not answer it correctly, o1-mini and o1-preview both answered it correctly, and o1-mini answered it about 3–5 times faster.

speed comparison between GPT-40, o1-mini, o1-preview

Limitation of o1-preview

  • Limits: o1-preview 30/week, o1-mini 50/week, T5 developers can access its API, up to 20 concurrent requests per minute.
  • Does not support web browsing, file and image uploading, drawing
  • The API does not support fields such as system and tool, and methods such as json mode and structured output.
  • The model says it has a maximum output of 32k/64k, but the actual output is far less than that.
  • From the perspective of actual testing, it was found that o1 is not so much a model as it is an agent based on GPT-4o.

Prices and Restrictions of o1 models

Currently, the o1 series models can be accessed through the ChatGPT web version or API:

o1-preview:

  • 128k context.
  • 32k max output.
  • Reasoning models designed to solve complex problems in various fields.
  • The training data is as of October 23.

o1-mini:

  • 128k context.
  • 64k maximum output.
  • A faster, more economical reasoning model that excels at programming, math, and science.
  • The training data is as of October 23.
o1-previes & o1-mini
  • For the ChatGPT web version, only Plus and Team users can access it now. For Enterprise and Edu users, you will have to wait another week:
  • o1-preview: 30 items/week
  • o1-mini: 50 pieces/week
how to use openai o1
  • For API users, if your level is Tier 5 (payment amount > $1,000), you can now call through the interface:
  • o1-preview: 20 RPM, 30,000,000 TPM
  • o1-mini: 20 RPM, 150,000,000 TPM
for API users

Some Cases Using o1

……

For more use cases of o1↓

More about AI: https://kcgod.com

--

--