OpenAI unveils GPT-4o, an all-in-one model with real-time speech and vision capabilities, once again setting a new industry standard

Brain Titan
5 min read · May 14, 2024

--

OpenAI has just released GPT-4o, a new all-in-one AI model for text, images, video, and speech. It responds to the user in real time by voice, can be interrupted at any time, can visually recognize objects and answer quickly based on what it sees, and shows strong logical reasoning. It is 2x faster than GPT-4 Turbo and 50% cheaper!

On traditional benchmarks, GPT-4o matches GPT-4 Turbo in text, reasoning, and coding intelligence, while setting new highs in multilingual, audio, and visual capabilities.

New features in GPT-4o:

  • Experience GPT-4-level intelligence
  • Get responses from both the model and the web
  • Analyze data and create charts
  • Discuss photos you take
  • Upload files for help with summarizing, writing, or analysis
  • Discover and use GPTs and the GPT Store
  • Build a more helpful experience with memory

Main Features and Functions:

  • Model advantages: GPT-4o is the newest flagship model, with GPT-4-level intelligence but faster responses and significantly improved capabilities in text, speech, and vision.
  • Image understanding and discussion: GPT-4o outperforms any existing model at understanding and discussing images shared by users. For example, users can photograph a menu in another language and talk to GPT-4o to translate it, learn about the history and significance of the dishes, and get recommendations.
  • Demonstrations of math skills
  • Availability and user access:
  • Multi-language support: GPT-4o's language capabilities have improved in quality and speed, and ChatGPT now supports sign-up, login, user settings, and more in over 50 languages.
  • User tiers: GPT-4o is currently rolling out to ChatGPT Plus and Team users, with Enterprise availability coming soon. It is also starting to roll out to Free users with usage limits; Plus users have a message limit 5x that of Free users, and Team and Enterprise users have higher limits still.
  • Bringing advanced intelligence and tools to more users
  • Coding and data analysis capabilities

Integrated Interaction Capability:

  • Multimodal inputs and outputs: GPT-4o is the first model to accept any combination of text, audio, and image as input and generate any combination of text, audio, and image as output. This design makes interaction with computers significantly more natural.
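As a rough sketch, a mixed text-and-image request to GPT-4o could be composed as follows. The `image_url` content-part shape follows the Chat Completions format OpenAI documented at launch (verify against the current API reference); the block only builds the request payload and does not call the API, and the URL is a placeholder.

```python
# Sketch: composing a mixed text + image request payload for GPT-4o.
# Only builds the payload; no network call is made.

def build_multimodal_request(question: str, image_url: str) -> dict:
    """Build a chat request mixing a text part and an image part."""
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_multimodal_request(
    "What dishes are on this menu?",
    "https://example.com/menu.jpg",  # placeholder image
)
print(payload["model"])  # gpt-4o
```

The same message structure is what makes the menu-translation scenario above possible: the text question and the photo travel together in a single user turn.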

Performance Improvements and Cost Efficiencies:

  • Response time: GPT-4o responds to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response time in conversation.
  • Efficiency and Cost: In the API, GPT-4o is twice as fast as GPT-4 Turbo, at 50% lower cost, and with a 5x higher processing rate limit.
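To make the "50% lower cost" claim concrete, here is a back-of-the-envelope comparison. The per-million-token prices used below are the published launch rates as I understand them (an assumption; check OpenAI's pricing page for current figures):

```python
# Illustrative cost comparison; prices in USD per 1M tokens are the
# assumed launch rates, not guaranteed to be current.
GPT4_TURBO = {"input": 10.00, "output": 30.00}
GPT4O = {"input": 5.00, "output": 15.00}

def request_cost(prices: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the given per-1M-token prices."""
    return (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000

# A workload of 100k input tokens and 20k output tokens:
turbo_cost = request_cost(GPT4_TURBO, 100_000, 20_000)  # $1.60
gpt4o_cost = request_cost(GPT4O, 100_000, 20_000)       # $0.80
print(f"GPT-4 Turbo: ${turbo_cost:.2f}, GPT-4o: ${gpt4o_cost:.2f}")
```

At these rates the same workload costs exactly half as much on GPT-4o, matching the article's 50% figure.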

Advancements in speech patterns:

  • From multiple models to a single model: unlike the previous voice pipeline, which chained separate models, GPT-4o is trained end to end as a single model that handles all inputs and outputs. This avoids information loss between stages and lets the model directly process intonation, multiple speakers, and background noise, and output laughter, singing, and emotional expression.

Testing and Iteration:

  • Extensive red-team testing: Red-team testing was conducted in collaboration with more than 70 external experts across the fields of social psychology, bias and fairness, and misinformation to identify the risks posed by the addition of the new modality and build safety interventions accordingly.
  • Ongoing risk mitigation: OpenAI will continue to identify and mitigate new risks as they emerge.

Deployment and Availability:

  • Progressive rollout: Text and image functionality for GPT-4o has begun rolling out in ChatGPT. Developers can now also access GPT-4o as a text and visual model through the API.
  • Voice and video capabilities: New audio and video capabilities are planned to be rolled out to a small group of trusted partners in the coming weeks.
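For developers picking up the text-and-vision model through the API, a minimal sketch using the official `openai` Python SDK (v1+) might look like the following. The prompt is a placeholder, an `OPENAI_API_KEY` environment variable is assumed, and the network call is kept inside a guarded entry point so the helper can be used without credentials:

```python
# Minimal sketch of calling GPT-4o as a text model via the official
# `openai` Python SDK (v1+). Requires OPENAI_API_KEY; the network call
# is behind the __main__ guard so the helpers run without it.
MODEL = "gpt-4o"

def make_messages(prompt: str) -> list[dict]:
    """Wrap a user prompt in the chat message format."""
    return [{"role": "user", "content": prompt}]

def ask_gpt4o(prompt: str) -> str:
    """Send one prompt to GPT-4o and return the reply text."""
    from openai import OpenAI  # pip install openai
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=MODEL, messages=make_messages(prompt)
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(ask_gpt4o("Summarize the GPT-4o announcement in one sentence."))
```

The same `model="gpt-4o"` identifier is what selects the new model in place of `gpt-4-turbo`; audio and video are not exposed through this path at launch.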

Some Other Updates

OpenAI is also launching a new ChatGPT desktop app for macOS, designed to integrate seamlessly into whatever you're doing on your computer. With a simple keyboard shortcut (Option + Space) you can instantly ask ChatGPT a question, take a screenshot and discuss it directly in the app, or have a voice or video conversation with ChatGPT right from your computer.

At the same time, ChatGPT's user interface has been refreshed to be friendlier and more conversational, with a new home screen, message layout, and more.

GPT-4o is also the model that was previously tested anonymously on the LMSYS Chatbot Arena under the name "im-also-a-good-gpt2-chatbot".

Below are the results of the test…

Sam Altman on GPT-4o

In his blog post "GPT-4o," Sam Altman highlights two main points:

  1. Making powerful AI tools free or low-cost for users:
  • Part of OpenAI's mission is to put powerful, capable AI tools like ChatGPT in people's hands for free (or at a low price), without distractions such as ads.
  • OpenAI's original vision was to create AI and use it to produce all sorts of benefits for the world. What is happening instead is that OpenAI creates AI that others use to build amazing things that benefit everyone.
  • While OpenAI is a business and will offer plenty of paid products and services, its goal is to provide outstanding AI services free of charge to billions of people around the world.
  2. The new voice (and video) mode is a computer interface like no other:
  • He describes the new voice and video mode as the best computer interface he has ever used; it feels like the AI from the movies, and it is still a little surprising that it is real.
  • Reaching human-level response times and expressiveness turns out to make a significant difference. Interacting with a computer has never felt so natural.
  • The new mode is fast, smart, fun, natural, and helpful, making talking to a computer more natural than ever.
  • With the addition of personalization, access to user information, the ability to take actions on a user's behalf, and more, Sam Altman envisions an exciting future in which we can use computers to do far more than ever before.

In closing, Sam Altman gave special thanks to the team for the tremendous effort they put into making these achievements a reality.

More detailed features and demonstrations: https://openai.com/index/hello-gpt-4o/
