Jensen Huang and Ilya Sutskever: Discovering the World Model Through LLMs
In an interview at the beginning of the year between Nvidia CEO Jensen Huang and Ilya Sutskever, Ilya made a point: an LLM does far more than predict the next word based on probability. It is also learning a model of our real world, of which text is merely a projection. Here is a transcript of the relevant part of the video:
You can think of it this way: when we train a huge neural network to accurately predict the next word across all kinds of text on the Internet, we are actually learning a “world model.” On the surface, it looks like we are just learning statistical correlations in text. But in order to learn those statistical correlations accurately and compress the information effectively, the neural network ends up learning a representation of the process that produced the text.
These texts are a projection of the real world; the outside world casts its shadow onto them. As a result, the neural network learns far more than textual information: it learns about the world, about people’s emotional states, their hopes, dreams, and motivations, their interactions, and the environment we live in. What the neural network acquires is a compressed, abstract, and usable representation of all this. That is the knowledge gained by accurately predicting the next word.
Going a step further, the more accurately we can predict the next word, the higher the fidelity and resolution of that world model. That is the task of the pre-training phase. However, this stage does not dictate the specific behavior we want the neural network to exhibit. What a language model is really trying to answer is the following question: if I found a random piece of text on the Internet that starts with a certain prefix, a certain prompt, what would it continue into?
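The prefix-completion objective described here can be sketched with a toy stand-in: a bigram count model takes the place of the neural network, and the tiny corpus and helper names are purely illustrative, not anyone's actual training setup.

```python
from collections import Counter, defaultdict

# Toy illustration of the pre-training objective: learn
# P(next word | previous word) from raw text, then complete a prompt.
# A bigram counter stands in for the neural network.

corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count how often each word follows each preceding word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent next word after `word`."""
    return counts[word].most_common(1)[0][0]

def complete(prompt, n_words=4):
    """Greedily extend a prompt, one predicted word at a time."""
    words = prompt.split()
    for _ in range(n_words):
        words.append(predict_next(words[-1]))
    return " ".join(words)

print(complete("the cat"))
```

A real LLM replaces the count table with a learned function conditioned on the whole prefix, but the objective, predicting the most likely continuation of found text, is the same shape.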
But that is different from what I actually want: an honest assistant, a helpful assistant, an assistant that follows certain rules and does not break them. That requires additional training. This is the stage where we do fine-tuning and reinforcement learning, drawing on human teachers as well as other forms of AI assistance. It is not just reinforcement learning from human teachers, but also reinforcement learning from human-AI collaboration: our teachers work alongside an AI to teach our model how to behave.
But here we are not teaching it new knowledge; we are communicating with it, telling it what we want it to be. This second stage is also extremely important: the better we do it, the more useful and reliable the neural network will be. And it builds on the first stage, which learns as much as possible about the world from the world’s projection into text.
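The second stage Ilya describes, steering an already-knowledgeable model toward preferred behavior, can be sketched in miniature. The candidate replies, reward values, and update rule below are illustrative assumptions, not any lab's actual RLHF recipe; the reweighting used is the closed form of a KL-regularized reward objective.

```python
import math

# Base (pre-trained) model: a distribution over candidate replies.
base = {
    "I don't know, ask someone else.": 0.5,
    "Here is a careful, honest answer.": 0.3,
    "Made-up confident nonsense.": 0.2,
}

# Human-teacher feedback: positive for helpful/honest replies,
# negative for fabrication. (Hypothetical values.)
reward = {
    "I don't know, ask someone else.": 0.0,
    "Here is a careful, honest answer.": 1.0,
    "Made-up confident nonsense.": -1.0,
}

def finetune(base, reward, beta=2.0):
    """Reweight the base distribution by exp(beta * reward) and
    renormalize; beta controls how far we move from the base model."""
    weights = {r: p * math.exp(beta * reward[r]) for r, p in base.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

tuned = finetune(base, reward)
print(max(tuned, key=tuned.get))
```

Note what the sketch preserves from the transcript: no new knowledge enters the model; the same replies exist before and after, and only their probabilities shift toward what the teachers prefer.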