GameNGen: A game engine driven entirely by neural models that generates game graphics in real time based on player actions
GameNGen, developed by Google DeepMind, is a game engine driven entirely by neural models that can simulate complex game environments in real time. Players interact with the game world just as they would with a traditional game engine, but the graphics come from neural networks instead of pre-made images or animations.
GameNGen generates game frames in real time as you play: every frame you see is produced on the fly rather than stored in advance. You interact through a keyboard or controller as in a normal game, and GameNGen generates the next frame based on your input. The result is so close to the real game that the frames hardly feel AI-generated; it plays like the real thing.
Main Features of GameNGen
1. Real-time game simulation
Neural model driven: Unlike traditional game engines, GameNGen relies entirely on a neural network to generate game frames. Specifically, it uses a diffusion model that predicts the next frame from previous game frames and the player’s actions.
Real-time simulation: GameNGen simulates the classic game “DOOM” in real time at more than 20 frames per second, generating frames while the game is in progress and responding to the player’s input as it arrives.
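As a concrete illustration, here is a minimal sketch of such a neural game loop in Python. The generate_next_frame, read_input, and display functions are hypothetical stand-ins, not GameNGen’s actual API:

```python
# Minimal sketch of a neural game loop (hypothetical API, not GameNGen's code).
# Every frame shown to the player is sampled from the model, conditioned on
# the recent frame history and the latest input.
import time

def game_loop(model, read_input, display, fps=20, history_len=64):
    history = []                 # past (frame, action) pairs the model conditions on
    frame_budget = 1.0 / fps     # ~50 ms per frame at 20 fps
    while True:
        start = time.time()
        action = read_input()                                # keyboard / controller state
        frame = model.generate_next_frame(history, action)   # one diffusion sample
        display(frame)
        history = (history + [(frame, action)])[-history_len:]
        # Sleep off any leftover budget so the simulation stays at ~20 fps.
        time.sleep(max(0.0, frame_budget - (time.time() - start)))
```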
2. High-quality image generation
Using a generative diffusion model, GameNGen produces high-quality game frames, reaching a peak signal-to-noise ratio (PSNR) of 29.4, comparable to lossy JPEG compression. This means the generated frames are visually close to those of the real game.
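For reference, PSNR compares a generated frame against the ground-truth frame; the snippet below computes it for 8-bit images. This is the textbook formula, not GameNGen-specific code:

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two images, in dB."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A PSNR around 29.4 dB means the mean squared error is roughly on the
# order of typical lossy JPEG compression artifacts.
```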
3. Autoregressive generation
GameNGen predicts and generates game frames autoregressively, one after another, so that even long game sequences maintain image quality. To prevent quality from degrading over long rollouts, the system introduces a noise-augmentation technique (detailed under Technical Principle below).
4. Data Generation Combined with Reinforcement Learning
GameNGen obtains human-like gameplay data by training a reinforcement learning (RL) agent to play the game. This data is used to train the generative model, enabling efficient and realistic simulation across a variety of scenarios.
5. Latent Decoder Fine-tuning
To improve the detail of generated frames, especially the in-game HUD, GameNGen fine-tunes the latent decoder of Stable Diffusion, ensuring the accuracy and clarity of the output images.
6. Scalability and future potential
GameNGen demonstrates that a game engine can be built on neural network models, suggesting that future game development may shift toward games expressed as neural network weights rather than hand-written code. This approach could lower the cost of game development and make game creation more accessible.
Technical Principle of GameNGen
1. Generative Diffusion Model
Diffusion model: GameNGen uses a diffusion model to generate each game frame. A diffusion model is a type of generative model that recovers a clear image from random noise by gradually removing the noise. GameNGen is built on Stable Diffusion v1.4, with the text conditioning removed; instead, the model is conditioned on a history of frames and actions to predict the next frame of the game.
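A PyTorch-style sketch of this conditioning scheme is below. The paper describes concatenating encoded past frames along the latent channel dimension and replacing the text embeddings with action embeddings; the module names, shapes, and embedding size here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ActionConditionedUNetWrapper(nn.Module):
    """Sketch: condition a Stable-Diffusion-style UNet on past frames + actions.

    Illustrative assumptions: `unet` takes (latents, t, encoder_hidden_states),
    latent frames have 4 channels, and the UNet's first conv has been widened
    to accept the extra context channels.
    """
    def __init__(self, unet, num_actions: int, embed_dim: int = 768):
        super().__init__()
        self.unet = unet
        # Action embeddings stand in for the text-encoder output as
        # cross-attention conditioning.
        self.action_embed = nn.Embedding(num_actions, embed_dim)

    def forward(self, noisy_latent, t, past_latents, past_actions):
        # past_latents: (B, N, 4, H, W) -> stacked into channels next to the
        # noised target latent.
        b, n, c, h, w = past_latents.shape
        x = torch.cat([noisy_latent, past_latents.reshape(b, n * c, h, w)], dim=1)
        # past_actions: (B, N) integer ids -> (B, N, embed_dim) sequence.
        cond = self.action_embed(past_actions)
        return self.unet(x, t, encoder_hidden_states=cond)
```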
Denoising process: The model runs a multi-step denoising process when generating an image. To speed this up while keeping quality high, GameNGen uses DDIM (Denoising Diffusion Implicit Models) sampling, which produces clear images with only a small number of denoising steps (e.g., 4).
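For illustration, this is what few-step DDIM sampling looks like with the Hugging Face diffusers library on stock Stable Diffusion v1.4 (which is still text-conditioned; GameNGen’s own model conditions on frames and actions instead):

```python
# Few-step DDIM sampling with Hugging Face diffusers (illustrative only).
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# Swap in the DDIM scheduler, which tolerates very few denoising steps.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe("a corridor in a retro first-person shooter",
             num_inference_steps=4).images[0]  # only 4 denoising steps
image.save("frame.png")
```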
2. Autoregressive generation
Autoregressive prediction and noise augmentation: In the game simulation, each frame is generated from previously generated frames and the player’s actions. GameNGen encodes past frames and actions as latent variables and feeds them into the model to produce the next frame. To prevent drift during generation (i.e., the model’s output quality gradually degrading over time), noise augmentation is introduced: during training, varying amounts of Gaussian noise are added to past frames, and the model learns to recover the original information from the corrupted data, which improves stability over long generations.
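A minimal PyTorch sketch of this noise augmentation during training; the shapes and the maximum noise level are chosen for illustration:

```python
import torch

def corrupt_context(past_latents: torch.Tensor, max_noise: float = 0.7):
    """Add a random amount of Gaussian noise to each context sequence.

    past_latents: (B, N, C, H, W). Returns the corrupted frames plus the
    sampled noise level, which is also fed to the model (e.g. as an
    embedding) so it knows how corrupted its context is.
    """
    b = past_latents.shape[0]
    # One noise level per sample, broadcast over its context frames.
    level = torch.rand(b, 1, 1, 1, 1, device=past_latents.device) * max_noise
    noisy = past_latents + level * torch.randn_like(past_latents)
    return noisy, level.flatten()

# During training the model sees corrupted histories but must still predict a
# clean next frame; at inference this makes it robust to its own small errors
# accumulating over long autoregressive rollouts.
```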
3. Data Collection and Reinforcement Learning
RL agent training: To generate training data, GameNGen first trains a reinforcement learning agent to play the game. The agent’s purpose is not to achieve the highest score but to produce diverse, human-like gameplay. This data is then used to train the diffusion model so it can simulate the game’s many situations effectively.
Data generation and recording: During agent training, all game trajectories (the agent’s actions and the corresponding game frames) are recorded, forming a large-scale dataset that provides rich training samples for the diffusion model.
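A sketch of such a recording loop, assuming a Gym-style environment interface (the actual pipeline around the paper’s ViZDoom setup will differ):

```python
# Sketch of trajectory recording during RL training (Gym-style env API assumed).
def record_episode(env, agent, dataset):
    obs, done = env.reset(), False
    while not done:
        action = agent.act(obs)                        # agent plays to produce
        next_obs, reward, done, _ = env.step(action)   # diverse trajectories
        dataset.append({"frame": obs, "action": action})  # (frame, action) pair
        obs = next_obs
    return dataset
```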
4. Latent Decoder Fine-tuning
Fine-tuning the decoder: The latent-space decoder of the Stable Diffusion model is fine-tuned for game frames. The original decoder was designed for general images, while game frames contain many fine elements (such as the HUD) that demand higher detail fidelity. After fine-tuning, the decoder handles these in-game details better, and the generated images are clearer and more accurate.
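A sketch of one decoder fine-tuning step, using a pixel-space MSE reconstruction loss; the training specifics here are illustrative:

```python
import torch
import torch.nn.functional as F

def decoder_finetune_step(vae, frames, optimizer):
    """One fine-tuning step for the latent decoder (sketch).

    Only the decoder is updated (build `optimizer` over vae.decoder.parameters());
    the encoder stays frozen so latents remain compatible with the diffusion
    model. `vae` is a Stable-Diffusion-style autoencoder with encode()/decode();
    `frames` are pixel tensors in [-1, 1].
    """
    with torch.no_grad():
        latents = vae.encode(frames).latent_dist.sample()  # frozen encoder
    recon = vae.decode(latents).sample                     # trainable decoder
    loss = F.mse_loss(recon, frames)  # pixel-space MSE sharpens fine HUD detail
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```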
5. Model Inference and Optimization
Inference optimization: During inference, GameNGen uses a small number of denoising steps to speed up generation. To maintain quality with even fewer steps, the authors also explored improving single-step generation through model distillation: a simplified model is trained to produce images close to the quality of the full multi-step model in one or two steps, enabling higher frame rates.
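A generic sketch of this kind of step distillation is below; the exact recipe is an assumption, since many distillation variants exist:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher_sample, context, optimizer):
    """One distillation step (generic sketch, not the paper's exact recipe).

    `teacher_sample(context)` runs the full multi-step DDIM sampler to get a
    high-quality frame; the student learns to match it in a single pass.
    """
    with torch.no_grad():
        target = teacher_sample(context)   # e.g. the 4-step DDIM output
    pred = student(context)                # single forward pass
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```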
……
For more info ↓
More about AI: https://kcgod.com