Fei-Fei Li: With Spatial Intelligence, Artificial Intelligence Will Understand the Real World
About 540 million years ago, the Earth was shrouded in darkness, not because there was no light, but because lifeforms lacked the ability to see. The appearance of trilobites marked the beginning of vision, revolutionizing life and triggering the Cambrian explosion. Fast-forward to today, and the evolution of vision has inspired advances in artificial intelligence (AI), particularly computer vision and spatial intelligence.
Key points:
- The world before vision:
- 540 million years ago, no living creature had the ability to see, despite the presence of light.
- The appearance of trilobites introduced the concept of vision, enabling the first organisms to perceive light.
- Cambrian Explosion:
- The development of vision triggered the Cambrian Explosion, a period of rapid evolutionary diversification.
- Vision shifted from the passive reception of light to an active process involving the nervous system, giving rise to intelligence.
- Advances in Computer Vision:
- Nine years ago, major advances in computer vision were reported, emphasizing the convergence of neural networks, GPUs, and big data.
- The ImageNet project spent years curating 15 million images, laying the groundwork for the modern AI era.
- From image labeling to generative models:
- At first, labeling images was a major breakthrough, but algorithms rapidly improved in speed and accuracy.
- The annual ImageNet Challenge has tracked this progress, and researchers have gone further, creating algorithms that can segment objects or predict the dynamic relationships among them.
- The Rise of Generative AI:
- With the help of diffusion models, generative AI algorithms are able to transform human-prompted sentences into brand-new photos and videos.
- Early computer vision algorithms could describe photographs in natural language; today's models can reverse the process, turning text into images and video.
- The Future of Spatial Intelligence:
- Spatial Intelligence teaches computers to see, learn, and do better in 3D space.
- Researchers at Google have developed an algorithm that can turn a set of photos into a 3D scene, and further research can generate 3D shapes from a single image.
- Robot Learning and the BEHAVIOR Database:
- Simulated environments built from 3D spatial models train computers and robots to act in a 3D world.
- The BEHAVIOR project shows how a robot can be taught to perform a variety of tasks, such as opening a drawer or unplugging a charging cell phone.
- Applications in Health:
- AI applications in healthcare include smart sensors that detect when healthcare workers fail to wash their hands properly and systems that track surgical instruments.
- Research on controlling robots through brainwaves to perform daily tasks for severely paralyzed patients.
- The Far-Reaching Impact of Spatial Intelligence:
- Spatial Intelligence, which enables machines to interact with humans and the 3D world, will have a far-reaching impact on the future.
- By placing humans at the center of technological development, spatial intelligence promises to be a useful tool and a trusted partner, enhancing productivity and human dignity.
Text version of the full article
The Evolution of Vision and the Future of Artificial Intelligence
About 540 million years ago, the Earth was shrouded in darkness. This was not because of a lack of light, but because lifeforms had not yet developed the ability to see. Although sunlight penetrated the oceans to depths of 1,000 meters, and hydrothermal vents on the seafloor emitted light around which life flourished, not a single eye could be found in these ancient oceans: no retina, no cornea, no lens. All of that light and all of that life went unseen. The very concept of seeing did not yet exist; the ability would not be realized until the day it evolved.
Trilobites and the Birth of Sight
For reasons we are only beginning to understand, trilobites emerged as the first creatures capable of perceiving light. They were the first to register that there was something beyond the self: a world of many other selves. The birth of sight is thought to have triggered the Cambrian explosion, a period in which a large number of animal species suddenly appear in the fossil record. Vision began as a passive experience, simply letting light in, but it soon became more active. The nervous system began to evolve; sight turned into insight, seeing became understanding, and understanding led to action. All of this gave rise to intelligence.
Advances in Computer Vision
Today, we’re no longer satisfied with the visual intelligence that nature gave us. Curiosity has driven us to create machines that can ‘see’ at least as well as we do, and even more intelligently. Nine years ago, on this same stage, I delivered an early progress report on computer vision. At the time, three powerful forces were converging for the first time: a family of algorithms called neural networks; fast, specialized hardware known as graphics processing units (GPUs); and big data, such as ImageNet, a set of 15 million images that my lab had spent years curating. Together, they ushered in the modern era of AI.
We’ve come a long way. Back then, simply labeling images was a major breakthrough, but the speed and accuracy of these algorithms improved rapidly. The annual ImageNet Challenge measured this progress, and my students and collaborators went even further, creating algorithms that can segment objects and predict the dynamic relationships among them.
The Rise of Generative AI
What’s more, think back to my last demonstration: the first computer vision algorithm capable of describing photographs in natural human language, built with my brilliant former student Andrej Karpathy. At the time, I tentatively asked him, ‘Andrej, can we get the computer to do the opposite?’ He laughed and said, ‘Haha, that’s impossible.’ As you can now see, the impossible has recently become possible, thanks to a class of algorithms called diffusion models, which drive today’s generative AI and can transform human-prompted sentences into brand-new photos and videos. Many of you have seen the impressive recent results from OpenAI’s Sora, but even without an enormous number of GPUs, my students and collaborators developed a generative video model called W.A.L.T that predates Sora by a few months. What you’re seeing now are some of its results.
Of course, there’s room for improvement. Look at the cat’s eyes, and the way the cat glides under the wave without ever getting wet; a bit of a ‘cat-astrophe’.
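To make the idea behind diffusion models a little more concrete, here is a minimal, self-contained sketch of the reverse (denoising) loop they rely on: start from pure noise and repeatedly subtract the noise a trained network predicts. Everything here is illustrative; `predict_noise` is a random stand-in for a trained model, and this is not the actual Sora or W.A.L.T code.

```python
# Minimal sketch of diffusion-model sampling (DDPM-style reverse loop).
# `predict_noise` stands in for a trained neural network; real text-to-video
# systems are far more elaborate.
import numpy as np

T = 1000                                    # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)          # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)             # cumulative products of alphas

def predict_noise(x_t, t, prompt):
    """Stand-in for a trained network that predicts the noise added at step t,
    conditioned on a text prompt. Here: a random guess, for illustration only."""
    rng = np.random.default_rng(t)
    return rng.standard_normal(x_t.shape)

def sample(prompt, shape=(64, 64, 3)):
    """Start from pure noise and iteratively denoise toward an image."""
    x = np.random.standard_normal(shape)    # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = predict_noise(x, t, prompt)
        # DDPM posterior mean: remove the predicted noise component.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # add fresh noise except at the last step
            x += np.sqrt(betas[t]) * np.random.standard_normal(shape)
    return x

image = sample("a cat surfing under a wave")
```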
The Future of Spatial Intelligence
For years I’ve been saying that taking a picture is not the same as seeing and understanding. Today, I would add that seeing is not enough either. Seeing is for doing and learning. As we act in a world of three-dimensional space and time, we learn, and we learn to see and act better. Nature has created a virtuous cycle driven by ‘spatial intelligence’.
To show you what your spatial intelligence is doing constantly, look at this picture. Raise your hand if you feel the urge to do something.
In less than a second, your brain sees the geometry of this glass of water, its position in three-dimensional space, and its relationship to the table, the cat, and everything else. You can predict what will happen next. The impulse to act is inherent in all spatially intelligent life forms, and it links perception to action.
If we want AI to go beyond its current capabilities, we need more than just AI that can see and speak; we need AI that can act.
In fact, we’re making exciting progress in spatial intelligence. Its most recent milestone is teaching computers to see, learn, and do, and to keep learning to see and do better. This has not been easy. It took nature millions of years to evolve spatial intelligence, which relies on the eye capturing light, projecting a two-dimensional image onto the retina, and the brain converting that data into three-dimensional information. Recently, a group of researchers at Google developed an algorithm that can turn a set of photos into a three-dimensional scene, like the example shown here. My students and collaborators went a step further, creating an algorithm that can generate 3D shapes from a single input image; here are more examples. Recall the computer programs that can translate human sentences into video: a group of researchers at the University of Michigan found a way to translate sentences into three-dimensional room layouts, as shown here. And my colleagues at Stanford, together with their students, have developed an algorithm that takes a single image and generates an endless variety of spaces for viewers to explore.
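The photos-to-3D results described above rest on a simple rendering rule popularized by neural radiance fields (NeRF) and related methods: march along each camera ray, query a learned field for density and color, and alpha-composite the samples into a pixel. The sketch below illustrates that rule with a toy `field` standing in for a trained network; it is not the actual code from any of the groups mentioned.

```python
# Minimal sketch of NeRF-style volume rendering along one camera ray.
# `field` is a toy stand-in for a learned network mapping 3D points
# to (density, color).
import numpy as np

def field(points):
    """Toy radiance field: a fuzzy sphere of radius 1 at the origin.
    Returns density sigma and RGB color for each query point."""
    r = np.linalg.norm(points, axis=-1)
    sigma = np.where(r < 1.0, 5.0, 0.0)           # opaque inside the sphere
    color = np.stack([r, 1.0 - r, np.full_like(r, 0.5)], axis=-1)
    return sigma, np.clip(color, 0.0, 1.0)

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    """Alpha-composite samples along one camera ray into a pixel color."""
    t = np.linspace(near, far, n_samples)
    delta = t[1] - t[0]                            # spacing between samples
    points = origin + t[:, None] * direction       # 3D sample positions
    sigma, color = field(points)
    alpha = 1.0 - np.exp(-sigma * delta)           # opacity of each segment
    # Transmittance: probability that light survives to reach each sample.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * color).sum(axis=0)  # final pixel color

pixel = render_ray(origin=np.array([0.0, 0.0, -3.0]),
                   direction=np.array([0.0, 0.0, 1.0]))
```

Training such a system amounts to adjusting the field so that rendered pixels match the input photographs from every known camera pose, which is what lets a handful of 2D images yield a consistent 3D scene.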
Robot Learning and the BEHAVIOR Database
These are the first signs of future possibilities: a future in which humans can translate the entire world into digital form and model its richness and nuance. Nature does this implicitly in each of our minds; spatial intelligence technologies aim to do it for our collective consciousness.
A new era in this virtuous cycle is unfolding before our eyes as advances in spatial intelligence accelerate. This feedback loop is catalyzing robotic learning, a key component of any embodied intelligent system that needs to understand and interact with the three-dimensional world.
A decade ago, my lab launched ImageNet, a database of millions of high-quality photos, to help train computers to see. Today, we’re doing the same thing with behaviors and actions, to train computers and robots how to act in the 3D world. But unlike collecting static images, we build simulated environments driven by 3D spatial models, so that computers can learn to act amid an infinite variety of possibilities. What you’re seeing now is just a small sample of the tasks used to teach our robots, drawn from a project led by my lab called BEHAVIOR.
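The training recipe behind such simulated environments can be sketched generically: an agent observes the simulated scene, acts, and learns from the outcome, repeated across many episodes. In the sketch below, `DrawerEnv` and `Policy` are hypothetical toys for illustration, not the BEHAVIOR benchmark’s actual API.

```python
# Generic sketch of the simulate-act-learn loop used to train embodied agents.
# `DrawerEnv` and `Policy` are toy stand-ins, not the BEHAVIOR API.
import random

class DrawerEnv:
    """Toy one-dimensional world: the agent must pull a drawer fully open."""
    def reset(self):
        self.openness = 0.0                 # 0 = closed, 1 = fully open
        return self.openness

    def step(self, action):
        self.openness = min(1.0, max(0.0, self.openness + action))
        reward = 1.0 if self.openness >= 1.0 else 0.0
        done = reward > 0.0
        return self.openness, reward, done

class Policy:
    """Toy policy: pull by a random amount; a real agent would learn this."""
    def act(self, obs):
        return random.uniform(0.0, 0.3)

env, policy = DrawerEnv(), Policy()
for episode in range(5):
    obs, done, steps = env.reset(), False, 0
    while not done and steps < 50:
        obs, reward, done = env.step(policy.act(obs))
        steps += 1
    print(f"episode {episode}: drawer opened in {steps} steps" if done
          else f"episode {episode}: failed")
```

The appeal of simulation is that this loop can run millions of times, with randomized scenes and tasks, far faster and more safely than on physical hardware.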
We are also making exciting progress in robotic language intelligence. Using inputs from large language models, my students and collaborators were among the first teams to demonstrate a robot arm performing a wide range of tasks from verbal commands, such as opening a drawer, unplugging a charging cell phone, or even making a sandwich with bread, lettuce, and tomatoes, and then laying out a napkin for the user. The sandwich itself leaves a little to be desired, but this is a good start.
Applications in Health
In that primordial ocean of our distant past, the emergence of the ability to see triggered a Cambrian explosion of life forms. Today, that light is reaching digital minds. Spatial intelligence is enabling machines to interact not only with one another, but also with humans and with the real or virtual three-dimensional world. As this future takes shape, it will profoundly affect many lives. Take healthcare: over the past decade, my lab has taken the first steps in applying AI to the challenges that affect patient outcomes and provider burnout. With collaborators at Stanford Medical School and its partner hospitals, we’re piloting smart sensors that can detect when healthcare workers fail to wash their hands properly before entering a room, track surgical instruments, and alert care teams when a patient is at physical risk, such as of a fall. We think of these technologies as a kind of ambient intelligence, like extra pairs of eyes that really do make a difference. But I would like to see more interactive help for our patients, clinicians, and caregivers, who desperately need an extra pair of hands. Imagine an autonomous robot transporting medical supplies while providers focus on patients, or augmented reality guiding surgeons through safer, faster, less invasive procedures.
Or imagine a severely paralyzed patient controlling a robot with their brainwaves to perform everyday tasks. What you’re seeing is a glimpse of that future, from a recent pilot study in my lab. In this video, a robot arm, controlled solely by EEG signals collected non-invasively through an EEG cap, is cooking a Japanese sukiyaki meal.
The advent of vision turned a dark world upside down half a billion years ago and triggered the most profound evolutionary process of all: the development of intelligence in the animal world. AI’s progress over the past decade has been just as astonishing. But I believe the full potential of this digital Cambrian explosion will be realized only when we endow computers and robots with spatial intelligence, just as nature did for us.
It’s an exciting time as we teach our digital companions to reason and interact in the beautiful three-dimensional space we call home, and to create new worlds we can explore together. Realizing this future will not be easy. It will require all of us to take thoughtful steps and to develop technologies that always put humans at the center. But if we get it right, computers and robots with spatial intelligence will be not just useful tools but trusted companions that enhance our productivity and our humanity, honoring our individual dignity and elevating our collective prosperity.
The future I’m most excited about is one in which AI grows sharper, more insightful, and more spatially aware, joining us in the quest to keep creating a better world.
Original video: https://www.ted.com/