World Model: Mind Over Matter

AI companies want to build machines that understand the physical world. If they succeed, it will be the most consequential leap in the technology since the transformer.

CHUPPALA NAGESH BHUSHAN | Jun 12th 2026 | HYDERABAD

IMAGINE asking someone for directions in an unfamiliar city. If they know the place, they can improvise, reroute around a closed street and adapt on the fly. If they are merely repeating a memorised script, a single detour leaves them helpless. Today's AI systems, for all their dazzling fluency, are closer to the second kind of navigator. A new generation of research aims to produce the first.

The idea goes by the name of a world model. In its simplest form, it is an internal simulation of reality: a mental map of how objects move, how causes produce effects and how actions ripple through an environment. Humans and animals build such models continuously; they are why you do not need to stub your toe twice on the same furniture leg in the dark. Getting machines to do the same has proved surprisingly hard. Now, with billions of dollars flowing into the effort and several of the field's most celebrated researchers staking their reputations on it, the attempt is entering a new phase.

The timing is not coincidental. Large language models (LLMs) — the technology behind ChatGPT, Claude and their kin — have plateaued in certain respects, prompting investors and researchers alike to ask what comes next. World models are the most plausible answer many of them have found. The stakes could hardly be higher: success would unlock AI's entry into the physical world — into factories, hospitals, roads and homes.

"Building an AI that can compose a novel is far easier than one that can fold laundry. To bridge that gap, you need something called a world model."

THE LIMITS OF LANGUAGE

LLMs work by predicting the next token in a sequence, a process that, when scaled to hundreds of billions of parameters and trained on most of the written internet, produces uncanny results. They can summarise documents, write code, answer questions and hold extended conversations. What they cannot reliably do is reason about the physical world.
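To make "predicting the next token" concrete, here is a deliberately tiny sketch. It is not how a production LLM works: it swaps the neural network for a simple table of word-pair counts (a bigram model), and the corpus and function names are invented for illustration. The principle it shows is the same, though: the model learns which token tends to follow which, and nothing about the objects the words describe.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical toy corpus, assumed purely for illustration.
corpus = "the glass slides off the table and the glass breaks".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))  # the most common follower of "the"
print(predict_next(model, "ice"))  # a word the model never saw
```

The second call returns nothing useful: a model built only on observed sequences has no fallback when the input leaves its training data.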

A striking illustration comes from a study that trained a language model on a database of simulated New York City taxi trips. The model could give accurate directions between Manhattan addresses — until it was forced to make an occasional detour, at which point it failed entirely. The implication is uncomfortable: the model had learned statistical regularities in route descriptions, not an actual map of the city. Change the conditions slightly and the illusion evaporates.
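The difference between memorised routes and an actual map can be sketched in a few lines. Everything here is an assumption made for illustration, not a reconstruction of the study: a hypothetical 3x3 street grid, one memorised route, and a breadth-first search standing in for "having a map".

```python
from collections import deque

GRID = 3  # hypothetical 3x3 grid of intersections, addressed as (row, col)

def neighbours(node, closed):
    """Yield adjacent intersections reachable by an open street."""
    r, c = node
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (r + dr, c + dc)
        if 0 <= nxt[0] < GRID and 0 <= nxt[1] < GRID:
            if frozenset((node, nxt)) not in closed:
                yield nxt

def plan_with_map(start, goal, closed=frozenset()):
    """Breadth-first search over the street graph; returns a path or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in neighbours(node, closed):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

# The "script" navigator knows exactly one memorised route per trip.
MEMORISED = {((0, 0), (0, 2)): [(0, 0), (0, 1), (0, 2)]}

def plan_from_script(start, goal, closed=frozenset()):
    """Replay the memorised route; give up if any street on it is closed."""
    route = MEMORISED.get((start, goal))
    if route is None:
        return None
    for a, b in zip(route, route[1:]):
        if frozenset((a, b)) in closed:
            return None  # the script has no alternative to offer
    return route

# Close the first street on the memorised route.
closed = frozenset({frozenset(((0, 0), (0, 1)))})
print(plan_from_script((0, 0), (0, 2), closed))  # no route
print(plan_with_map((0, 0), (0, 2), closed))     # a detour around the closure
```

Both navigators give the same answer while every street is open. Only the closure reveals which one holds a representation of the city and which one holds a list of answers.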

This brittleness matters enormously for applications beyond text generation. A robot arm that has "read" every manual ever written about grasping objects still has no intuitive sense of friction, weight or the way a glass will slide if tilted past a critical angle. An autonomous vehicle that has processed millions of descriptions of traffic cannot feel the difference between ice and tarmac. Language, it turns out, is a very lossy encoding of physical reality.

A CROWDED STARTING LINE

The competitive landscape has shifted rapidly. Google DeepMind's Genie series has progressed from generating simple interactive game environments in 2024 to producing photorealistic, real-time three-dimensional worlds from text prompts — rendered at 24 frames per second — by the middle of 2025. In February 2026, Waymo, the autonomous-driving arm of Alphabet, adopted Genie 3 to build a specialised world model for driving simulation, one capable of generating the rare, dangerous edge cases — a child running into the road, a lorry jackknifed across three lanes — that real-world fleets encounter too infrequently to train on.

Nvidia entered the fray with Cosmos, launched at the Consumer Electronics Show in January 2025. The platform, trained on nine thousand trillion tokens drawn from 20 million hours of real-world footage spanning industrial settings, human interactions and driving scenarios, generates physics-aware videos that predict how environments will evolve. Within a year, the models had been downloaded more than two million times, adopted by robotics and automotive companies hungry for synthetic training data.

Meta's contribution, V-JEPA 2, takes a different approach. Rather than generating video, it learns joint embeddings — compressed representations — of visual scenes, allowing it to anticipate the outcomes of actions without simulating every pixel. The result is a system that can perform physical reasoning at a fraction of the computational cost of full generative models.

Perhaps the most consequential development, at least symbolically, was the departure of Yann LeCun — one of the three researchers most often credited with founding modern deep learning — from Meta in December 2025 to start his own company. Advanced Machine Intelligence (AMI) Labs, headquartered in Paris, entered fundraising conversations seeking €500m at a €3bn valuation before releasing a single product. The ambition, as LeCun describes it, is to build AI systems that understand physics, maintain persistent memory across time and plan complex sequences of actions — rather than predicting the next word in a sentence. Fei-Fei Li, the Stanford professor who co-invented ImageNet, is pursuing a parallel track at World Labs, whose Marble software generates interactive three-dimensional environments from text, images and video.

THE KEY PLAYERS IN THE WORLD-MODEL RACE

Google DeepMind (Genie 3) — photorealistic interactive 3D worlds; adopted by Waymo for driving simulation

Nvidia (Cosmos) — physics-aware video prediction; 2m+ downloads; targets robotics and autonomous vehicles

Meta (V-JEPA 2) — joint-embedding model for physical reasoning without full video generation

AMI Labs (Yann LeCun) — physics, persistent memory, long-horizon planning; €3bn pre-launch valuation

World Labs (Fei-Fei Li) — spatial intelligence; Marble generates 3D worlds from multimodal prompts

OpenAI — redirected Sora video resources toward "longer-term world simulation research"

FROM THE FACTORY FLOOR TO THE OPERATING THEATRE

The industries eyeing world models most keenly are those where the cost of failure is high, variability is enormous and annotated training data is scarce — which is to say, most of the physical economy.

In manufacturing, the promise is a step-change in robotic dexterity. Today's factory robots are marvels of precision within tightly controlled environments, but they are hopeless in the face of the unexpected: a box placed at the wrong angle, a part with a slightly different surface finish, a conveyor belt running fractionally too fast. A robot equipped with a world model could reason about these deviations in real time, adjusting its grip, trajectory and force without needing to be explicitly reprogrammed. BMW, Siemens and several Japanese electronics manufacturers are among those funding research in this direction.

Healthcare offers equally compelling possibilities. Surgical robots already assist in tens of thousands of procedures every year, but they operate under close human supervision. A world model capable of representing the three-dimensional structure of tissue, predicting how organs shift under gentle pressure and anticipating the effect of each incision could, in principle, extend the robot's useful autonomy — reducing surgeon fatigue, improving consistency and enabling complex procedures in facilities that currently lack specialist staff. The regulatory hurdles are formidable, but the clinical incentive is real.

Agriculture is another frontier. Harvesting robots struggle with the extraordinary variability of fruit and vegetables: different sizes, unexpected angles, stems tangled with foliage. World models trained on vast libraries of plant behaviour — how a tomato plant bends under load, how ripeness correlates with surface texture under different lighting — could transform the economics of mechanical harvesting, particularly as rural labour shortages intensify across rich countries.

In construction, world models could allow autonomous machinery to navigate the organised chaos of a building site, where the environment changes daily, obstacles appear without warning and human workers move through the space unpredictably. Several large contractors in Japan, a country simultaneously at the frontier of robotics and burdened by an acute construction-worker shortage, are already piloting such systems.

The most visible near-term application, however, is the one with the most money behind it: autonomous vehicles. The central problem of self-driving has always been the long tail — the improbable, dangerous scenarios that are encountered rarely in real life but must be handled safely regardless. World models offer a solution: generate millions of synthetic edge cases, validate them against physics-based simulation and train the driving stack against them. Waymo's adoption of Genie 3 for exactly this purpose is the clearest sign yet that the technology has graduated from research curiosity to engineering tool.
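The generate-then-validate loop can be illustrated with a toy that is far simpler than any generative world model. The sketch below assumes a flat road and the textbook braking formula d = v² / (2μg); the parameter ranges, function names and the definition of "edge case" are all hypothetical choices made for this example, and random sampling stands in for a learned scenario generator.

```python
import random

G = 9.81  # gravitational acceleration, m/s^2

def stopping_distance(speed_ms, friction):
    """Idealised braking distance on a flat road: v^2 / (2 * mu * g)."""
    return speed_ms ** 2 / (2 * friction * G)

def sample_scenario(rng):
    """Draw one hypothetical scenario; the ranges are illustrative assumptions."""
    return {
        "speed_ms": rng.uniform(5.0, 35.0),
        "friction": rng.uniform(0.1, 0.9),  # low values resemble ice, high values dry tarmac
        "obstacle_m": rng.uniform(5.0, 120.0),
    }

def is_edge_case(scenario):
    """Edge case here: the obstacle is closer than the braking distance."""
    needed = stopping_distance(scenario["speed_ms"], scenario["friction"])
    return needed > scenario["obstacle_m"]

def generate_edge_cases(n, seed=0):
    """Sample n scenarios and keep only those the physics check flags."""
    rng = random.Random(seed)
    scenarios = (sample_scenario(rng) for _ in range(n))
    return [s for s in scenarios if is_edge_case(s)]

edge_cases = generate_edge_cases(1000)
print(len(edge_cases), "of 1000 sampled scenarios cannot be handled by braking alone")
```

In a real pipeline the sampler would be a learned model producing video or sensor data, and the check would be a full physics simulation rather than a single formula. The division of labour is the point: one component proposes rare situations cheaply, another rejects those that violate physics.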

"The central problem of self-driving has always been the long tail — the scenarios encountered rarely in real life, but which must be handled safely regardless."

FORMIDABLE OBSTACLES REMAIN

For all the excitement, significant hurdles stand between today's world models and the transformative applications their advocates envision. The first is physical fidelity. Current models can generate convincing video of objects moving through space, but their internal representations of physics remain imperfect. They can be fooled by unusual materials, extreme temperatures or interactions that fall outside their training distribution. A robot relying on such a model in a real factory would need extensive safeguards.

The second problem is data. Training a world model requires vast quantities of high-quality video of the physical world — not text scraped from the internet, but footage of objects being manipulated, materials being stressed, machines operating under varied conditions. This data is expensive to collect and difficult to label. Synthetic generation, the approach Nvidia's Cosmos platform takes, offers a partial solution but introduces its own risks: a model trained on synthetic physics will inherit whatever shortcuts and approximations the simulator made.

The third obstacle is sample efficiency. Human children learn an approximate physics of the world in their first year of life, on the basis of relatively little observation. Today's world models require far more data to achieve far shallower understanding. Closing that gap — building systems that generalise rapidly from limited experience — remains one of the hardest open problems in machine learning.

Finally, there is the question of integration. A world model that sits in a server room is not yet a robot that can act on the world. Connecting simulation to control — ensuring that a model's internal predictions are fast enough, accurate enough and reliable enough to guide real-time physical action — is an engineering challenge that has barely begun to be addressed.

A MATTER OF TIME

Demis Hassabis, the chief executive of Google DeepMind, has described world models, alongside memory architectures and long-horizon planning, as the most important remaining obstacles on the path to artificial general intelligence. He is not alone in that assessment. What distinguishes this moment from earlier periods of enthusiasm about physical AI is that several enabling factors have arrived at once: cheap compute, large multimodal datasets, improved video-generation architectures and the commercial pull of industries desperate for automation.

One example, from Pokémon Go, is instructive in its modesty. Niantic, the game's maker, is using billions of images collected by players over a decade to build fragments of a world model that could, eventually, help delivery robots navigate city streets. It is a long way from folding laundry. But it illustrates the gradual accretion of capability that, in AI, has a habit of producing discontinuous jumps.

Not everyone is persuaded the world-model frame is the right one. Some researchers argue that the real bottleneck is not the absence of an internal world representation but the absence of grounded embodied experience — that no amount of video watching will substitute for the proprioceptive feedback loop that develops when an agent actually manipulates objects. On this view, world models are a necessary but insufficient condition for physical AI.

That debate will take years to resolve. In the meantime, the money, the talent and the competitive pressure are all pointing in the same direction. If the current bets pay off, the era in which AI was primarily a phenomenon of screens and text will come to seem, in retrospect, like a brief and narrow opening act. The physical world awaits.
