
Beyond the Hype: The Next Great Race in AI is How We Train Robots

If you’ve asked ChatGPT for a recipe or watched an AI-generated video from Sora, you’ve already experienced the power of modern sequence models. These systems excel at one core task: predicting what comes next. That ability has revolutionized language, art, and media.

But as we step into robotics and autonomous systems, prediction alone is not enough. We need models that can plan, reason, and adapt to the unexpected. This shift—moving from prediction to planning—marks the next great race in AI.

The Two Titans: Next-Token Prediction and Full-Sequence Diffusion

The AI world today is shaped by two fundamentally different training methods. Each carries immense strengths but also a critical weakness, and together they frame the debate over how we might build the next generation of intelligent systems.

The first method is next-token prediction, the technology that powers large language models like ChatGPT. It works like an advanced form of autocomplete: given a sequence of words or “tokens,” it predicts the most probable next word, then the next, and so on, building coherent responses piece by piece. Its greatest strength lies in flexibility. Because it generates variable-length sequences, the same model can answer a question with a single sentence or spin out ten detailed paragraphs. This adaptability makes it invaluable for open-ended tasks such as conversation, writing, and code generation.
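To make this concrete, here is a minimal sketch of that generation loop in Python. The toy model below is just a placeholder, not any real library's API; the point is the shape of the loop: predict a distribution over the next token, sample it, append it, and repeat until a stop token appears.

```python
import random

# Toy stand-in for a trained language model: given the tokens so far,
# return a probability distribution over the next token.
def toy_next_token_distribution(tokens):
    vocab = ["the", "robot", "picks", "up", "a", "box", "<eos>"]
    # A real model would condition on `tokens`; uniform probabilities
    # keep this example self-contained and runnable.
    return {tok: 1.0 / len(vocab) for tok in vocab}

def generate(prompt, max_new_tokens=20):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = toy_next_token_distribution(tokens)
        # Sample the next token according to the model's probabilities.
        next_tok = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
        if next_tok == "<eos>":   # the model decides when to stop,
            break                 # which is why output length is variable
        tokens.append(next_tok)
    return tokens

print(generate(["the", "robot"]))
```

Notice that nothing in the loop looks more than one token ahead; the flexibility and the myopia come from the same place.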

Yet, this very design introduces a flaw. Next-token models are inherently myopic. They excel at making local, step-by-step decisions but lack an innate sense of long-horizon planning. For robotics, this is like navigating a maze by only looking at the floor directly in front of you, without any awareness of the exit. Boston Dynamics’ robots are a good example of where this limitation shows up. They can handle step-by-step movements like running, jumping, or climbing stairs, but orchestrating those movements into complex, multi-minute tasks—such as “go to the storage room, fetch a box, and return”—requires an added layer of planning logic outside the model itself.

The second method is full-sequence diffusion, the backbone of image and video generators like DALL·E 3 and Sora. Instead of predicting outputs token by token, a diffusion model starts with pure noise and gradually refines it into a coherent final result. In this sense, it sculpts the entire sequence all at once. Its defining strength is holistic planning. Because it generates the complete outcome in one sweep, the model has a built-in awareness of the future. You can ask it to produce a video of a wolf gazing at the aurora borealis, and it works backwards from that clear goal to denoise the sequence until the vision materializes.
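The contrast with the loop above is easiest to see in code. The sketch below is a deliberately simplified illustration of the sampling side of a diffusion model (the `denoise_step` stand-in replaces a trained neural denoiser): the output has a fixed length from the very first step, and every refinement pass touches the whole sequence at once.

```python
import numpy as np

SEQ_LEN, DIM, STEPS = 16, 8, 50       # e.g. 16 video frames, 50 denoising steps

# In a real model the "clean" data distribution lives in the network weights;
# a fixed target stands in for it here so the example runs on its own.
target = np.zeros((SEQ_LEN, DIM))

def denoise_step(x, t):
    # Stand-in for the trained denoiser: remove a little noise from the
    # ENTIRE sequence, pulling it toward the data distribution.
    alpha = 1.0 / t
    return (1.0 - alpha) * x + alpha * target

x = np.random.randn(SEQ_LEN, DIM)      # start from pure noise
for t in range(STEPS, 0, -1):          # refine all frames together, step by step
    x = denoise_step(x, t)
# `x` is now the finished, fixed-length output, generated as one whole.
```

The sequence length is decided before the first step and never changes, which is exactly where the rigidity discussed below comes from.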

This kind of “goal awareness” is reminiscent of how self-driving cars plan routes. A vehicle knows the destination and can generate a broad path through traffic lights, turns, and intersections. But unlike a diffusion model, which locks itself into a fixed output, real-world driving demands continuous adaptation. If a pedestrian suddenly steps into the road or another driver cuts across the lane, the car must change its plan instantly. And here is where diffusion’s rigidity becomes a bottleneck: it’s superb for fixed, well-defined goals, but brittle when the environment changes unexpectedly.

The Trade-Off in Robotics and Vision

When these methods are applied to robotics or computer vision, their contrasting strengths create a dilemma. Next-token models offer flexibility but lack foresight, while diffusion models excel at goal awareness but lack adaptability.

In practice, what we want is a robot capable of both. Imagine a household assistant tasked with setting the dinner table. It needs long-term planning to decide where the plates, glasses, and utensils should go. But it also requires flexibility: what if one plate is chipped, or a glass slips from its grip? Balancing these competing needs defines the frontier of robotic intelligence.

Side by side, the trade-off looks like this:

| Feature          | Next-Token Prediction              | Full-Sequence Diffusion                   |
|------------------|------------------------------------|-------------------------------------------|
| Sequence Length  | Variable & Flexible                | Fixed & Pre-Defined                       |
| Planning Horizon | Short-Sighted (Myopic)             | Long-Horizon (Goal-Aware)                 |
| Core Strength    | Adaptive, step-by-step generation  | Holistic, goal-conditioned creation       |
| Best For         | Open-ended tasks (dialogue, code)  | Tasks with a clear end-state (video gen.) |

The dilemma is that we want a robot that can both plan long-term (“get the tool from the drawer in the next room”) and adapt flexibly in real time (“oh, the drawer is stuck, I need to jiggle it first”).

The MIT Breakthrough: Diffusion Forcing

At MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), researchers have recently introduced a method that could bridge this gap. Their approach, called “diffusion forcing,” blends the holistic planning of diffusion models with the step-by-step adaptability of next-token prediction.

The idea builds on “teacher forcing,” a conventional training scheme where models learn by predicting the next token in a sequence. Diffusion forcing extends this principle by adding varying levels of noise to tokens, which the model must then “cleanse” while also predicting upcoming steps. In effect, it acts as a kind of fractional masking, giving the model the ability to both denoise a sequence and anticipate what comes next.
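A rough sketch may help make that concrete. The training step below is my own simplified illustration of the idea as described above, not the authors' code: the linear noising stands in for a proper diffusion schedule, and the `model(noisy_seq, levels)` signature is an assumption. The essential point is that every token in a training sequence gets its own random noise level, and the model must recover the clean sequence from the partially noised one.

```python
import torch

def diffusion_forcing_step(model, clean_seq):
    """One illustrative training step; clean_seq has shape (batch, time, dim)."""
    b, t, _ = clean_seq.shape

    # 1. Independent noise level per token: the "fractional masking".
    #    Some tokens stay nearly clean, others become almost pure noise.
    levels = torch.rand(b, t, 1)                     # 0 = clean, 1 = pure noise
    noisy_seq = (1 - levels) * clean_seq + levels * torch.randn_like(clean_seq)

    # 2. The model sees the noisy tokens and their noise levels and, attending
    #    causally to earlier (noisy) tokens as in teacher forcing, predicts
    #    the clean value of every token in the sequence.
    pred = model(noisy_seq, levels)                  # signature assumed for illustration

    # 3. Denoising loss over the whole sequence.
    loss = torch.mean((pred - clean_seq) ** 2)
    loss.backward()
    return loss
```

Because each token can sit anywhere between fully observed and fully masked, the same trained model can later be used both to denoise whole plans and to roll forward step by step.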

In experiments, diffusion forcing proved remarkably robust. A robotic arm trained with this method was able to rearrange toy fruits into precise target spots on circular mats, even when starting from random positions or when its view was partially obstructed by objects like shopping bags. Despite the distractions, the arm completed the task with consistency, showing how diffusion forcing can filter noisy inputs while maintaining reliable control.

A Human Analogy: Learning Through Noise

To understand diffusion forcing more intuitively, think about how humans learn. A pianist, for example, doesn’t play a new piece flawlessly the first time. Instead, they practice while making mistakes—wrong notes, uneven timing, or slips of the hand. Over time, they learn to filter out these “noisy” errors, gradually refining the music until the performance flows smoothly.

Diffusion forcing works in a similar way. It trains AI to recognize what parts of its input can be trusted and what parts are noise. Just as a pianist learns which notes matter in the melody, the AI learns which signals are meaningful for predicting the next steps in a task. The result is a model that can stay focused on its goal, even when distractions or uncertainty get in the way.

Beyond robotics, the method also improved AI video generation by producing smoother, more stable sequences. Researchers suggest that this dual capability could one day support a “world model”—an AI system trained on billions of internet videos to simulate real-world dynamics. Such a system could allow robots to perform novel tasks by imagining the steps needed, without requiring explicit demonstrations.

As Vincent Sitzmann, MIT professor and leader of CSAIL’s Scene Representation Group, put it: “With diffusion forcing, we are taking a step to bringing video generation and robotics closer together. In the end, we hope to use all the knowledge stored in videos on the internet to enable robots to help in everyday life.”

The Future: The Best of Both Worlds

The real breakthrough in AI may not come from choosing between diffusion models or next-token prediction, but from fusing them into something greater. Diffusion excels at long-horizon reasoning, offering foresight and structure, while next-token prediction thrives on adaptability, adjusting fluidly to each new moment. A future system that unites these strengths could think ahead like a planner yet respond instantly like an improviser.

MIT’s diffusion forcing offers a glimpse of this future. By combining the long-range planning of diffusion with the stepwise flexibility of next-token-style training, it enables robots to anticipate future actions while reacting to noisy or unpredictable inputs. For instance, a robotic arm trained with diffusion forcing can rearrange objects on mats despite visual distractions, effectively bridging high-level goals and low-level execution. This experiment demonstrates how hybrid approaches can translate abstract plans into precise, adaptable actions.

One path forward is to use diffusion for planning and next-token for control. A diffusion-inspired model could imagine a “goal video” of a task completed—a roadmap of the big picture—while a next-token-style model would translate that vision into step-by-step actions, ensuring agility in dynamic environments. Another vision is hierarchical intelligence, where AI operates on multiple timescales simultaneously. A high-level “director” would diffuse a strategy for the next minute, while a low-level “actor” handles millisecond-by-millisecond execution.
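Taken together, that director/actor split is essentially two nested loops running at different rates. The sketch below is purely hypothetical: every function name in it is a placeholder for components that do not exist yet, but it shows the shape such a system could take.

```python
import time

def diffuse_plan(goal, observation):
    """Hypothetical high-level 'director': imagine a coarse plan toward the goal."""
    return [f"waypoint_{i}" for i in range(5)]       # stand-in plan

def next_action(plan, observation):
    """Hypothetical low-level 'actor': turn the current plan into one small action."""
    return ("follow", plan[0], observation)

def run(goal, get_observation, send_action, replan_every=1.0):
    plan, last_replan = None, 0.0
    while True:
        obs = get_observation()                      # fast loop: every sensor update
        now = time.monotonic()
        if plan is None or now - last_replan > replan_every:
            plan = diffuse_plan(goal, obs)           # slow loop: re-imagine the plan
            last_replan = now
        send_action(next_action(plan, obs))          # react immediately to `obs`
```

The interesting engineering question is how often the slow loop should run and how much authority the fast loop has to deviate from the plan in between.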

By weaving these approaches together, researchers are moving toward a new paradigm of AI—systems that don’t just react, but imagine, plan, and adapt seamlessly. MIT’s diffusion forcing shows that this hybrid future is not merely theoretical; it is already taking shape in experiments that integrate foresight with flexibility, offering a roadmap for truly intelligent robots and AI agents.

Conclusion: The Path to Truly Intelligent Agents

The debate between next-token prediction and full-sequence diffusion is not just about technical design; it’s about how we understand intelligence itself. Is intelligence best described as an adaptive, step-by-step process, or as a goal-oriented, holistic one?

The most convincing answer is that it must be both. By learning from the strengths and weaknesses of these two approaches—and by building new methods like diffusion forcing—we can chart a path toward systems that are capable of long-term vision without losing the ability to adapt in the moment.

From Boston Dynamics’ agile machines to MIT’s robotic arms and the autonomous cars navigating city streets, the trajectory is clear. The future belongs to models that can dream of a distant goal while improvising their way through the unpredictable present. The race is already underway, and the prize is nothing less than the foundation of truly intelligent agents.
