
Why VL-JEPA Signals the Next Phase of AI Intelligence

18 Dec 2025

For the past few years, artificial intelligence has been defined almost entirely by language.

Large Language Models (LLMs) astonished us by writing essays, passing exams, generating code, and holding conversations that felt, at times, uncannily human. It’s easy to assume that language is intelligence.

But language is not intelligence.
It is one expression of it.

A recent announcement from Meta’s AI research team, led by Yann LeCun and Pascale Fung, quietly points to something deeper: a shift away from word-prediction machines and towards AI systems that understand the world directly. The architecture is called VL-JEPA, the Vision-Language Joint Embedding Predictive Architecture.

And it may mark the beginning of the post-language phase of AI.

The Limits of Language-First AI

At their core, LLMs do one thing exceptionally well:
they predict the next token in a sequence.

Even when they “see” images or video, those inputs are first converted into tokens, flattened into language-like representations, and processed step-by-step. Intelligence emerges from statistical patterns across vast corpora of human text.
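
To make that one-token-at-a-time loop concrete, here is a minimal sketch of autoregressive decoding. The `model` and `tokenizer` names are hypothetical placeholders rather than any particular library’s API; the shape of the loop is the point: one full forward pass per emitted token.

```python
# A minimal sketch of the autoregressive loop at the heart of every LLM.
# `model` and `tokenizer` are illustrative placeholders, not a real API.

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    tokens = tokenizer.encode(prompt)        # text -> token ids
    for _ in range(max_new_tokens):
        logits = model(tokens)               # one full forward pass...
        next_token = logits[-1].argmax()     # ...to choose a single next token
        tokens.append(next_token)            # grow the sequence and repeat
    return tokenizer.decode(tokens)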

This approach is powerful, but it is also inefficient and fragile.

LLMs:

  • Think in sentences, not states
  • Generate one token at a time (slow and costly)
  • Struggle with real-time perception
  • Lack an internal model of how the physical world behaves

In other words, LLMs are brilliant narrators, but poor observers.

What VL-JEPA Does Differently

VL-JEPA does not generate language.

Instead, it learns a shared latent representation of vision and language and uses that representation to predict future states of the world. It operates in abstract space, not word space.

This distinction matters.

Rather than asking:

“What word comes next?”

VL-JEPA asks:

“What should happen next?”

It predicts embeddings, not sentences.
Meaning emerges without narration.
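
To give that a shape, here is a conceptual sketch of the joint-embedding predictive idea, written in PyTorch. The dimensions, module shapes, and training details are assumptions made for illustration, not Meta’s published VL-JEPA architecture; what matters is that both the prediction and the loss live in latent space.

```python
# A conceptual sketch of the joint-embedding predictive idea: predict the
# *embedding* of a future observation, never its pixels or tokens.
# Dimensions and module shapes are illustrative assumptions, not Meta's
# published VL-JEPA architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256  # size of the shared latent space (assumed)

context_encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
target_encoder  = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
predictor       = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))

def jepa_loss(current: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
    """Predict the latent of the next observation from the current one."""
    z_context = context_encoder(current)      # what the model sees now
    z_pred = predictor(z_context)             # "what should happen next?"
    with torch.no_grad():                     # targets supply no gradients;
        z_target = target_encoder(future)     # in practice often an EMA copy
    return F.mse_loss(z_pred, z_target)       # the loss lives in latent space

# Toy usage: two consecutive "frames" as random feature vectors.
frame_t, frame_t1 = torch.randn(8, DIM), torch.randn(8, DIM)
loss = jepa_loss(frame_t, frame_t1)
loss.backward()  # trains the context encoder and predictor only
```

Notice what is absent: no vocabulary, no softmax over tokens, no decoding loop. A single prediction covers an entire future state.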

This makes VL-JEPA:

  • Non-autoregressive (no token-by-token decoding)
  • Faster and more efficient
  • Naturally suited to streaming video
  • Ideal for wearables, robotics, and embodied AI

This is why Meta explicitly links it to AI assistants in smart glasses.

From Language Models to World Models

This shift echoes a long-standing belief held by Yann LeCun:
intelligence is grounded in predictive models of the world, not in language alone.

Humans don’t continuously narrate reality to function within it. We perceive, anticipate, and act, mostly without words.

VL-JEPA brings AI closer to that mode of cognition.

It understands:

  • Motion
  • Objects
  • Actions
  • Cause and effect

All without generating a single sentence.

Language, in this architecture, becomes optional, not foundational.


Why This Matters

The excitement around generative AI has overshadowed a quieter truth: generation is expensive.

Most real-world intelligence doesn’t require eloquence. It requires:

  • Speed
  • Continuity
  • Anticipation
  • Low latency
  • Low power consumption

Smart glasses don’t need essays.
Robots don’t need poetry.
Autonomous systems don’t need to “explain” themselves every second.

They need situational awareness.

VL-JEPA provides exactly that.

The Future Is Hybrid, Not Competitive

This is not an “LLMs vs VL-JEPA” story; it’s a layered future. A likely architecture looks like this:

  • World Model (VL-JEPA): silent, continuous, predictive perception
  • Language Model (LLM): reflection, explanation, creativity, ethics, dialogue

In this stack, LLMs sit on top of world understanding, not at its foundation.
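
As a rough illustration of that layering, the sketch below shows a perception loop that runs silently and only hands off to a language model when words are actually needed. Every interface here (`world_model.embed_frame`, `detect_event`, `llm.explain`) is hypothetical, invented purely to show the division of labour.

```python
# A hedged sketch of the layered stack described above. All interfaces
# (`camera`, `world_model`, `llm`, `user`) are hypothetical placeholders.

def assistant_loop(camera, world_model, llm, user) -> None:
    for frame in camera.stream():
        state = world_model.embed_frame(frame)   # silent, continuous, cheap
        event = world_model.detect_event(state)  # anticipation in latent space
        if event and user.wants_explanation(event):
            # Language is invoked only on demand: a lens, not the engine.
            user.notify(llm.explain(event))
```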

Language becomes a lens, not the engine.

Beyond Language, Towards Intelligence

We are entering an era where AI systems will:

  • See without speaking
  • Understand without explaining
  • Act without narrating

This doesn’t diminish the importance of language; it contextualises it.

In Replugged, I describe technology as moving from tools that extend our hands, to tools that extend our minds. VL-JEPA suggests something even more profound: tools that begin to experience reality in ways closer to how we do.

Not through words, but through understanding.

If you’re interested in how this shift connects to AGI, embodied intelligence, and the deeper philosophical questions around machine “understanding,” these ideas are explored further in Chapter 6 of Replugged: From Mainframes to Mindframes.
