From LLMs to LMMs: How Large Multimodal Models are Giving Robots Common Sense


⚡ Quick Answer

Large Multimodal Models (LMMs) evolve beyond text to integrate visual, auditory, and other sensory data. This synthesis allows robots to interpret physical contexts, predict outcomes, and navigate complex environments, effectively granting them the “common sense” required for autonomous, real-world interaction.


  • The Shift: Transitioning from Large Language Models (LLMs) to Large Multimodal Models (LMMs) bridges the gap between digital reasoning and physical action.
  • Sensory Integration: LMMs process video, depth, and touch, allowing robots to understand “why” and “how” rather than just following rigid code.
  • Embodied AI: By grounding language in visual reality, LMMs solve the “symbol grounding problem,” enabling robots to interact with objects intuitively.
  • Impact: Automation moves from repetitive factory tasks to dynamic domestic and industrial environments.

The Evolution: Beyond the Textual Void

For years, Large Language Models (LLMs) like GPT-4 demonstrated breathtaking intellectual prowess, yet remained “brains in a vat.” They could write poetry about a cup of coffee but couldn’t identify one on a cluttered table. The transition to Large Multimodal Models (LMMs) represents one of the most significant leaps in artificial intelligence: the gift of sight and spatial reasoning.


LMMs are trained on diverse datasets—incorporating images, video, and tactile sensor data—enabling them to correlate the word “fragile” with the visual properties of glass and the physical delicacy required to hold it. This correlation is what we define as robotic common sense.
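
To make that correlation concrete, here is a minimal Python sketch, assuming placeholder `embed_text` and `embed_image` functions stand in for a real CLIP-style vision-language encoder. The embeddings are random, so the outputs are illustrative only; the point is the shape of the idea, where text-image similarity for a concept like “fragile” feeds directly into a grasp parameter.

```python
import numpy as np

# Placeholder stand-ins for a real vision-language encoder (e.g. a CLIP-style
# model). In practice these would return learned embeddings; here we use
# random vectors purely so the sketch runs end to end.
rng = np.random.default_rng(0)
EMBED_DIM = 64

def embed_text(phrase: str) -> np.ndarray:
    """Return a unit-norm text embedding (placeholder)."""
    vec = rng.normal(size=EMBED_DIM)
    return vec / np.linalg.norm(vec)

def embed_image(image_id: str) -> np.ndarray:
    """Return a unit-norm image embedding (placeholder)."""
    vec = rng.normal(size=EMBED_DIM)
    return vec / np.linalg.norm(vec)

def grip_force_for(object_image: str, concept: str = "fragile") -> float:
    """Map text-image similarity to a grasp force: the more 'fragile', the gentler."""
    similarity = float(embed_text(concept) @ embed_image(object_image))  # cosine, both unit-norm
    # Rescale similarity from [-1, 1] to a force in newtons: high similarity -> low force.
    return 5.0 + 20.0 * (1.0 - (similarity + 1.0) / 2.0)

for obj in ["wine_glass.jpg", "steel_wrench.jpg"]:
    print(obj, "->", round(grip_force_for(obj), 1), "N")
```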


Solving the Symbol Grounding Problem

In traditional AI, a robot might know that a “hammer” is a tool used for nails, but it lacks the visual-spatial understanding to find a hammer in a messy toolbox. LMMs solve the Symbol Grounding Problem by mapping semantic concepts directly onto visual representations. When an LMM-powered robot sees a spill, it doesn’t just see a change in pixels; it understands “mess,” “liquid,” and the subsequent need for a “towel.”
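
A toy illustration of that grounding step, assuming a hypothetical lookup table in place of the model’s own reasoning: perception produces detections such as “liquid on floor,” and grounding maps them to the tool and action the robot needs.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str        # semantic concept, e.g. "liquid"
    surface: str      # where it was observed, e.g. "floor"
    confidence: float

# Hypothetical grounding table: maps an observed concept to the tool and action
# an LMM planner might infer. A real system would query the model itself.
GROUNDING = {
    ("liquid", "floor"): ("towel", "wipe"),
    ("crumbs", "table"): ("brush", "sweep"),
}

def plan_from_detections(detections):
    """Turn grounded percepts into (tool, action, target) triples."""
    plan = []
    for det in detections:
        if det.confidence < 0.5:
            continue  # ignore uncertain percepts
        tool_action = GROUNDING.get((det.label, det.surface))
        if tool_action:
            tool, action = tool_action
            plan.append((tool, action, f"{det.label} on {det.surface}"))
    return plan

spill = [Detection(label="liquid", surface="floor", confidence=0.92)]
print(plan_from_detections(spill))  # [('towel', 'wipe', 'liquid on floor')]
```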


The Three Pillars of LMM Common Sense

  1. Visual Reasoning: Identifying objects within a 3D context and understanding their relationships (e.g., the cup is behind the laptop).
  2. Temporal Prediction: Anticipating what happens next. If a ball rolls toward a table edge, an LMM-equipped robot can predict the fall.
  3. Cross-Modal Mapping: Translating a natural language command (“Clean up the living room”) into a series of visual-motor actions (sketched below).
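
The sketch below illustrates the third pillar, cross-modal mapping, assuming a hypothetical `query_lmm` stub that returns a canned JSON plan in place of a real model call; a deployed system would send the camera frame and instruction to the model and dispatch each returned step to a low-level skill controller.

```python
import json

def query_lmm(instruction: str, scene_description: str) -> str:
    """Placeholder for a multimodal model call; returns a canned JSON plan.
    A real system would pass the camera frame along with the instruction."""
    return json.dumps([
        {"skill": "navigate", "target": "coffee_table"},
        {"skill": "pick", "target": "mug"},
        {"skill": "place", "target": "kitchen_sink"},
    ])

def execute(plan_json: str) -> None:
    """Dispatch each planned step to a (stubbed) low-level skill controller."""
    for step in json.loads(plan_json):
        print(f"executing {step['skill']} -> {step['target']}")

scene = "mug and magazines on the coffee table"
execute(query_lmm("Clean up the living room", scene))
```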

From Prediction to Action: Embodied AI

The true power of LMMs is realized through Embodied AI. Unlike a chatbot, an embodied LMM (often referred to as a Vision-Language-Action or VLA model) outputs motor commands. This allows robots to handle tasks that were previously impossible for automation, such as folding laundry, sorting groceries, or assisting in surgical procedures where every movement requires real-time adjustment based on visual feedback.
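
As a rough sketch of how a VLA-style policy turns perception into motion, the code below assumes a placeholder policy that emits discrete action tokens, in the spirit of models such as RT-2, and decodes them into joint-angle deltas. The policy here is random and purely illustrative; only the token-to-command decoding reflects the general pattern.

```python
import numpy as np

# A VLA-style policy typically emits discrete "action tokens" that are decoded
# into continuous motor commands. The policy below is a random stand-in; a real
# one would condition on the camera image and the language instruction.
NUM_BINS = 256
JOINTS = 7
rng = np.random.default_rng(1)

def policy(image: np.ndarray, instruction: str) -> np.ndarray:
    """Placeholder policy: returns one action token per joint."""
    return rng.integers(0, NUM_BINS, size=JOINTS)

def decode_tokens(tokens: np.ndarray, max_delta: float = 0.05) -> np.ndarray:
    """Map token bins [0, NUM_BINS) to joint deltas in [-max_delta, +max_delta] radians."""
    return (tokens / (NUM_BINS - 1) * 2.0 - 1.0) * max_delta

frame = np.zeros((224, 224, 3), dtype=np.uint8)   # stand-in camera frame
tokens = policy(frame, "fold the towel")
print("joint deltas (rad):", np.round(decode_tokens(tokens), 4))
```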



The Future: General Purpose Robots

We are moving away from “specialist” robots programmed for a single task toward “generalist” agents. LMMs provide the cognitive architecture for robots to enter our homes and workplaces not as programmed machines, but as observant learners capable of navigating the chaos of human life with the common sense we once thought was exclusively biological.

