Objective and Scope
The integration of foundation models, such as Large Language Models (LLMs), Vision-Language Models (VLMs), Vision-Language-Action Models (VLAMs), and other multimodal architectures, has the potential to fundamentally reshape how humanoid robots perceive, reason, act, and interact in the world. Until recently, general-purpose humanoid robots were constrained by the limited scalability of traditional planning, control, and perception pipelines. These robots, designed to understand instructions, adapt to novel environments, and engage in human-like interaction, often struggled to achieve the flexibility required for general use. Foundation models offer a major step forward by enabling abstract reasoning, grounding through multimodal inputs, and fast generalization across tasks and domains.

However, integrating these models into humanoid robots presents open challenges. Questions remain about grounding abstract knowledge in sensorimotor data, combining model-based control with learned representations, ensuring safety and interpretability, and enabling embodied models to interact socially and linguistically with humans. Humanoid platforms, ranging from full-body anthropomorphic robots to more abstract bimanual systems, are especially well suited for this line of research: their structural similarity to humans and their versatility make them ideal for exploring foundation model grounding, physical reasoning, and social interaction in human environments.

This workshop is designed to promote collaboration between academic researchers and industry practitioners. The program will feature keynote talks, a cross-sector panel discussion, and a call for contributed presentations, with a particular focus on emerging research at the intersection of humanoid robotics and foundation models. By fostering this convergence, the workshop aims to build a roadmap toward generalizable, human-aligned, and socially capable humanoid systems powered by foundation models.
Topics
- Foundation Models for Humanoid Robotics: Use of large pre-trained language and vision-language models for planning, decision-making, and control in humanoid systems
- Multimodal Perception and World Modeling: Leveraging VLMs for scene understanding, object grounding, spatial awareness, and physical commonsense reasoning
- Policy Learning with Foundation Models: Language-guided reinforcement and imitation learning, as well as model-based control augmented by LLMs
- Human-Robot Interaction and Social Intelligence: Social reasoning, linguistic interaction, and collaborative behavior in human-centered environments
- Safety, Interpretability, and Human Alignment: Ensuring safe, transparent, and aligned behavior in LLM- or VLM-driven humanoid systems
- Simulation, Evaluation, and Benchmarks: Tools and protocols for benchmarking embodied foundation models in humanoid tasks, including real-world and simulated settings
- Applications in Complex Environments: Household assistance, healthcare, industrial tasks, mobile manipulation, and human-facing applications involving physical and social complexity
Speakers
Organizers
Preliminary Program
09:00 - 09:30 | Welcome & Introduction
09:30 - 10:00 | Keerthana Gopalakrishnan
10:00 - 10:30 | Booster Presentations of Accepted Posters |
10:30 - 11:00 | ☕ Coffee Break | Poster Session |
11:00 - 11:30 | Rudolf Lioutikov |
11:30 - 12:00 | Kento Kawaharazuka | Foundation Model-based Recognition and Planning for Humanoid Robots |
12:00 - 14:00 | 🍽️ Lunch Break
14:00 - 14:30 | Joel Jang | Why do we need Humanoid Robots? Perspective with video world models |
14:30 - 15:00 | Roger Qiu | Blurring the line between humans and humanoids |
15:00 - 16:00 | ☕ Coffee Break |
16:00 - 16:45 | Discussion Panel |
16:45 - 17:00 | Closing Remarks |
17:00 | End |
Accepted Posters
Contact
For any inquiries related to the workshop, submissions, or participation, feel free to reach out to us at:
Email: dionis.totsila@inria.fr
We look forward to hearing from you!