Objective and Scope
The integration of foundation models, such as Large Language Models (LLMs), Vision-Language Models (VLMs), Vision-Language-Action Models (VLAMs), and other multimodal architectures, has the potential to fundamentally reshape how humanoid robots perceive, reason, act, and interact in the world. Until recently, general-purpose humanoid robots were constrained by the limited scalability of traditional planning, control, and perception pipelines. These robots, designed to understand instructions, adapt to novel environments, and engage in human-like interaction, often struggled to achieve the flexibility required for general use. Foundation models offer a major step forward by enabling abstract reasoning, grounding through multimodal inputs, and fast generalization across tasks and domains.

However, integrating these models into humanoid robots presents open challenges. Questions remain about grounding abstract knowledge in sensorimotor data, combining model-based control with learned representations, ensuring safety and interpretability, and enabling embodied models to interact socially and linguistically with humans. Humanoid platforms, ranging from full-body anthropomorphic robots to more abstract bimanual systems, are especially well suited for this line of research: their structural similarity to humans and their versatility make them ideal for exploring foundation model grounding, physical reasoning, and social interaction in human environments.

This workshop is designed to promote collaboration between academic researchers and industry practitioners. The program will feature keynote talks, a cross-sector panel discussion, and a call for contributed presentations, with a particular focus on emerging research at the intersection of humanoid robotics and foundation models. By fostering this convergence, the workshop aims to build a roadmap toward generalizable, human-aligned, and socially capable humanoid systems powered by foundation models.
Topics
- Foundation Models for Humanoid Robotics: Use of large pre-trained language and vision-language models for planning, decision-making, and control in humanoid systems
- Multimodal Perception and World Modeling: Leveraging VLMs for scene understanding, object grounding, spatial awareness, and physical commonsense reasoning
- Policy Learning with Foundation Models: Language-guided reinforcement and imitation learning, as well as model-based control augmented by LLMs
- Human-Robot Interaction and Social Intelligence: Social reasoning, linguistic interaction, and collaborative behavior in human-centered environments
- Safety, Interpretability, and Human Alignment: Ensuring safe, transparent, and aligned behavior in LLM- or VLM-driven humanoid systems
- Simulation, Evaluation, and Benchmarks: Tools and protocols for benchmarking embodied foundation models in humanoid tasks, including real-world and simulated settings
- Applications in Complex Environments: Household assistance, healthcare, industrial tasks, mobile manipulation, and human-facing applications involving physical and social complexity
Speakers
Organizers
Preliminary Program
09:00 - 09:30 | Welcome & Introduction
09:30 - 10:00 | Keerthana Gopalakrishnan
10:00 - 10:30 | Booster Presentations of Accepted Posters |
10:30 - 11:00 | ☕ Coffee Break | Poster Session |
11:00 - 11:30 | Rudolf Lioutikov |
11:30 - 12:00 | Kento Kawaharazuka | Foundation Model-based Recognition and Planning for Humanoid Robots |
12:00 - 14:00 | 🍽️ Lunch Break
14:00 - 14:30 | Joel Jang | Why do we need Humanoid Robots? Perspective with video world models |
14:30 - 15:00 | Roger Qiu | Blurring the line between humans and humanoids |
15:00 - 16:00 | ☕ Coffee Break |
16:00 - 16:45 | Discussion Panel |
16:45 - 17:00 | Closing Remarks |
17:00 | End |
Accepted Posters
Contact
For any inquiries related to the workshop, submissions, or participation, feel free to reach out to us at:
Email: dionis.totsila@inria.fr
We look forward to hearing from you!