Multimodal deep SensoriMotor Representation learning (MeSMRise)

Funded by: ANR
Funding: €511,262
Period: 2023–2027
Status: In progress
Coordinator: Mathieu Lefort
Collaborators: Frédéric Armetta, Stefan Duffner, Laëtitia Matignon, Alexandre Devillers (LIRIS), Jean-Charles Quinton, Xavier Alameda-Pineda (LJK), Arthur Aubret, Jochen Triesch, Céline Teulière (Institut Pascal)

Description

There is "a fundamental misalignment between human and typical AI representations: while the former are grounded in rich sensorimotor experience, the latter are typically passive and limited to a few modalities such as vision and text" (Hay et al, 2016). We propose in this project to take inspiration from the way human babies learn to explore their environment through actions that shape their multimodal experience. Especially, the sensorimotor contingencies (SMC) theory combines coherent pieces of evidence from neuroscience, psychology, etc. of human perception and learning in a unified framework. The key claims are learning of SMCs defined as “the structure of the rules governing the sensory changes produced by various motor actions” (O'Regan et al, 2001) and active perception as the “organism’s exploration of the environment that is mediated by knowledge of SMCs” (Myin et al, 2002). Some models implementing this theory are able to learn complex concepts such as containment for instance. Inspired by the SMC theory, the main objective of the project is to study how action can structure the multimodal representations, learned with self-supervised learning methods. This will be applied to 3D objects, perceived by vision and point cloud, and manipulated in virtual environments. By proposing a new paradigm for unsupervised representation learning with multimodal data fusion, it will participate to multiple issues of the E.2 axis. Specifically, we target the following properties: - generalization to unknown environments and contexts - robustness, e.g. to the orientation, background, shape... of the object - adaptability via the capacity of the model to autonomously find relevant information - generality by using similar architectures and principles for all research questions.