
Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks

Bibliographic Details
Published in: arXiv.org, 2019-07
Main Authors: Lee, Michelle A, Zhu, Yuke, Zachares, Peter, Tan, Matthew, Srinivasan, Krishnan, Savarese, Silvio, Li, Fei-Fei, Garg, Animesh, Bohg, Jeannette
Format: Article
Language: English
Description
Summary: Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. It is non-trivial to manually design a robot controller that combines these modalities, which have very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to sample complexity. In this work, we use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. Evaluating our method on a peg insertion task, we show that it generalizes over varying geometries, configurations, and clearances, while being robust to external perturbations. We also systematically study different self-supervised learning objectives and representation learning architectures. Results are presented in simulation and on a physical robot.
ISSN:2331-8422
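
Note: the summary describes fusing visual and haptic inputs into a compact latent representation via self-supervised objectives, and then feeding that latent to a reinforcement-learning policy. The following PyTorch-style sketch is only an illustration of that general idea, not the authors' released code or architecture; all module names, dimensions, input sizes, and the choice of a contact-prediction head are assumptions made here for clarity.

```python
# Illustrative sketch only: fuse RGB images and force/torque readings into a
# compact multimodal latent, with one example self-supervised head.
# Architecture choices and dimensions are assumptions, not the paper's code.
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Vision branch: small CNN over 64x64 RGB frames (assumed input size).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 13 * 13, 256), nn.ReLU(),
        )
        # Haptic branch: MLP over a flattened window of 6-axis force/torque readings.
        self.haptic = nn.Sequential(
            nn.Linear(6 * 32, 128), nn.ReLU(),
        )
        # Fusion into a compact multimodal latent vector.
        self.fusion = nn.Linear(256 + 128, latent_dim)
        # Example self-supervised head: predict whether the robot is in contact,
        # a label that can be generated automatically from sensor data.
        self.contact_head = nn.Linear(latent_dim, 1)

    def forward(self, rgb, wrench):
        z = self.fusion(torch.cat([self.vision(rgb), self.haptic(wrench)], dim=-1))
        return z, self.contact_head(z)

# Usage sketch: the latent z would be the (low-dimensional) observation for an
# RL policy, improving sample efficiency relative to raw pixels and wrenches.
enc = MultimodalEncoder()
rgb = torch.zeros(8, 3, 64, 64)      # batch of camera frames
wrench = torch.zeros(8, 6 * 32)      # flattened window of force/torque readings
z, contact_logit = enc(rgb, wrench)
```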