Learning Modular Language-Conditioned Robot Policies through Attention

Yifan Zhou1, Shubham Sonawani1, Mariano Phielipp2, Heni Ben Amor1, Simon Stepputtis3
1Arizona State University, 2Intel AI, 3Carnegie Mellon University

Our proposed method is an imitation-learning approach for language-conditioned robot policies. It achieves high performance across a variety of tasks, transfers to new robots in a data-efficient manner while retaining high execution performance, and supports adding new behaviors to an already-trained policy. In addition, we demonstrate the ability to learn relational tasks, in which two objects are referenced in the same sentence.

This work builds upon our CoRL 2022 paper "Modularity through Attention" and has been published in Autonomous Robots (2023).

Robot Executions

In Simulation

20 Real World Executions in a Row

Abstract

Training language-conditioned imitation learning policies is typically time-consuming and resource-intensive. Additionally, the resulting controllers are tailored to the specific robot they were trained on, making it difficult to transfer them to other robots with different dynamics. To address these challenges, we propose a new approach called Hierarchical Modularity, which enables more efficient training and subsequent transfer of such policies across different types of robots. The approach incorporates Supervised Attention, which bridges the gap between modular and end-to-end learning by enabling the re-use of functional building blocks. In this contribution, we build upon our previous work, showcasing extended capabilities and improved performance by expanding the hierarchy to include new tasks and introducing an automated pipeline for synthesizing a large quantity of novel objects. We demonstrate the effectiveness of this approach through extensive simulated and real-world robot manipulation experiments.

Method

Overview


This figure shows a typical imitation-learning pipeline. The inputs are usually an RGB image, the joint angles, and a language instruction; the output is an action trajectory. Given expert demonstrations, we can train an end-to-end policy network. Many works have shown success with this paradigm and achieved impressive results. However, these approaches are also data-hungry: they typically require large amounts of training data, even though robot data is hard to acquire. This work therefore focuses on making language-conditioned imitation learning more data-efficient. We still train an end-to-end model; however, by designing attention techniques that route the information flow in a desired manner, we create distinct sub-modules within the same network that account for different sub-tasks.
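As a heavily simplified sketch of this input/output interface, the stub below maps the three inputs (image features, joint angles, language embedding) to an action trajectory. All dimensions and the randomly initialized weights are illustrative placeholders, not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: image features, joint angles, language embedding,
# and a trajectory of 10 waypoints x 7 joint targets.
IMG_DIM, JOINT_DIM, LANG_DIM, HIDDEN = 64, 7, 32, 128
TRAJ_STEPS, ACT_DIM = 10, 7

# Random weights stand in for a trained end-to-end network.
W_in = rng.normal(0, 0.02, (IMG_DIM + JOINT_DIM + LANG_DIM, HIDDEN))
W_out = rng.normal(0, 0.02, (HIDDEN, TRAJ_STEPS * ACT_DIM))

def policy(img_feat, joints, lang_emb):
    """Map (image features, joint angles, language embedding) -> trajectory."""
    x = np.concatenate([img_feat, joints, lang_emb])
    h = np.tanh(x @ W_in)                       # shared hidden representation
    return (h @ W_out).reshape(TRAJ_STEPS, ACT_DIM)

traj = policy(rng.normal(size=IMG_DIM), np.zeros(JOINT_DIM), rng.normal(size=LANG_DIM))
print(traj.shape)  # -> (10, 7)
```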

Modularity


We implement building blocks called sub-modules, each realizing a specialized sub-task. These modules are implemented within the same transformer. For example, the EE module is assigned the task of tracking the end-effector, while the DISP module computes the displacement between the target object and the end-effector.
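One way such routing inside a single attention layer can be sketched is with a hand-specified attention mask that dictates which tokens each sub-module may read. The token layout and mask below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; mask[i, j] = True lets query i attend to key j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)      # block disallowed routes
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Illustrative token order: 0=proprioception, 1=vision, 2=language,
# 3=EE sub-module, 4=TARGET sub-module, 5=DISP sub-module.
mask = np.zeros((6, 6), dtype=bool)
mask[:3, :3] = np.eye(3, dtype=bool)  # input tokens attend to themselves
mask[3, 0] = True                     # EE reads proprioception
mask[4, [1, 2]] = True                # TARGET reads vision + language
mask[5, [3, 4]] = True                # DISP reads only EE and TARGET outputs

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(6, 8))
_, w = masked_attention(q, k, v, mask)
print(np.allclose(w[5, [0, 1, 2, 5]], 0))  # -> True: DISP ignores raw inputs
```

Restricting each sub-module's receptive field this way is what lets one network host several functionally distinct blocks.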

Hierarchy of Modules


Since we can create multiple sub-modules, we can connect them into a hierarchy for the desired task. The hierarchy shown here is a typical example for common manipulation tasks. Each node is a sub-module, and each arrow represents attention-token embeddings passed between attention layers. First, the model interprets the language input, from which it locates the target. After also locating the end-effector, it computes the displacement between the two, which finally feeds into the controller sub-module that outputs the action trajectory.
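The dataflow just described can be sketched as a small dependency graph evaluated in topological order. The module names follow the figure, but the compute functions here are placeholder stubs, not the learned sub-modules:

```python
# Each sub-module is a node; edges are the routes along which attention
# embeddings flow between layers. Values are (parents, stub function).
hierarchy = {
    "LANG":   ([],               lambda: "red block"),        # parse instruction
    "TARGET": (["LANG"],         lambda lang: (0.4, 0.2)),    # locate target
    "EE":     ([],               lambda: (0.1, 0.1)),         # locate end-effector
    "DISP":   (["TARGET", "EE"], lambda t, e: (t[0] - e[0], t[1] - e[1])),
    "CTRL":   (["DISP"],         lambda d: [d]),              # emit trajectory
}

def evaluate(graph):
    """Resolve modules in dependency order (simple recursive topological eval)."""
    cache = {}
    def resolve(name):
        if name not in cache:
            deps, fn = graph[name]
            cache[name] = fn(*[resolve(d) for d in deps])
        return cache[name]
    return {name: resolve(name) for name in graph}

out = evaluate(hierarchy)
print(out["DISP"])  # displacement between target and end-effector
```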

Experiments

Transfer to New Robots by Finetuning Trained Modules


After training on a task, we can finetune individual modules. For example, finetuning only the CTRL (motion control) module allows us to transfer the trained policy to an unseen robot embodiment.
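A toy illustration of such module-selective finetuning: only parameters belonging to one sub-module receive gradient updates, while the rest stay frozen. The parameter names and prefix convention are assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flat parameter store; names are prefixed by their sub-module.
params = {
    "ee.w":   rng.normal(size=(4, 4)),
    "disp.w": rng.normal(size=(4, 4)),
    "ctrl.w": rng.normal(size=(4, 4)),
}

def finetune_step(params, grads, lr=0.01, trainable_prefix="ctrl."):
    """Apply a gradient step only to the module selected for finetuning."""
    return {
        name: (p - lr * grads[name]) if name.startswith(trainable_prefix) else p
        for name, p in params.items()
    }

grads = {name: np.ones_like(p) for name, p in params.items()}
new = finetune_step(params, grads)
print(np.array_equal(new["ee.w"], params["ee.w"]),      # -> True  (frozen)
      np.array_equal(new["ctrl.w"], params["ctrl.w"]))  # -> False (updated)
```

Because only the controller's parameters change, the new-robot data only needs to cover the motion-control sub-task, which is what makes the transfer data-efficient.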

Extending the Hierarchy


Because of the hierarchical structure, new modules can be added to the architecture easily. For example, we add a branch of modules for obstacle avoidance: analogous to the main branch, we first detect the obstacle and then compute its displacement from the end-effector, which feeds into the controller module when generating the trajectory. Another case (b) is the hierarchy for relational tasks, e.g., "Put A to the right of B".
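The re-wiring involved can be illustrated on a toy module graph: adding the obstacle branch means inserting two nodes and giving the controller one extra input, while every existing node is left untouched. Names here are illustrative:

```python
# A module graph as {name: list of parent modules}.
base = {
    "LANG": [], "TARGET": ["LANG"], "EE": [],
    "DISP": ["TARGET", "EE"], "CTRL": ["DISP"],
}

def add_obstacle_branch(graph):
    """Return a copy with an obstacle-avoidance branch feeding the controller."""
    g = {name: list(parents) for name, parents in graph.items()}
    g["OBST"] = []                    # detect the obstacle
    g["OBST_DISP"] = ["OBST", "EE"]   # obstacle-to-end-effector displacement
    g["CTRL"] = g["CTRL"] + ["OBST_DISP"]
    return g

extended = add_obstacle_branch(base)
print(extended["CTRL"])  # -> ['DISP', 'OBST_DISP']
```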

Obstacle Avoidance

Relational Tasks

BibTeX

@article{zhou2023learning,
  title={Learning modular language-conditioned robot policies through attention},
  author={Zhou, Yifan and Sonawani, Shubham and Phielipp, Mariano and Ben Amor, Heni and Stepputtis, Simon},
  journal={Autonomous Robots},
  pages={1--21},
  year={2023},
  publisher={Springer}
}