This work builds upon Robotic Transformer 1 (RT-1), a model trained on multi-task demonstrations, which can learn combinations of tasks and objects seen in the robotic data. More specifically, our work used RT-1 robot demonstration data that was collected with 13 robots over 17 months in an office kitchen environment.

A visual-language model (VLM) pre-trained on web-scale data learns from RT-1 robotics data to become RT-2, a visual-language-action (VLA) model that can control a robot.

RT-2 shows improved generalisation capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.

We also show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, like deciding which object could be used as an improvised hammer (a rock), or which type of drink is best for a tired person (an energy drink).

RT-2 builds upon VLMs that take one or more images as input and produce a sequence of tokens that, conventionally, represent natural language text. Such VLMs have been successfully trained on web-scale data to perform tasks like visual question answering, image captioning, or object recognition. In our work, we adapt the Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E) to act as the backbones of RT-2.

To control a robot, the model must be trained to output actions. We address this challenge by representing actions as tokens in the model's output – similar to language tokens – and describing actions as strings that can be processed by standard natural language tokenizers.
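As a minimal sketch of this action-to-string encoding, consider an RT-1-style action made up of a termination flag, three position deltas, three rotation deltas, and a gripper command, each discretised into 256 bins. The bin count follows the paper's description, while the value ranges and helper names below are illustrative assumptions:

```python
import numpy as np

# Sketch of the action-as-text encoding, assuming RT-1-style actions:
# 1 termination flag + 3 position deltas + 3 rotation deltas + 1 gripper
# command, each discretised into 256 bins. Value ranges and names are
# illustrative assumptions, not the paper's exact scheme.

N_BINS = 256

def discretise(x, lo: float, hi: float) -> np.ndarray:
    """Map continuous values in [lo, hi] to integer bins in [0, N_BINS - 1]."""
    frac = np.clip((np.asarray(x, dtype=float) - lo) / (hi - lo), 0.0, 1.0)
    return np.round(frac * (N_BINS - 1)).astype(int)

def encode_action(terminate: bool,
                  delta_xyz,                # end-effector translation, metres
                  delta_rpy,                # end-effector rotation, radians
                  gripper: float) -> str:   # gripper closure in [0, 1]
    """Serialise one robot action as a string a text tokenizer can consume."""
    tokens = [int(terminate)]
    tokens += discretise(delta_xyz, -0.1, 0.1).tolist()
    tokens += discretise(delta_rpy, -np.pi / 2, np.pi / 2).tolist()
    tokens.append(int(discretise(gripper, 0.0, 1.0)))
    return " ".join(str(t) for t in tokens)

print(encode_action(False, [0.01, -0.02, 0.0], [0.0, 0.1, -0.1], 0.75))
# -> "0 140 102 128 128 136 119 191"
```

Because the result is plain text, the VLM's existing tokenizer and output head can be reused unchanged: predicting a robot action becomes just another text-generation task.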
RT-2 architecture and training: we co-fine-tune a pre-trained VLM on robotics and web data. The resulting model takes in robot camera images and directly predicts actions for a robot to perform.

We performed a series of qualitative and quantitative experiments on our RT-2 models, across more than 6,000 robotic trials. Exploring RT-2's emergent capabilities, we first searched for tasks that would require combining knowledge from web-scale data and the robot's experience, and then defined three categories of skills: symbol understanding, reasoning, and human recognition. Each task required understanding visual-semantic concepts and the ability to perform robotic control to operate on these concepts. Commands such as "pick up the bag about to fall off the table" or "move banana to the sum of two plus one" – where the robot is asked to perform a manipulation task on objects or scenarios never seen in the robotic data – required knowledge translated from web-based data.

Chain-of-thought reasoning enables learning a self-contained model that can both plan long-horizon skill sequences and predict robot actions. With this process, RT-2 can perform more involved commands that require reasoning about the intermediate steps needed to accomplish a user instruction. Thanks to its VLM backbone, RT-2 can also plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language. A hypothetical example of this plan-then-act format is sketched at the end of this post.

RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data. With two instantiations of VLAs based on PaLM-E and PaLI-X, RT-2 results in highly improved robotic policies and, more importantly, leads to significantly better generalisation performance and emergent capabilities, inherited from web-scale vision-language pre-training.
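To make the chain-of-thought variant above concrete, here is a hypothetical training example in the spirit of the approach: the model is taught to emit a short natural-language plan before the tokenized action string, so a single model both plans and acts. The field names, plan wording, and action values are illustrative assumptions, not the exact template used for RT-2:

```python
# Hypothetical chain-of-thought training example (format and values are
# illustrative, not the exact RT-2 template): the target interleaves a
# natural-language "Plan" with the discretised action string.
cot_example = {
    "instruction": "Bring me a drink that would help a tired person.",
    "plan": "Plan: pick up the energy drink.",
    # Same 8-integer action encoding as in the sketch above.
    "action": "Action: 0 132 114 127 127 128 125 255",
}
target_text = f'{cot_example["plan"]} {cot_example["action"]}'
print(target_text)
# -> "Plan: pick up the energy drink. Action: 0 132 114 127 127 128 125 255"
```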