Google's RT-2 model helps robots perform actions in new situations more easily

Google has announced a new vision-language-action (VLA) model called Robotics Transformer 2 (RT-2), which it describes as ‘a first-of-its-kind’. According to Google, RT-2 is able to take text or image inputs and output robotic actions.

The company said that training robots can be a ‘herculean effort’ because they need training on billions of data points across every object, environment, task, and situation in the world. With RT-2, however, Google says there is enormous promise for more general-purpose robots.

While the company is excited about what RT-2 can unlock, it said that a lot of work needs to be done to enable helpful robots in human-centred environments. Ultimately, according to DeepMind, VLA models could lead to general-purpose physical robots that can reason, problem-solve, and interpret information to perform real-world tasks.

As the name suggests, this is not the first iteration of the Robotics Transformer model. DeepMind said that RT-2 builds on the work of RT-1, generalizes better than prior models, and performs better on new, unseen tasks.

Another skill that sets RT-2 apart from its predecessors is symbolic reasoning, which means that it can understand abstract concepts and manipulate them logically. In one example, the robot was asked to move a banana to the sum of 2 plus 1 and performed the task correctly, even though it wasn’t explicitly trained to do abstract math or symbolic manipulation.

While RT-2 is a major step forward for robotics, it wouldn’t be fair to declare that Terminator robots have arrived. The model still requires human input and oversight, and it faces significant technical limitations in real-world robot operations.

With that said, it will hopefully lead to some interesting robots that can perform tasks that were not previously possible or easy to do.