
Google DeepMind has been making steady progress in the field of AI with regular updates to Gemini, Imagen, Veo, Gemma, and AlphaFold. Today, the Google DeepMind team entered the robotics industry with two new Gemini 2.0-based models: Gemini Robotics and Gemini Robotics-ER.
Gemini Robotics is an advanced vision-language-action (VLA) model built on Gemini 2.0 that adds physical actions as a new output modality for controlling robots. Google claims the model can handle situations it has never encountered during training.
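The article does not show what that action output actually looks like, but conceptually a VLA model pairs a familiar text response with a sequence of motor commands. The Python sketch below is purely illustrative: every type and field name is an assumption for the sake of the example, not Gemini Robotics' real interface.

```python
# Illustrative only: Google has not published Gemini Robotics' interface, so the
# names and fields below are hypothetical. The sketch shows what "physical
# actions as an output modality" could look like: a language response paired
# with a short sequence of motor commands instead of text alone.
from dataclasses import dataclass


@dataclass
class RobotAction:
    delta_xyz: tuple[float, float, float]  # end-effector translation in metres
    gripper_open: bool                     # desired gripper state after the move


@dataclass
class VLAOutput:
    text: str                   # ordinary language output, as in a chat model
    actions: list[RobotAction]  # the additional output modality for robots


# Example: a hand-written response to "put the snack in the bag".
example = VLAOutput(
    text="Lowering to the snack, grasping it, then lifting it toward the bag.",
    actions=[
        RobotAction((0.0, 0.0, -0.05), gripper_open=True),   # descend
        RobotAction((0.0, 0.0, 0.0), gripper_open=False),    # grasp
        RobotAction((0.0, 0.0, 0.10), gripper_open=False),   # lift
    ],
)
print(example.text, "->", len(example.actions), "actions")
```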
Compared with other state-of-the-art vision-language-action models, Gemini Robotics performs twice as well on a comprehensive generalization benchmark. Because it is built on Gemini 2.0, it also inherits multilingual natural language understanding, which lets it follow people's commands more reliably.
When it comes to dexterity, Google claims that Gemini Robotics can handle extremely complex, multi-step tasks that require precise manipulation; for example, it can fold origami or pack a snack into a Ziploc bag.
Gemini Robotics-ER is an advanced vision-language model focused on spatial reasoning that lets roboticists connect it to their existing low-level controllers. With this model, roboticists get every step needed to control a robot out of the box: perception, state estimation, spatial understanding, planning, and code generation.
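To make that pipeline concrete, here is a minimal Python sketch of how a roboticist might wire a spatial-reasoning model into an existing low-level controller, following the perception, spatial understanding, planning, and execution flow described above. The `query_spatial_model` helper, the JSON plan format, and the `LowLevelController` class are all hypothetical stand-ins; Gemini Robotics-ER's actual API is not described in the article.

```python
# A minimal sketch, assuming a model that returns a JSON list of waypoints and a
# pre-existing controller that can execute Cartesian moves. All names here are
# placeholders, not Gemini Robotics-ER's real interface.
import json


class LowLevelController:
    """Placeholder for an existing robot controller (e.g. a Cartesian servo loop)."""

    def move_to(self, xyz: list[float]) -> None:
        print(f"moving end-effector to {xyz}")

    def set_gripper(self, open_: bool) -> None:
        print(f"gripper {'open' if open_ else 'closed'}")


def query_spatial_model(camera_frame: bytes, instruction: str) -> list[dict]:
    """Hypothetical call to the spatial-reasoning model. A real system would send
    the camera frame and instruction to the model; here we return a canned plan."""
    canned = '[{"xyz": [0.4, 0.0, 0.2], "grip": true}, {"xyz": [0.4, 0.0, 0.05], "grip": false}]'
    return json.loads(canned)


def run_task(controller: LowLevelController, camera_frame: bytes, instruction: str) -> None:
    # Perception, spatial understanding, and planning are delegated to the model;
    # the existing controller only has to execute the resulting waypoints.
    plan = query_spatial_model(camera_frame, instruction)
    for step in plan:
        controller.move_to(step["xyz"])
        controller.set_gripper(step["grip"])


if __name__ == "__main__":
    run_task(LowLevelController(), b"<camera frame>", "pick up the banana")
```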
Google is partnering with Apptronik to build humanoid robots based on Gemini 2.0 models. Google is also working with select trusted testers, including Agile Robots, Agility Robotics, Boston Dynamics, and Enchanted Tools, on the future of Gemini Robotics-ER.
By enabling robots to understand and execute complex tasks with greater precision and adaptability, Google DeepMind is paving the way for a future where robots can seamlessly integrate into various aspects of our lives.