


Google showed off Project Astra, an AI assistant with voice and visual recognition similar to GPT-4o


At its Google I/O 2024 presentation, Google showcased Project Astra, a virtual assistant with voice and visual recognition built on Google Gemini that is still in development. Speaking about Astra, Demis Hassabis, CEO of Google DeepMind, said his team has always wanted to develop a universal AI agent that would be helpful in everyday life.

Project Astra's main input interfaces are a camera and voice. In the demo, a person pointed a smartphone camera at different parts of an office and gave Astra the task: "Tell me when you see something that makes a sound." When the assistant spotted a speaker next to a monitor, it replied: "I see a speaker that makes sound." The demonstrator drew an arrow on the screen pointing to the top driver of the speaker and asked: "What is this part of the speaker called?" The program instantly answered: "That is the tweeter. It produces high-frequency sounds."

Then, in a video that Google says was filmed in a single take, a tester pointed the camera at a cup of crayons on the desk and asked, "Give me a creative alliteration about these." Astra responded: "Creative crayons color cheerfully. They certainly craft colorful creations." The video then showed Astra identifying and explaining part of the code on a monitor, and determining the user's location from the view out the window. Astra could even answer the question "Do you remember where you saw my glasses?" although the glasses had never been explicitly pointed out: "Yes, I do. Your glasses were on the desk next to a red apple."

After that, the tester put on the glasses, and the video switched to a first-person perspective. Using a built-in camera, the glasses scanned the surroundings, and the wearer's gaze settled on a system diagram on a whiteboard. The tester asked, "What can I add here to make this system faster?" The program replied: "Adding a cache between the server and the database could improve speed."
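
Google showed no code for this answer, but the pattern Astra names is a standard one. Below is a minimal read-through cache sketch in Python; all names (ReadThroughCache, slow_db_lookup) are hypothetical, and the database lookup is simulated with a short sleep:

    import time

    # Minimal sketch of a read-through cache placed between a server
    # and its database, as Astra suggested. Names are hypothetical;
    # the demo showed no actual code.
    class ReadThroughCache:
        def __init__(self, fetch_from_db, ttl_seconds=60):
            self._fetch = fetch_from_db  # slow path: hits the database
            self._ttl = ttl_seconds
            self._store = {}             # key -> (value, expiry time)

        def get(self, key):
            entry = self._store.get(key)
            if entry and entry[1] > time.time():
                return entry[0]          # fast path: served from memory
            value = self._fetch(key)     # cache miss: query the database
            self._store[key] = (value, time.time() + self._ttl)
            return value

    def slow_db_lookup(key):
        time.sleep(0.1)                  # stand-in for a real query
        return f"row for {key}"

    cache = ReadThroughCache(slow_db_lookup)
    print(cache.get("user:42"))          # slow: goes to the database
    print(cache.get("user:42"))          # fast: served from the cache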

The tester then looked at a pair of cats drawn on the board and asked, "What does this remind you of?" Astra replied: "Schrödinger's cat." When a plush tiger was placed next to a golden retriever and Astra was asked to come up with a name for the duo, it suggested "Golden Stripes."

The demonstration shows that Astra not only processes visual data in real time but also remembers what it has seen and can work with that stored information. According to Hassabis, this is achieved by continuously encoding video frames, combining the video and speech input into a timeline of events, and caching that information for efficient recall.
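
Google has not published Astra's architecture, so the following is only an illustrative sketch of the mechanism Hassabis describes: encoded video and speech events are stamped onto a shared timeline and kept in a bounded cache, so a later question such as "where were my glasses?" can be answered from memory rather than from the live feed. Every name here is hypothetical:

    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Event:
        timestamp: float
        modality: str     # "video" or "speech"
        description: str  # stand-in for an encoded embedding

    class EventTimeline:
        def __init__(self, max_events=1000):
            self._events = deque(maxlen=max_events)  # bounded cache

        def record(self, timestamp, modality, description):
            self._events.append(Event(timestamp, modality, description))

        def recall(self, keyword):
            # Search the cached timeline, newest events first.
            for event in reversed(self._events):
                if keyword in event.description:
                    return event
            return None

    timeline = EventTimeline()
    timeline.record(1.0, "video", "speaker next to the monitor")
    timeline.record(2.5, "video", "glasses on the desk near a red apple")
    timeline.record(4.0, "speech", "user asks about the whiteboard diagram")

    hit = timeline.recall("glasses")
    if hit:
        print(f"Seen at t={hit.timestamp}s: {hit.description}")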

In the video, Astra responded to requests fairly quickly. Hassabis noted in a blog post: "While we have made incredible progress in developing AI systems that can understand multimodal information, getting response time down to a conversational level is a difficult engineering challenge." Google is also working to give its AI voices a wider range of intonation and emotional nuance.

Although Astra remains an early project with no specific launch plans, Hassabis said that similar assistants could eventually be available on phones or in glasses. There is no word yet on whether such glasses would be a successor to Google Glass, but the head of DeepMind noted that some of the demonstrated capabilities will come to Google products later this year.

Source: Engadget
