Late last year, Google launched its first natively multimodal model, Gemini 1.0, in three sizes: Ultra, Pro, and Nano. A few months later, it released 1.5 Pro with improved performance, and it has now introduced Gemini 1.5 Flash, a model lighter-weight than 1.5 Pro and designed to be fast and efficient to serve at scale.
On Tuesday, the company hosted its annual I/O developer conference and rolled out its latest artificial intelligence products, from search and chat features to AI assistants. Here's a look at the announcements.
Gemini 1.5 Flash
Starting with 1.5 Flash: it is the newest addition to the Gemini model family and the fastest Gemini model served in the API. It's optimized for high-volume, high-frequency tasks at scale, is more cost-efficient to serve, and features a breakthrough long context window. 1.5 Flash excels at summarization, chat applications, image and video captioning, data extraction from long documents and tables, and more.
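For developers, here is a minimal sketch of what one of those high-volume tasks, summarizing a long document, might look like when calling 1.5 Flash through the Google AI Python SDK (google-generativeai). The API key, the file name "report.txt", and the prompt are placeholders for illustration, not details from Google's announcement.

```python
import google.generativeai as genai

# Assumes an API key from Google AI Studio; "YOUR_API_KEY" is a placeholder.
genai.configure(api_key="YOUR_API_KEY")

# "gemini-1.5-flash" selects the fast, cost-efficient model described above.
model = genai.GenerativeModel("gemini-1.5-flash")

# A typical high-volume task for Flash: summarizing a long document.
# "report.txt" is a hypothetical local file standing in for a long source.
with open("report.txt") as f:
    long_document = f.read()

response = model.generate_content(
    "Summarize the following document in three bullet points:\n\n" + long_document
)
print(response.text)
```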
Over the past few months, Google has also made significant improvements to 1.5 Pro, its best model for general performance across a wide range of tasks. 1.5 Pro can now follow increasingly complex and nuanced instructions, including ones that specify product-level behaviour involving role, format, and style.
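As a sketch of what product-level instruction-following can look like in practice, the same SDK lets a developer pin role, format, and style with a system instruction. The persona and constraints below are illustrative examples, not taken from Google's keynote.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# A system instruction pins role, format, and style for the whole session;
# the support-agent persona here is hypothetical.
model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    system_instruction=(
        "You are a support agent for a travel app. "  # role
        "Reply in at most two sentences "             # format
        "in a friendly, professional tone."           # style
    ),
)

chat = model.start_chat()
reply = chat.send_message("My flight was cancelled. What are my options?")
print(reply.text)
```

Setting the behaviour once at the model level, rather than repeating it in every prompt, is what makes this suitable for product-level deployments where every response must stay on-persona.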
Gemini Nano, meanwhile, is expanding beyond text-only inputs to include images as well. Starting with Pixel, applications using Gemini Nano with Multimodality will be able to understand the world the way people do: not just through text, but also through sight, sound, and spoken language. Read more about Gemini Nano on Android.
Project Astra
As part of Google DeepMind's mission to build AI responsibly to benefit humanity, the company shared its progress on the future of AI assistants with Project Astra (advanced seeing and talking responsive agent).
To be truly useful, an agent needs to understand and respond to the complex and dynamic world just as people do, taking in and remembering what it sees and hears so it can understand context and take action. It also needs to be proactive, teachable, and personal, so users can talk to it naturally and without lag.

While incredible progress has been made in developing AI systems that can understand multimodal information, getting response time down to something conversational remains a difficult engineering challenge. Over the past few years, Google has been working to improve how its models perceive, reason, and converse so that the pace and quality of interaction feel more natural.
Gemini for Workspace
Since people are constantly searching their emails, Google is working to make Gmail much more powerful with Gemini. For example, a parent who wants to stay informed about everything going on at their child's school can now lean on Gemini to keep up.
Gemini for Android
With billions of Android users worldwide, Google is excited to integrate Gemini more deeply into the user experience. As a new AI assistant, Gemini is there to help users anytime, anywhere. Google has incorporated Gemini models into Android, including its latest on-device model, Gemini Nano with Multimodality, which processes text, images, audio, and speech to unlock new experiences while keeping information private on the device.
Google Search
One of Google's greatest areas of investment and innovation is its founding product, Search. Twenty-five years ago, Google created Search to help people make sense of the waves of information moving online. On mobile, it unlocked new types of questions and answers by using better context, location awareness, and real-time information. With advances in natural language understanding and computer vision, it enabled new ways to search: with your voice, with a hum to find a new favourite song, or with an image of a flower seen on a walk. Now users can even Circle to Search for those cool new shoes they might want to buy, with the reassurance of easy returns.
Of course, Search in the Gemini era takes this to a whole new level, combining Google's infrastructure strengths, the latest AI capabilities, its high bar for information quality, and its decades of experience connecting users to the richness of the web. The result is a product that does the work for the user: Google Search is generative AI at the scale of human curiosity, and the most exciting chapter of Search yet.
Audio Outputs
Google also shared progress on audio outputs. Audio Overviews in NotebookLM use Gemini 1.5 Pro to analyze source materials and generate a tailored, interactive audio dialogue, highlighting the potential of multimodality. Soon, users will be able to mix and match input and output modalities: an I/O system built for a new generation. And what if these boundaries could be pushed even further?
Google Photos
One example is Google Photos, which Google launched almost nine years ago. Since then, people have used it to organize their most cherished memories, and today more than 6 billion photos and videos are uploaded every single day.
People love using Photos to search across their lives, and with Gemini, Google is making that much easier. Imagine, for instance, that someone is at a parking station but can't recall their license plate number. Previously, they would have to search Photos by keyword and then sift through years' worth of photos to find license plates. Now, they can simply ask Photos. It recognizes the cars that appear frequently, determines which one is theirs, and returns the license plate number.