Google launched Ask Photos, which will enable users to search their own photos and videos in natural language.

Brain Titan
4 min read · May 18, 2024

Ask Photos is an upcoming experimental feature in Google Photos that leverages the AI model Gemini to enable users to search for photos and videos using natural language questions and assist with related tasks.

  • Ask Photos allows users to search for photos and videos using natural language questions.
  • For example, a user could ask, ‘Where did I go camping last year?’ or ‘When do my vouchers expire?’ and Ask Photos will find the relevant photos and information.

Key features of Ask Photos:

  1. Natural Language Search: Users can search for photos and videos by asking questions in natural language. Instead of having to remember specific keywords or shooting dates, they can simply ask, ‘What country did we camp in last year?’ or ‘When was my child’s first birthday?’ and Ask Photos will find the relevant photos.
  2. Contextual Understanding and Detail Extraction: The Gemini AI model understands the context and subject matter of a photo and extracts its details. For example, a user could ask, ‘What did our Christmas decorations look like in years past?’ Ask Photos will analyze the decorations, scenes, and other details in the photos and return relevant results.
  3. Task Assistance: Ask Photos not only helps users search for photos, but also assists with a variety of tasks:
  • Creating travel highlights: Users can ask Ask Photos to put together a collection of travel photos. With a single request, they get a selection of featured photos and personalized sharing text.
  • Writing personalized social media posts: Ask Photos can generate personalized descriptions based on the content of the photos, making it easy for users to share them on social media.
  4. Multimodal Capabilities: Gemini’s multimodal capabilities allow it to process and understand complex information in photos, including text, scenes, and people. For example, a user could ask, ‘What were the themes of Lena’s birthday parties?’ Ask Photos analyzes the birthday cake, decorations, and other background details to answer the question.
  5. Dynamic Adjustment and Learning: Ask Photos can adjust and learn based on user feedback. If the user corrects an answer or provides additional information, Ask Photos remembers those details to provide more accurate results in future searches and tasks.

The working mechanism behind Ask Photos can be broken down into three main steps: understanding the question, generating a response, and ensuring security and remembering corrections. Here’s a detailed explanation:

1. Understanding the Question

Ask Photos first understands the user’s query and forms a plan to find the answer.

  • Parsing the query: Using natural language processing, Ask Photos parses the question entered by the user, identifying relevant keywords such as places, people, and dates, as well as longer natural language concepts such as ‘themed birthday party’.
  • Forming a search plan: Based on the parsing results, Ask Photos generates a search plan that identifies the specific information to be found.
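
To make the two steps above more concrete, here is a minimal sketch of turning a question into a structured search plan. Google has not published how Ask Photos does this, and the real feature relies on Gemini rather than keyword rules, so everything here (the `SearchPlan` class, the `parse_query` function, and the stop-word list) is a hypothetical illustration only.

```python
import re
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class SearchPlan:
    """Hypothetical structured plan derived from a free-text question."""
    keywords: List[str] = field(default_factory=list)
    year: Optional[int] = None


# Tiny illustrative stop-word list; a real system would not rely on one.
STOP_WORDS = {
    "where", "what", "when", "did", "was", "i", "we", "my",
    "the", "a", "in", "go", "last", "year",
}


def parse_query(question: str, today: date) -> SearchPlan:
    """Extract a target year and content keywords from the question."""
    text = question.lower()

    # Resolve the relative phrase "last year", else look for an explicit year.
    year = None
    if "last year" in text:
        year = today.year - 1
    else:
        match = re.search(r"\b(19|20)\d{2}\b", text)
        if match:
            year = int(match.group(0))

    # Whatever is not a stop word is treated as a content keyword.
    words = re.findall(r"[a-z']+", text)
    keywords = [w for w in words if w not in STOP_WORDS]
    return SearchPlan(keywords=keywords, year=year)


plan = parse_query("Where did I camp last year?", today=date(2024, 5, 18))
print(plan)
```

With the sample question, the plan contains the single keyword `camp` and the year 2023, which is the kind of structured request a search backend could then act on.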

2. Generating a Response

Once it understands the question, Ask Photos generates a response in multiple steps:

  • Analyzing Search Results: The search results are analyzed to determine which photos and videos best match the user’s query.
  • Multimodal Capabilities: Leveraging Gemini’s multimodal capabilities, Ask Photos understands what is happening in each photo, and can even read text in the image if required.
  • Building a Response: Based on the results of the analysis, Ask Photos generates a detailed and useful response, selecting and returning the photos and videos that best meet the user’s needs.
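
The selection step can be pictured as scoring each photo against the query and returning the best matches. The sketch below is an assumption-heavy toy: the `Photo` fields stand in for whatever a multimodal model might extract (scene labels and text read from the image), the sample library is invented, and simple keyword overlap replaces the far richer understanding Gemini provides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Photo:
    filename: str
    year: int
    labels: List[str]  # assumed: scene labels from an image-understanding model
    text: str = ""     # assumed: text read from the image, if any


def score(photo: Photo, keywords: List[str], year: Optional[int]) -> int:
    """Count how many keywords appear in the photo's labels or text."""
    if year is not None and photo.year != year:
        return 0
    searchable = set(photo.labels) | set(photo.text.lower().split())
    return sum(1 for keyword in keywords if keyword in searchable)


def rank(photos: List[Photo], keywords: List[str],
         year: Optional[int] = None, limit: int = 3) -> List[Photo]:
    """Return up to `limit` photos with a non-zero score, best first."""
    scored = [(score(p, keywords, year), p) for p in photos]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [photo for _, photo in matches[:limit]]


# Invented sample library, for illustration only.
library = [
    Photo("IMG_001.jpg", 2023, ["tent", "camping", "forest"], "Pine Ridge Campground"),
    Photo("IMG_002.jpg", 2023, ["birthday", "cake"]),
    Photo("IMG_003.jpg", 2022, ["camping", "lake"]),
]

results = rank(library, keywords=["camping"], year=2023)
print([p.filename for p in results])
```

Here only `IMG_001.jpg` is returned: the 2022 camping photo is filtered out by year, and the birthday photo matches no keyword.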

3. Ensuring Security and Remembering Corrections

Throughout the process, Ask Photos takes multiple layers of steps to ensure that the response is safe and appropriate, and it remembers the user’s corrections.

  • Security: Ask Photos is an experimental feature, so not every response is guaranteed to be correct. Google nevertheless employs multiple layers of security measures and AI models to keep responses safe and appropriate.
  • Remembering corrections: If a user corrects an answer or provides additional information, Ask Photos remembers those details to provide a more accurate response to future queries.
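
As a rough picture of what remembering corrections could look like, the sketch below stores user-supplied facts and retrieves the ones relevant to a later question. The `CorrectionMemory` class and its methods are hypothetical, the stored fact is invented sample data, and nothing here reflects how Google actually persists or applies feedback.

```python
from typing import Dict, List


class CorrectionMemory:
    """Hypothetical store of user corrections, keyed by the subject they concern."""

    def __init__(self) -> None:
        self._facts: Dict[str, str] = {}

    def remember(self, subject: str, fact: str) -> None:
        """Save a correction; a later one for the same subject replaces it."""
        self._facts[subject.lower()] = fact

    def recall(self, question: str) -> List[str]:
        """Return every stored fact whose subject is mentioned in the question."""
        text = question.lower()
        return [fact for subject, fact in self._facts.items() if subject in text]


memory = CorrectionMemory()
# Invented example of a correction supplied by the user.
memory.remember("camping", "The 2023 camping trip was at Pine Ridge, not Lake View")

print(memory.recall("Where did I go camping last year?"))
print(memory.recall("When do my vouchers expire?"))
```

The first question mentions the stored subject, so the correction is returned and could be supplied as extra context for the answer; the second question matches nothing and returns an empty list.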

Working schematic

  1. User inputs a question:
  • User: Where did I camp last year?
  2. Understand the question:
  • Parse the query, recognizing the keywords ‘last year’ and ‘camping’.
  • Form a search plan to find relevant photos.
  3. Generate a response:
  • Analyze the search results and select the most relevant camping photos.
  • Use Gemini’s multimodal capabilities to understand the scene and details in each photo.
  • Construct and return a detailed response with the photos and information that best match the query.
  4. Ensure security and remember corrections:
  • Apply security measures to ensure the response is appropriate.
  • Remember user feedback and corrections to improve the accuracy of future responses.

Original post: https://blog.google/products/photos/ask-photos-google-io-2024/
