Understanding Hand-Drawn Images

Learn how our pictionary bot understands hand-drawn images and evaluates them using the image-to-text models in Gemini, and how images can be sent as prompts to Google Gemini.

Behind the scenes

Multimodal models such as Gemini can work with images as input. This enables them to analyze an image and generate a textual description of its content. Here’s a brief overview of how most image captioning models work:

  • The image is first processed into a format that is easily digestible for the model.

  • A CNN (Convolutional Neural Network) encoder, which specializes in processing images and video, analyzes the image and extracts features like edges, objects, and their spatial relationships. Acting like a data summarizer, it transforms the raw visuals into a compact representation that captures the essence of the image’s content.

  • The decoder, which works like a translator, receives the encoded image representation from the CNN and turns that condensed information back into language. It generates the caption one word at a time; at each step, it considers the words it has generated so far, the encoded image representation, and its internal knowledge of language.

  • The decoder predicts the next word in the caption based on this accumulated information. The process continues until a stopping sequence is generated or a maximum length is reached (a simplified code sketch of this loop follows the list).
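To make the encoder-decoder loop above concrete, here is a minimal, simplified sketch in Python. The specific choices here, a ResNet-18 encoder, a GRU decoder, greedy word selection, and the vocabulary and token IDs, are illustrative assumptions for teaching purposes only; they are not the architecture Gemini actually uses.

```python
# A minimal sketch of the CNN encoder + decoder captioning loop described above.
# ResNet-18, the GRU decoder, and greedy decoding are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models


class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: compresses the image into a feature vector.
        cnn = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.project = nn.Linear(512, hidden_dim)
        # Decoder: generates the caption one token at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def generate(self, image, start_id, end_id, max_len=20):
        # Encode the image once into a compact representation.
        feats = self.encoder(image).flatten(1)          # (1, 512)
        hidden = self.project(feats).unsqueeze(0)       # initial decoder state
        token = torch.tensor([[start_id]])
        caption = []
        for _ in range(max_len):
            emb = self.embed(token)                      # embed the previous word
            output, hidden = self.gru(emb, hidden)       # combine it with the image state
            token = self.out(output).argmax(-1)          # greedily pick the next word
            if token.item() == end_id:                   # stop when the end token appears
                break
            caption.append(token.item())
        return caption
```

The loop mirrors the steps listed above: the image is encoded once, and the decoder repeatedly predicts the next word from the previous words and the encoded image until a stop token or the maximum length is reached.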

Prompts with images

Earlier, in the “Multimodal Prompting with Google Gemini” lesson, we used the generate_content() method to send an image as a prompt. Here’s a quick refresher on how to include an image in a prompt; the playground below is set up to let you upload an image and try it yourself.
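For reference, a minimal sketch of such a call with the google-generativeai Python SDK might look like the following. The API key placeholder, the model name (gemini-1.5-flash), the image file name, and the prompt text are all assumptions you would replace with your own values:

```python
import google.generativeai as genai
from PIL import Image

# Configure the client; "GEMINI_API_KEY" is a placeholder for your own key.
genai.configure(api_key="GEMINI_API_KEY")

# The model name and image file are illustrative choices.
model = genai.GenerativeModel("gemini-1.5-flash")
drawing = Image.open("sketch.png")

# generate_content() accepts a list that mixes text and images in one prompt.
response = model.generate_content(
    ["What object does this hand-drawn sketch depict? Answer in one word.", drawing]
)
print(response.text)
```

Passing the text instruction and the image together in a single list is what lets the model ground its answer in the drawing rather than in the text alone.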
