Recent models can reason about the visual contents of images. They can “think aloud” about the meaning of objects and the relationships between them. This capability enables more effective recognition of signs and other visual information, including their context within the image. How might it advance visual analysis, interpretation, and distant viewing?

  • Awesome-Large-Multimodal-Reasoning-Models

bit.ly/visual-reasoning¶


Training data for visual reasoning¶

Example from LLaVA-CoT

[Image: map of the thirteen British colonies with one colony highlighted in dark green]
Question:

What is the name of the colony shown?
A. Virginia
B. Connecticut
C. Georgia
D. North Carolina

<SUMMARY> The question requires identifying a specific colony on a map of the thirteen British colonies in North America. The task involves analyzing the geographical location and shape of a highlighted region on the map and comparing it to the known locations and shapes of the four provided options: Virginia, Connecticut, Georgia, and North Carolina.</SUMMARY>

<CAPTION>The image displays a simplified map of the eastern seaboard of North America. The thirteen colonies are shown, with one colony highlighted in dark green. The map includes the Atlantic Ocean and parts of the Great Lakes region. The highlighted colony is located in the southeastern region of the map, bordering other light-green colored colonies.</CAPTION>

<REASONING>To accurately determine the highlighted colony, a detailed analysis of the map is necessary. The steps are as follows:

  1. Regional Identification: The map's focus is the eastern coastline of North America, showcasing the thirteen colonies. The highlighted area resides in the southeastern part of this region, close to the Atlantic Ocean. This eliminates Connecticut, which is significantly north.

  2. Shape and Boundaries Analysis: The highlighted colony's shape is crucial. We observe a roughly triangular shape that tapers towards the south and extends along a significant portion of the coastline. This distinctive shape is key to accurate identification.

  3. Comparative Analysis of Options:

    • Virginia: While in the southeastern region, Virginia's shape is significantly different. It is longer and more slender, extending further north and lacking the distinct triangular form of the highlighted colony.
    • Connecticut: Connecticut is located far north of the highlighted region and can be immediately ruled out.
    • Georgia: Georgia, while in the southeastern region, is located south of the highlighted area. It has a different shape and does not share the boundaries displayed on the map.
    • North Carolina: North Carolina's geographical location and shape precisely match the highlighted area on the map. Its triangular form, coastal length, and neighboring colonies align perfectly with the visual representation.
  4. Elimination and Confirmation: By comparing the shape and location of the highlighted area with the four provided options, we can conclusively eliminate Virginia, Connecticut, and Georgia. The shape and position strongly confirm the highlighted colony as North Carolina.</REASONING>

<CONCLUSION>Based on the detailed analysis of the map's highlighted area and comparison with the provided options, the colony shown is North Carolina. Therefore, the correct answer is D.</CONCLUSION>
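
The four tagged stages in the example above make the output easy to split programmatically. The following is a minimal sketch, assuming the model emits each tag at most once and closes it properly; `parse_stages` and `extract_choice` are hypothetical helper names, not part of any library.

```python
import re

# Tag names are taken from the example above; the parser is an illustrative sketch.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def parse_stages(text):
    """Return a dict mapping each stage name to its text (missing stages are omitted)."""
    stages = {}
    for name in STAGES:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        if match:
            stages[name] = match.group(1).strip()
    return stages


def extract_choice(conclusion):
    """Pull the multiple-choice letter out of a conclusion, if one is stated."""
    match = re.search(r"answer is\s+([A-D])\b", conclusion)
    return match.group(1) if match else None


# Shortened stand-in for the full example output
sample = (
    "<SUMMARY>Identify the highlighted colony.</SUMMARY>\n"
    "<CAPTION>A map with one colony in dark green.</CAPTION>\n"
    "<REASONING>Compare location and shape with each option.</REASONING>\n"
    "<CONCLUSION>The colony shown is North Carolina. "
    "Therefore, the correct answer is D.</CONCLUSION>"
)

stages = parse_stages(sample)
print(list(stages))
print(extract_choice(stages["CONCLUSION"]))
```

Separating the stages this way makes it possible to check the caption and the reasoning against the image independently of whether the final answer happens to be correct.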

Demo of Qwen3-VL¶

https://huggingface.co/spaces/Qwen/Qwen3-VL-Demo

  • download the map image above and test with Qwen3
  • compare with thinking and without thinking
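
The same comparison can be scripted rather than run in the demo. The sketch below only builds the two request payloads; it assumes an OpenAI-compatible endpoint that toggles reasoning through an `enable_thinking` field passed in `extra_body`. That field name is an assumption to verify against the provider's documentation, `build_request` is a hypothetical helper, and the image URL is a placeholder.

```python
def build_request(model, image_url, question, thinking):
    """Build keyword arguments for one chat completion call (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        # Assumption: the endpoint reads this provider-specific field to switch
        # thinking on or off. Check the provider documentation before relying on it.
        "extra_body": {"enable_thinking": thinking},
    }


question = "What is the name of the colony shown?"
image_url = "https://example.org/colonies-map.png"  # placeholder URL

with_thinking = build_request("qwen3-vl-plus", image_url, question, thinking=True)
without_thinking = build_request("qwen3-vl-plus", image_url, question, thinking=False)

# With a configured OpenAI-compatible client, each variant could be sent as:
#   response = client.chat.completions.create(**with_thinking)
# Some models may require streaming when thinking is enabled.
```

Keeping everything but the thinking switch identical means any difference between the two answers can be attributed to the reasoning step.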

More:
Biology example (malaria microscopy)

Qwen3-VL technical paper: "The majority of our multi-modal reasoning data consists of over 60 million K–12 and undergraduate-level exercises, meticulously curated through a rigorous cleaning and reformulation pipeline."
See also: Liv G. d'Aliberti & Manoel Horta Ribeiro, "The Illusion of Insight in Reasoning Models" https://arxiv.org/abs/2601.00514

Multi-image reasoning¶

In [ ]:
# @title Fetch IIIF manifest for [Photographs](https://eap.bl.uk/archive-file/EAP699-23-1)
import requests
manifest_url = "https://eap.bl.uk/archive-file/EAP699-23-1/manifest"

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/11"
}
manifest = requests.get(manifest_url, headers=header)
if manifest.status_code != 200:
    raise Exception(f"Error downloading manifest: {manifest.status_code}")

manifest = manifest.json()
In [ ]:
images = []
for item in manifest['items']:
  image_uri = item['items'][0]['items'][0]['body']['id']
  images.append(image_uri)
images
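
The loop above takes the first annotation of the first annotation page on each canvas and raises an error if any level is missing. A more defensive variant is sketched below, assuming the manifest follows the IIIF Presentation API 3.0 layout (canvases, then annotation pages, then annotations with an image body); `iter_image_urls` is a hypothetical helper and the toy manifest is invented for illustration.

```python
def iter_image_urls(manifest):
    """Yield the image URL of every image annotation in a IIIF v3 style manifest."""
    for canvas in manifest.get("items", []):
        for page in canvas.get("items", []):
            for annotation in page.get("items", []):
                body = annotation.get("body", {})
                # A body may be a single resource or a list of resources
                bodies = body if isinstance(body, list) else [body]
                for resource in bodies:
                    # Treat a missing type as an image, as the original loop does
                    if resource.get("type", "Image") == "Image" and "id" in resource:
                        yield resource["id"]


# Toy manifest (invented) with one normal canvas and one empty canvas
toy_manifest = {
    "items": [
        {"items": [{"items": [{"body": {"id": "https://example.org/a.jpg", "type": "Image"}}]}]},
        {"items": []},
    ]
}

print(list(iter_image_urls(toy_manifest)))
```

Canvases without an image are skipped silently here; logging them instead may be preferable when completeness matters.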
In [ ]:
from google.colab import userdata

base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
api_key= userdata.get('DASHSCOPE_API_KEY') # We will provide you with an API key for the workshop
model = "qwen3-vl-plus"  # Alternatives: "qwen3-vl-flash" (faster and cheaper), "qwen-vl-ocr" (for typewritten texts)
# For a full list of available models, see https://www.alibabacloud.com/help/en/model-studio/models
In [ ]:
prompt = """
This is a series of photographs. Think step-by-step to identify common elements and connections between the images. Give a detailed explanation of your reasoning and a summary of your findings.
"""
In [ ]:
import base64
import requests
from openai import OpenAI
from IPython.display import Markdown as md

# Function to download an image from a URL and return its base64 data URL representation
def url_to_base64_data_url(image_url):
    try:
        # Send a browser-like User-Agent header, which some servers require
        response = requests.get(image_url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
        response.raise_for_status()  # Raise an exception for bad status codes

        image_data = response.content

        # Try to get the mime type from the response headers; fallback if not available
        mime_type = response.headers.get("Content-Type")
        if not mime_type:
            # Attempt to guess mime type from content if header is missing
            # This is a basic guess and might not be perfect for all image types
            if image_data.startswith(b'\x89PNG'):
                mime_type = 'image/png'
            elif image_data.startswith(b'\xff\xd8'):
                mime_type = 'image/jpeg'
            elif image_data.startswith(b'GIF89a') or image_data.startswith(b'GIF87a'):
                mime_type = 'image/gif'
            else:
                mime_type = 'application/octet-stream' # Default fallback

        base64_encoded_data = base64.b64encode(image_data).decode("utf-8")

        return f"data:{mime_type};base64,{base64_encoded_data}"
    except requests.exceptions.RequestException as e:
        print(f"Error downloading image from {image_url}: {e}")
        return None

# Create a synchronous OpenAI-compatible client that points at the DashScope endpoint
client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# Prepare the image content with base64 encoded images
image_contents = []
for img_url in images:
    base64_image_data_url = url_to_base64_data_url(img_url)
    if base64_image_data_url:
        image_contents.append({"type": "image_url", "image_url": {"url": base64_image_data_url}})
    else:
        # Handle cases where image download failed, e.g., skip or raise an error
        print(f"Skipping image {img_url} due to download error.")

# Add the text prompt at the end
image_contents.append({"type": "text", "text": prompt})

response = client.chat.completions.create(
    model=model,  # set in the configuration cell above
    messages=[
        {
            "role": "user",
            "content": image_contents
        }
    ]
)
md(response.choices[0].message.content)
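
Sending many full-resolution images as base64 makes the request large. If the image URIs follow the IIIF Image API pattern (`.../{region}/{size}/{rotation}/{quality}.{format}`), a smaller derivative can be requested by rewriting the size segment before downloading. This is a sketch under that assumption; `with_max_side` is a hypothetical helper, and URLs that do not match the pattern are returned unchanged.

```python
import re

# Matches the tail of a IIIF Image API URL: /{region}/{size}/{rotation}/{quality}.{format}
_IIIF_TAIL = re.compile(
    r"^(?P<base>.+)/(?P<region>[^/]+)/(?P<size>[^/]+)"
    r"/(?P<rotation>!?\d+(?:\.\d+)?)"
    r"/(?P<quality>default|color|gray|bitonal)\.(?P<fmt>\w+)$"
)


def with_max_side(url, max_side=1024):
    """Rewrite the size segment to '!w,h' so the image fits inside a square box."""
    match = _IIIF_TAIL.match(url)
    if not match:
        return url  # not a IIIF Image API URL, so leave it alone
    return (
        f"{match['base']}/{match['region']}/!{max_side},{max_side}"
        f"/{match['rotation']}/{match['quality']}.{match['fmt']}"
    )


# Invented example URL
print(with_max_side("https://example.org/iiif/abc/full/max/0/default.jpg", 800))
```

Applied to the list built earlier, this would look like `images = [with_max_side(u) for u in images]`. Whether the server honours the requested size depends on its IIIF configuration.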

The same multi-image approach also works for dermatology images, for example to detect changes between images.

Thinking and Text Recognition (OCR + HTR)¶

https://huggingface.co/numind/NuMarkdown-8B-Thinking

Demo: https://huggingface.co/spaces/numind/NuMarkdown-8B-Thinking
Test with: https://github.com/apjanco/files/blob/master/EAP1477_MFC_B04_Doc10_UrrutiavsCarrasco_IMG_014(1).jpg?raw=true


Gemini 3.0 Using AI Sandbox¶

In [ ]:
%pip install -q portkey-ai
In [ ]:
from google.colab import userdata
from portkey_ai import Portkey


AI_SANDBOX_KEY = userdata.get('SANDBOX_KEY')

client = Portkey(api_key=AI_SANDBOX_KEY)
In [ ]:
# @title
model_name = "gemini-3-pro-preview" # @param ["gpt-5","gpt-4o","gemini-3-pro-preview"]
In [ ]:
# Install the Portkey AI Gateway package with pip (mimetypes is part of the Python standard library)
#   pip install portkey-ai
#
# For more information on the SDK see https://portkey.ai/docs/api-reference/sdk/python
#
from portkey_ai import Portkey
import os
import base64
from mimetypes import guess_type
from google.colab import userdata
from IPython.display import Markdown as md

# Before executing this code, store the API key as a Colab secret named SANDBOX_KEY
# and grant this notebook access to it

# Read the API key from Colab secrets
AI_SANDBOX_KEY = userdata.get('SANDBOX_KEY')

client = Portkey(api_key=AI_SANDBOX_KEY)


# Base 64 encode local image and return text to be included in AI prompt
def local_image_to_data_url(image_path):
    """
    Return a base64 data URL for a local image file
    """
    mime_type, _ = guess_type(image_path)

    if mime_type is None:
        mime_type = "application/octet-stream"

    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode("utf-8")

    return f"data:{mime_type};base64,{base64_encoded_data}"


# This function submits a simple prompt and image file to the chosen model
def image_prompt_example(model_to_be_used, image_file, prompt):
    # Send the request through the Portkey client created above

    try:
        response = client.chat.completions.create(
            model=model_to_be_used,
            # Prompt parameters may also be defined, depending on model capabilities
            #temperature=0.5, # temperature = how random the sampling is; higher values give more varied output
            #max_tokens=1000, # max_tokens = upper limit on the number of tokens the model may generate in its response
            #top_p=0.5, # top_p = nucleus sampling; only the most likely tokens whose cumulative probability reaches top_p are considered (0 to 1)
            messages=[
                {"role": "user", "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": local_image_to_data_url(image_file)}}]}, # user prompt
            ]
        )
        return md(response.choices[0].message.content)

    except Exception as e:
        # Not every exception has a .message attribute, so print the exception itself
        print(e)

Gemini As Indiana Jones: How Gemini 3.0 Deciphered The Mystery Of A Nuremberg Chronicle Leaf's 500-Year-Old Roundels¶

GDELT Project blog post

In [ ]:
!wget https://blog.gdeltproject.org/wp-content/uploads/16187-scaled.jpg
In [ ]:
from PIL import Image
Image.open("16187-scaled.jpg")
In [ ]:
prompt = """Attached are two pages from the Nuremberg Chronicle.
At the bottom of the page a previous owner of the book drew four circles and wrote Latin text in them.
It was probably written sometime from 1493 to 1600s for context.
I've also attached zoomed up images of the four circles.
For each of them, transcribe the Latin and translate into English and tell me its meaning and importance.
They relate to the text in the page, so use the page text to decipher and contextualize them."""
In [ ]:
image_prompt_example(model_name,"16187-scaled.jpg",prompt)
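
The prompt mentions zoomed images of the four circles, while the call above sends a single file. One way to send several images in one request is to extend the message content with one `image_url` entry per file, as sketched below. `build_multi_image_content` is a hypothetical helper, and the crop file names in the usage comment are placeholders for files you would create yourself.

```python
import base64
from mimetypes import guess_type


def file_to_data_url(path):
    """Encode a local file as a base64 data URL."""
    mime_type, _ = guess_type(path)
    with open(path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type or 'application/octet-stream'};base64,{encoded}"


def build_multi_image_content(prompt, image_paths):
    """Return a message content list: the text prompt followed by one entry per image."""
    content = [{"type": "text", "text": prompt}]
    for path in image_paths:
        content.append({"type": "image_url", "image_url": {"url": file_to_data_url(path)}})
    return content


# Usage sketch (file names other than the downloaded page are hypothetical):
#   paths = ["16187-scaled.jpg", "roundel_1.jpg", "roundel_2.jpg"]
#   response = client.chat.completions.create(
#       model=model_name,
#       messages=[{"role": "user", "content": build_multi_image_content(prompt, paths)}],
#   )
```

Placing the full page first and the crops after it keeps the order consistent with how the prompt describes the attachments.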

Build a Reasoning Model (From Scratch)¶

by Sebastian Raschka
