Video Understanding with Vision-Language Models¶

A Deep Dive into Qwen-VL Architecture & Video Processing¶

tinyurl.com/video-understanding¶

Author: Andrew Janco
Date: April 2026
Focus: Qwen3-VL / Qwen2.5-VL architecture, video sampling, mRoPE temporal embeddings, and practical considerations


Table of Contents¶

  1. Introduction: Video Understanding as an ML Task
  2. The Qwen-VL Model Family
  3. Video Sampling & Frame Extraction
  4. From Frames to Patches: The Vision Encoder Pipeline
  5. Multimodal Rotary Position Embedding (M-RoPE)
  6. Temporal Embeddings: How Time Patches Are Created
  7. Effective Uses of VLMs for Video Understanding
  8. Challenges & Limitations
  9. References

1. Introduction: Video Understanding as an ML Task ¶

Video understanding is the task of extracting semantic meaning from video data — answering questions, describing actions, localizing events in time, and reasoning about dynamic visual scenes. Unlike image understanding, video introduces the temporal dimension: the model must reason not just about what is in a scene, but when things happen, how they change, and why transitions occur.

Why is Video Understanding Hard?¶

| Challenge | Description |
| --- | --- |
| Temporal Complexity | Events unfold over time, so the model must relate frames that may be seconds or minutes apart |
| Massive Data Volume | A 30 fps, 1080p video has ~2M pixels per frame × 30 frames/sec ≈ 60M pixels/sec |
| Redundancy | Adjacent frames are highly similar, so naive processing wastes compute on near-duplicate information |
| Long-Range Dependencies | Understanding a plot twist at minute 45 may require context from minute 5 |
| Multi-Modal Reasoning | Audio, text overlays, scene changes, and visual content all carry meaning |

The VLM Approach¶

Modern Vision-Language Models (VLMs) tackle video understanding by:

  1. Sampling a subset of frames from the video
  2. Encoding each frame through a Vision Transformer (ViT) into visual tokens
  3. Fusing visual tokens with text tokens using positional embeddings that encode spatial and temporal information
  4. Generating text responses via a Large Language Model (LLM) backbone

This notebook focuses on how Qwen3-VL (and its predecessor Qwen2.5-VL) implements each of these steps, with particular attention to the novel Multimodal Rotary Position Embedding (M-RoPE) mechanism.


2. The Qwen-VL Model Family ¶

The Qwen Vision-Language series has evolved rapidly:

| Generation | Release | Key Innovations |
| --- | --- | --- |
| Qwen-VL | Aug 2023 | First VL model in the Qwen family; fixed resolution |
| Qwen2-VL | Sep 2024 | Naive Dynamic Resolution, M-RoPE, unified image/video processing, 3D convolutions |
| Qwen2.5-VL | Jan 2025 | Dynamic FPS training, absolute time encoding, window attention in ViT, SwiGLU/RMSNorm |
| Qwen3-VL | Sep 2025 | Interleaved-MRoPE, DeepStack multi-level ViT fusion, Text-Timestamp Alignment, 256K native context |

Architecture Overview¶

All Qwen-VL models follow the canonical VLM pipeline:

Video/Image → Vision Encoder (ViT) → Cross-Modal Connector (MLP) → Large Language Model (Qwen LLM)

Key architectural components:

  • Vision Encoder: A ~675M parameter ViT (in Qwen2-VL/2.5-VL) with dynamic resolution support. Qwen2.5-VL trains it from scratch, while Qwen2-VL initializes it from pretrained DFN weights. Uses 2D-RoPE internally for spatial position encoding.
  • Token Compression: After the ViT, an MLP layer compresses each adjacent 2×2 group of visual tokens into a single token (4× reduction). The resulting visual sequence is wrapped in <|vision_start|> and <|vision_end|> delimiter tokens.
  • 3D Convolutions: For video, 3D convolutions with temporal depth of 2 process pairs of frames as 3D tubes rather than independent 2D patches — doubling temporal coverage without increasing sequence length.
  • LLM Backbone: The Qwen2/Qwen2.5/Qwen3 language model series, receiving the interleaved visual and text tokens.

Model Sizes¶

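| Model | ViT Params | LLM Params | Total |
| --- | --- | --- | --- |
| Qwen2.5-VL-3B | ~675M | 3B | ~3.7B |
| Qwen2.5-VL-7B | ~675M | 7.6B | ~8.3B |
| Qwen2.5-VL-72B | ~675M | 72B | ~72.7B |
| Qwen3-VL-8B | not listed | 8B | ~8B |
| Qwen3-VL-32B | not listed | 32B | ~32B |
| Qwen3-VL-235B-A22B | not listed | 235B (22B active, MoE) | ~235B |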

3. Video Sampling & Frame Extraction ¶

A raw video contains far too much data to process directly. The first critical step is frame sampling — selecting a representative subset of frames.

3.1 Uniform Temporal Sampling (Fixed FPS)¶

The simplest approach is to sample frames at a fixed rate. Qwen2-VL uses 2 frames per second (fps) by default:

Original Video: 30 fps × 60 seconds = 1,800 frames
Sampled at 2 fps: 2 × 60 = 120 frames
Compression ratio: 15×

This is effective for most content but can miss rapid events or waste tokens on static scenes.

In [ ]:
# Demonstration: Uniform temporal sampling
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# Simulate a 30-second video at 30fps
original_fps = 30
duration_sec = 30
total_frames = original_fps * duration_sec  # 900 frames

# Sample at different rates
sample_rates = [1, 2, 4, 8]  # fps
colors = ['#2ecc71', '#3498db', '#e74c3c', '#f39c12']

fig, axes = plt.subplots(len(sample_rates), 1, figsize=(14, 6), sharex=True)
fig.suptitle('Uniform Temporal Sampling at Different FPS Rates', fontsize=14, fontweight='bold')

for ax, fps, color in zip(axes, sample_rates, colors):
    n_sampled = fps * duration_sec
    sampled_times = np.linspace(0, duration_sec, n_sampled, endpoint=False)

    # Show all original frames as light ticks
    all_times = np.linspace(0, duration_sec, total_frames, endpoint=False)
    ax.eventplot([all_times], colors=['#ddd'], lineoffsets=0, linelengths=0.3)

    # Show sampled frames
    ax.eventplot([sampled_times], colors=[color], lineoffsets=0, linelengths=0.8, linewidths=1.5)
    ax.set_ylabel(f'{fps} fps\n({n_sampled} frames)', fontsize=9)
    ax.set_ylim(-0.6, 0.6)
    ax.set_yticks([])
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)

axes[-1].set_xlabel('Time (seconds)', fontsize=11)
plt.tight_layout()
plt.show()

print(f"Original video: {total_frames} frames at {original_fps} fps")
for fps in sample_rates:
    n = fps * duration_sec
    # Use :g so fractional ratios (e.g. 7.5× at 4 fps) are not rounded to a misleading integer
    print(f"  Sampled at {fps} fps: {n} frames ({total_frames/n:g}× reduction)")

3.2 Dynamic FPS Sampling (Qwen2.5-VL & Qwen3-VL)¶

A major innovation in Qwen2.5-VL is dynamic FPS training. Rather than fixing the sample rate, the model is trained with videos sampled at varying frame rates. This allows:

  • Adaptive temporal resolution: Fast-paced action scenes can be sampled at higher FPS, while slow/static scenes use lower FPS
  • Absolute time alignment: M-RoPE IDs are aligned directly with wall-clock time, so the model learns the pace of time through the intervals between temporal position IDs
  • Second-level event localization: Because the model understands absolute time, it can pinpoint events to specific timestamps
Static lecture (mostly still):      Sample at 0.5 fps → 30 frames/minute
Sports highlights (fast action):    Sample at 4 fps   → 240 frames/minute  
Surveillance footage (mixed):       Sample at 1 fps   → 60 frames/minute
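
One way to act on these per-content rates is to set the `fps` key on each video entry before the messages are handed to `process_vision_info()`. The sketch below only builds the message dictionaries and does not call the library. The `FPS_BY_CONTENT` table, the helper names, and the file path are hypothetical, and the rates are the illustrative ones listed above rather than values prescribed by Qwen.

```python
# Hypothetical mapping from content type to sampling rate (illustrative values only).
FPS_BY_CONTENT = {"lecture": 0.5, "sports": 4.0, "surveillance": 1.0}


def build_video_message(video_path, content_type, question):
    """Build a chat message whose video entry carries a per-video `fps` hint."""
    return {
        "role": "user",
        "content": [
            {"type": "video", "video": video_path, "fps": FPS_BY_CONTENT[content_type]},
            {"type": "text", "text": question},
        ],
    }


def frames_per_minute(fps):
    """Number of frames sampled from one minute of video at the given rate."""
    return fps * 60


# Hypothetical file path, used only to show the shape of the message.
message = build_video_message("file:///path/to/match.mp4", "sports", "Who scores first?")
for name, fps in FPS_BY_CONTENT.items():
    print(f"{name:>12}: {fps} fps -> {frames_per_minute(fps):.0f} frames/minute")
```

Keeping the rate in the message itself means each video in a batch can use a different temporal resolution without changing any global default.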

3.3 Token Budget Management¶

To balance quality and compute, Qwen2-VL limits the total visual tokens per video to 16,384 during training. With dynamic resolution, each frame can produce a variable number of tokens, so:

$$\text{tokens\_per\_video} = \sum_{i=1}^{N_{\text{frames}}} \text{tokens}(\text{frame}_i) \leq 16384$$

The resolution of each frame is dynamically adjusted to stay within budget:

  • More frames → lower resolution per frame
  • Fewer frames → higher resolution per frame
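
The trade-off in the two bullets above can be made concrete by inverting the budget inequality. This is a minimal sketch under simplifying assumptions that are not taken from the Qwen code: frames are square, every frame gets the same resolution, and each pair of frames counts once because of the temporal depth of 2. The function name `max_side_for_budget` is hypothetical.

```python
import math


def max_side_for_budget(n_frames, token_budget=16384, patch_size=14,
                        merge=2, temporal_patch=2):
    """Largest square frame side (a multiple of patch_size * merge) that fits the budget.

    Returns (side_in_pixels, total_visual_tokens).
    """
    factor = patch_size * merge                              # pixels per merged-token side (28)
    temporal_slots = math.ceil(n_frames / temporal_patch)    # frame pairs after temporal merging
    tokens_per_slot = token_budget // temporal_slots         # budget share of one frame pair
    side_tokens = math.isqrt(tokens_per_slot)                # side of a square grid of merged tokens
    return side_tokens * factor, temporal_slots * side_tokens ** 2


for n_frames in (30, 120, 240):
    side, total = max_side_for_budget(n_frames)
    print(f"{n_frames:>3} frames -> up to {side}x{side} px per frame, {total:,} tokens")
```

Under these assumptions, quadrupling the frame count roughly halves the affordable side length, which is the inverse relationship the bullets describe.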

Qwen3-VL extends this further with a native 256K context window, expandable to 1M tokens using YaRN (Yet another RoPE extensioN), enabling hour-long video comprehension.

In [ ]:
# Demonstration: Token budget allocation across frames

def compute_frame_tokens(height, width, patch_size=14, merge_factor=2):
    """Compute visual tokens for a single frame after ViT + 2x2 merging."""
    h_patches = height // patch_size
    w_patches = width // patch_size
    # After 2x2 merging
    merged_h = h_patches // merge_factor
    merged_w = w_patches // merge_factor
    return merged_h * merged_w

# Scenario: 60-second video, different sampling strategies
duration = 60  # seconds
token_budget = 16384

scenarios = {
    'High FPS, Low Res\n(4 fps, 224×224)': {'fps': 4, 'h': 224, 'w': 224},
    'Medium FPS, Medium Res\n(2 fps, 448×448)': {'fps': 2, 'h': 448, 'w': 448},
    'Low FPS, High Res\n(0.5 fps, 896×896)': {'fps': 0.5, 'h': 896, 'w': 896},
}

fig, ax = plt.subplots(figsize=(10, 5))
x_pos = np.arange(len(scenarios))
bar_colors = ['#3498db', '#2ecc71', '#e74c3c']

for i, (label, cfg) in enumerate(scenarios.items()):
    n_frames = int(cfg['fps'] * duration)
    # 3D conv merges pairs of frames
    effective_frames = n_frames // 2
    tokens_per_frame = compute_frame_tokens(cfg['h'], cfg['w'])
    total_tokens = effective_frames * tokens_per_frame

    bar = ax.bar(i, total_tokens, color=bar_colors[i], alpha=0.8, edgecolor='white', linewidth=2)
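    # Tokens are counted per frame pair, because the 3D conv merges two frames into one temporal slot
    ax.text(i, total_tokens + 300,
            f"{n_frames} frames → {effective_frames} pairs\n{tokens_per_frame} tok/pair\n= {total_tokens:,} total",
            ha='center', va='bottom', fontsize=9, fontweight='bold')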

ax.axhline(y=token_budget, color='red', linestyle='--', linewidth=2, label=f'Token budget ({token_budget:,})')
ax.set_xticks(x_pos)
ax.set_xticklabels(scenarios.keys(), fontsize=9)
ax.set_ylabel('Total Visual Tokens', fontsize=11)
ax.set_title('Token Budget Trade-offs: FPS vs Resolution (60s video)', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.set_ylim(0, max(20000, token_budget * 1.5))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.show()

3.4 Where Frame Sampling Happens in Code¶

Let's trace through the actual source code in qwen-vl-utils/src/qwen_vl_utils/vision_process.py to see exactly how frame sampling works.

The entry point is process_vision_info(), which is called by user code to prepare inputs for the model. For each video, it calls fetch_video(), which in turn calls a backend reader (decord, torchvision, or torchcodec). Here's the chain:

process_vision_info(messages)
  └─→ fetch_video(ele)
        ├─→ _read_video_decord(ele)   # or torchvision/torchcodec
        │     ├─→ calculate_video_frame_range(ele, total_frames, video_fps)
        │     ├─→ smart_nframes(ele, total_frames, video_fps)  ← KEY: decides HOW MANY frames
        │     └─→ torch.linspace(start, end, nframes)          ← KEY: decides WHICH frames
        └─→ resize + token budget enforcement
In [ ]:
# ============================================================================
# SOURCE CODE WALKTHROUGH: Frame Sampling in qwen-vl-utils
# From: qwen-vl-utils/src/qwen_vl_utils/vision_process.py
# https://github.com/QwenLM/Qwen3-VL/tree/main/qwen-vl-utils
# ============================================================================

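import math

# --- Global constants that control frame sampling ---
# These are the defaults set at the top of vision_process.py:

FPS = 2.0                # Default sampling rate: 2 frames per second
FRAME_FACTOR = 2         # Frames must be a multiple of 2 (for 3D conv temporal depth)
FPS_MIN_FRAMES = 4       # Minimum frames to sample from any video
FPS_MAX_FRAMES = 768     # Maximum frames to sample
SPATIAL_MERGE_SIZE = 2   # The 2×2 spatial merge factor


# --- Rounding helpers used by smart_nframes() below ---

def round_by_factor(number, factor):
    """Return the multiple of `factor` closest to `number`."""
    return round(number / factor) * factor

def ceil_by_factor(number, factor):
    """Return the smallest multiple of `factor` that is >= `number`."""
    return math.ceil(number / factor) * factor

def floor_by_factor(number, factor):
    """Return the largest multiple of `factor` that is <= `number`."""
    return math.floor(number / factor) * factor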


# --- Step 1: smart_nframes() — decides HOW MANY frames to sample ---
# This is the core function that computes the target number of frames.

def smart_nframes(ele, total_frames, video_fps):
    """Calculate the number of frames for video used for model inputs.

    Args:
        ele: dict with optional 'fps', 'nframes', 'min_frames', 'max_frames' keys
        total_frames: original total frames in the video
        video_fps: original fps of the video
    Returns:
        int: number of frames to extract (always a multiple of FRAME_FACTOR=2)
    """
    assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"

    if "nframes" in ele:
        # User explicitly set number of frames
        nframes = round_by_factor(ele["nframes"], FRAME_FACTOR)
    else:
        # ⭐ DEFAULT PATH: Sample based on FPS ratio
        fps = ele.get("fps", FPS)                              # default: 2.0 fps
        min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)   # default: 4
        max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)

        #  ⭐ KEY CALCULATION: desired frames = total_frames / original_fps × target_fps
        #  e.g. 60s video at 30fps: 1800 frames / 30 × 2.0 = 120 frames
        nframes = total_frames / video_fps * fps

        nframes = min(min(max(nframes, min_frames), max_frames), total_frames)
        nframes = floor_by_factor(nframes, FRAME_FACTOR)       # round down to multiple of 2

    return int(nframes)


# --- Step 2: The video reader selects WHICH frames via linspace ---
# All three backends (decord, torchvision, torchcodec) use the same pattern:

def _read_video_decord_simplified(ele):
    """Simplified version of _read_video_decord showing the frame selection logic."""
    import decord
    import torch
    # Helper from the same module; assumes a qwen-vl-utils release that ships it
    from qwen_vl_utils.vision_process import calculate_video_frame_range

    vr = decord.VideoReader(ele["video"])
    total_frames, video_fps = len(vr), vr.get_avg_fps()

    # Get valid frame range (handles video_start/video_end clipping)
    start_frame, end_frame, total_frames = calculate_video_frame_range(ele, total_frames, video_fps)

    # Decide how many frames
    nframes = smart_nframes(ele, total_frames=total_frames, video_fps=video_fps)

    # ⭐ KEY: Uniformly space `nframes` indices across the valid range
    # This is where the actual frame SELECTION happens
    idx = torch.linspace(start_frame, end_frame, nframes).round().long().tolist()

    # Read only the selected frames and convert them to a (T, C, H, W) tensor
    video = torch.tensor(vr.get_batch(idx).asnumpy()).permute(0, 3, 1, 2)

    # ⭐ Calculate the effective sample FPS (used later for temporal IDs)
    sample_fps = nframes / max(total_frames, 1e-6) * video_fps

    # Return video tensor + metadata (including which frames were selected)
    video_metadata = dict(
        fps=video_fps,
        frames_indices=idx,              # ← the actual frame indices selected
        total_num_frames=total_frames,
        video_backend="decord",
    )
    return video, video_metadata, sample_fps

print("✅ Source code walkthrough loaded — see comments above for how frame sampling works")
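
To see the clamping in the default FPS path with real numbers, the following self-contained sketch re-implements just that path and the index selection. It mirrors the logic walked through above but is not the library code: `n_sampled_frames` and `sampled_indices` are hypothetical names, NumPy stands in for `torch.linspace`, and the valid range is assumed to be the whole video (indices 0 to `total_frames - 1`, with no start/end clipping).

```python
import math

import numpy as np

FPS, FRAME_FACTOR, FPS_MIN_FRAMES, FPS_MAX_FRAMES = 2.0, 2, 4, 768


def ceil_by_factor(x, factor):
    return math.ceil(x / factor) * factor


def floor_by_factor(x, factor):
    return math.floor(x / factor) * factor


def n_sampled_frames(total_frames, video_fps, fps=FPS):
    """Frame count on the default FPS path: scale by fps ratio, clamp, round down to even."""
    min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
    max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
    nframes = total_frames / video_fps * fps
    nframes = min(min(max(nframes, min_frames), max_frames), total_frames)
    return int(floor_by_factor(nframes, FRAME_FACTOR))


def sampled_indices(total_frames, nframes):
    """Evenly spaced frame indices over the whole video (assumes no clipping)."""
    return np.linspace(0, total_frames - 1, nframes).round().astype(int).tolist()


for label, total_frames in [("1 s clip", 30), ("60 s clip", 1800), ("10 min clip", 18000)]:
    n = n_sampled_frames(total_frames, video_fps=30)
    print(f"{label:>11}: {total_frames:>5} source frames -> {n} sampled")

print(sampled_indices(30, n_sampled_frames(30, video_fps=30)))
```

The three clips exercise the three regimes: the short clip is raised to the 4-frame minimum, the one-minute clip follows the 2 fps rate directly, and the ten-minute clip is capped at 768 frames, so its effective sampling rate drops below 2 fps.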
In [ ]:
# ============================================================================
# SOURCE CODE WALKTHROUGH: Token Budget & Resolution in fetch_video()
# From: qwen-vl-utils/src/qwen_vl_utils/vision_process.py
# ============================================================================

# After frames are selected, fetch_video() enforces the token budget by
# adjusting the spatial resolution of each frame.

def fetch_video_simplified(ele, video, image_patch_size=14):
    """Simplified fetch_video showing the token budget enforcement logic.

    `video` is the (T, C, H, W) tensor returned by a backend reader
    (decord/torchvision/torchcodec); the reading step is omitted here.
    Returns the frames plus the per-frame pixel bounds that resizing would use.
    """

    image_factor = image_patch_size * SPATIAL_MERGE_SIZE      # 14 * 2 = 28 (or 16 * 2 = 32 for Qwen3-VL)
    VIDEO_FRAME_MIN_PIXELS = 128 * image_factor * image_factor  # Min tokens × area per token
    VIDEO_FRAME_MAX_PIXELS = 768 * image_factor * image_factor  # Max tokens × area per token

    # 1. Frames have already been read by the backend and passed in as `video`
    nframes = video.shape[0]  # number of sampled frames
    height, width = video.shape[2], video.shape[3]

    # 2. Calculate resolution constraints based on token budget
    min_pixels = ele.get("min_pixels", VIDEO_FRAME_MIN_PIXELS)

    # ⭐ MODEL_SEQ_LEN controls the overall context budget (default 128K)
    MODEL_SEQ_LEN = 128000
    total_pixels = ele.get("total_pixels", MODEL_SEQ_LEN * image_factor * image_factor * 0.9)

    # ⭐ KEY: max_pixels per frame shrinks as nframes increases
    # This is the FPS vs resolution trade-off in action!
    FRAME_FACTOR = 2
    max_pixels = max(
        min(VIDEO_FRAME_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR),
        int(min_pixels * 1.05)
    )

    # 3. Resize frames to fit within the computed pixel budget (omitted in this simplified version)
    # resized_height, resized_width = smart_resize(height, width, min_pixels=min_pixels, max_pixels=max_pixels)
    # video = resize(video, [resized_height, resized_width])

    return video, {"min_pixels": min_pixels, "max_pixels": max_pixels}


# Let's see the actual numbers for different scenarios:
print("Token budget examples (image_patch_size=14, merge=2, factor=28):")
print("=" * 70)

image_factor = 14 * 2  # Qwen2.5-VL
# Frame counts above 300 are included because only there does the per-frame cap start to shrink
for nframes in [10, 30, 60, 120, 240, 480, 768]:
    MODEL_SEQ_LEN = 128000
    total_pixels = MODEL_SEQ_LEN * image_factor * image_factor * 0.9
    VIDEO_FRAME_MAX_PIXELS = 768 * image_factor * image_factor
    VIDEO_FRAME_MIN_PIXELS = 128 * image_factor * image_factor
    # Cast to int so the printed values stay integers once the budget term becomes the limit
    max_pixels_per_frame = int(max(
        min(VIDEO_FRAME_MAX_PIXELS, total_pixels / nframes * 2),
        int(VIDEO_FRAME_MIN_PIXELS * 1.05)
    ))
    # Approximate resolution (square)
    approx_side = int(max_pixels_per_frame ** 0.5)
    # Tokens per frame = pixels / (patch_size * merge)^2
    tokens_per_frame = max_pixels_per_frame // (image_factor * image_factor)

    print(f"  {nframes:>3} frames → max {max_pixels_per_frame:>10,} px/frame "
          f"(~{approx_side}×{approx_side}) → ~{tokens_per_frame} tokens/frame "
          f"→ {nframes * tokens_per_frame // 2:,} total tokens (after 3D conv)")

print()
print("Token budget examples (image_patch_size=16, merge=2, factor=32) — Qwen3-VL:")
print("=" * 70)

image_factor = 16 * 2  # Qwen3-VL
# Frame counts above 300 are included because only there does the per-frame cap start to shrink
for nframes in [10, 30, 60, 120, 240, 480, 768]:
    total_pixels = MODEL_SEQ_LEN * image_factor * image_factor * 0.9
    VIDEO_FRAME_MAX_PIXELS = 768 * image_factor * image_factor
    VIDEO_FRAME_MIN_PIXELS = 128 * image_factor * image_factor
    # Cast to int so the printed values stay integers once the budget term becomes the limit
    max_pixels_per_frame = int(max(
        min(VIDEO_FRAME_MAX_PIXELS, total_pixels / nframes * 2),
        int(VIDEO_FRAME_MIN_PIXELS * 1.05)
    ))
    approx_side = int(max_pixels_per_frame ** 0.5)
    tokens_per_frame = max_pixels_per_frame // (image_factor * image_factor)

    print(f"  {nframes:>3} frames → max {max_pixels_per_frame:>10,} px/frame "
          f"(~{approx_side}×{approx_side}) → ~{tokens_per_frame} tokens/frame "
          f"→ {nframes * tokens_per_frame // 2:,} total tokens (after 3D conv)")

4. From Frames to Patches: The Vision Encoder Pipeline ¶

Once frames are sampled, they pass through the Vision Transformer (ViT) encoder. Here's the complete pipeline:

4.1 Patch Extraction¶

Each frame is divided into non-overlapping patches:

  • Qwen2-VL / Qwen2.5-VL: patch_size = 14 → a 224×224 image yields 16×16 = 256 patches
  • Qwen3-VL: patch_size = 16 → a 224×224 image yields 14×14 = 196 patches
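
The patch counts above follow from a plain reshape. The sketch below splits one frame into flattened, non-overlapping patches with NumPy; `patchify` is a hypothetical helper written for illustration, not a function from the Qwen code base, and it assumes the frame size is an exact multiple of the patch size.

```python
import numpy as np


def patchify(frame, patch_size):
    """Split an (H, W, C) frame into non-overlapping flattened patches, row-major."""
    h, w, c = frame.shape
    assert h % patch_size == 0 and w % patch_size == 0, "frame must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    patches = frame.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)            # (gh, gw, p, p, C)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


frame = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
p14 = patchify(frame, 14)   # Qwen2-VL / Qwen2.5-VL patch size
p16 = patchify(frame, 16)   # Qwen3-VL patch size
print(p14.shape, p16.shape)  # (256, 588) (196, 768)
```

The larger patch size yields fewer tokens per frame, but each patch carries more raw pixels (768 values instead of 588).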

4.2 3D Convolution for Video (Temporal Merging)¶

For video inputs, Qwen2-VL introduced 3D convolutions with temporal depth 2. Instead of processing each frame independently:

Frame t   : [patch_1, patch_2, ..., patch_N]    ─┐
                                                   ├──→ 3D Conv ──→ [tube_1, tube_2, ..., tube_N]
Frame t+1 : [patch_1, patch_2, ..., patch_N]    ─┘

This creates 3D tubes that span 2 frames temporally, allowing the model to capture local motion between consecutive frames while halving the number of temporal positions. For consistency, a single image is treated as two identical frames.
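
A 3D convolution whose kernel and stride both equal (2, patch, patch) is equivalent to a linear projection of each flattened two-frame tube, so the grouping can be sketched with a reshape followed by a matrix product. The code below is an illustration under assumptions, not the model's implementation: `tubelets` is a hypothetical helper, the projection width of 32 is arbitrary, and odd-length inputs are padded by repeating the last frame.

```python
import numpy as np


def tubelets(video, patch_size=14, temporal_patch=2):
    """Group a (T, H, W, C) clip into flattened tubes spanning `temporal_patch` frames."""
    t, h, w, c = video.shape
    if t % temporal_patch:                         # pad by repeating the last frame (assumption)
        pad = temporal_patch - t % temporal_patch
        video = np.concatenate([video, np.repeat(video[-1:], pad, axis=0)])
        t += pad
    gt, gh, gw = t // temporal_patch, h // patch_size, w // patch_size
    x = video.reshape(gt, temporal_patch, gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)           # (gt, gh, gw, tp, p, p, C)
    return x.reshape(gt, gh * gw, temporal_patch * patch_size * patch_size * c)


rng = np.random.default_rng(0)
clip = rng.random((4, 224, 224, 3), dtype=np.float32)
tubes = tubelets(clip)                             # 4 frames -> 2 temporal slots
proj = rng.standard_normal((tubes.shape[-1], 32)).astype(np.float32)  # toy projection weights
embedded = tubes @ proj                            # linear map per tube, as a strided 3D conv would apply

image_tubes = tubelets(clip[:1])                   # a single image becomes one tube of two identical frames
print(tubes.shape, embedded.shape, image_tubes.shape)
```

Each tube holds the same spatial patch from two consecutive frames, which is why the sequence length matches that of a single frame while covering twice the time span.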

4.3 Spatial Token Merging (2×2 Compression)¶

After the ViT processes all patches, an MLP layer compresses every 2×2 spatial block of tokens into a single token:

Before merging: 16 × 16 = 256 tokens per frame
After merging:   8 ×  8 =  64 tokens per frame

Combined with the 3D temporal merging, the total compression from raw frames to visual tokens is substantial:

$$\text{Compression} = \underbrace{2\times}_{\text{3D temporal}} \times \underbrace{4\times}_{\text{2×2 spatial}} = 8\times$$
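
The 2×2 merge and the resulting 8× figure can be checked with a few lines of NumPy. This is a sketch under assumptions: tokens are taken to be in row-major grid order, the token width of 8 is a toy value, and a single random matrix stands in for the merger MLP. `merge_2x2` is a hypothetical helper name.

```python
import numpy as np


def merge_2x2(tokens, grid_h, grid_w, merge=2):
    """Concatenate each merge x merge block of (grid_h * grid_w, D) tokens into one vector."""
    d = tokens.shape[-1]
    x = tokens.reshape(grid_h // merge, merge, grid_w // merge, merge, d)
    x = x.transpose(0, 2, 1, 3, 4)                  # (gh/2, gw/2, 2, 2, D)
    return x.reshape((grid_h // merge) * (grid_w // merge), merge * merge * d)


rng = np.random.default_rng(0)
vit_out = rng.standard_normal((16 * 16, 8))         # 256 tokens for one temporal slot, toy width D=8
merged = merge_2x2(vit_out, 16, 16)                 # 64 rows, each 4x wider
w_mlp = rng.standard_normal((merged.shape[-1], 8))  # stand-in for the merger MLP
llm_tokens = merged @ w_mlp                         # 64 tokens handed to the LLM

frames_per_slot, patches_per_frame = 2, 16 * 16
compression = frames_per_slot * patches_per_frame / llm_tokens.shape[0]
print(merged.shape, llm_tokens.shape, compression)  # (64, 32) (64, 8) 8.0
```

Two frames of 256 patches each enter, and 64 tokens leave, which is the 8× reduction in the formula above.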

4.4 Dynamic Resolution¶

Unlike fixed-resolution models (including the original Qwen-VL), Qwen2-VL and its successors process images at close to their native resolution, within configurable min/max pixel bounds. The number of visual tokens scales with image size:

| Image Resolution | Patches (14×14) | After 2×2 Merge | + Delimiters |
| --- | --- | --- | --- |
| 224 × 224 | 16 × 16 = 256 | 64 | 66 |
| 448 × 448 | 32 × 32 = 1,024 | 256 | 258 |
| 896 × 896 | 64 × 64 = 4,096 | 1,024 | 1,026 |
| 1344 × 896 | 96 × 64 = 6,144 | 1,536 | 1,538 |
In [ ]:
# Visualization: Frame → Patches → 3D Tubes → Merged Tokens

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Step 1: Original frames (pair)
ax = axes[0]
ax.set_title('Step 1: Frame Pair\n(sampled from video)', fontsize=10, fontweight='bold')
# Draw two frames
for offset, label, color in [(0, 'Frame t', '#3498db'), (0.55, 'Frame t+1', '#2ecc71')]:
    rect = mpatches.FancyBboxPatch((0.05, offset), 0.9, 0.4,
                                     boxstyle='round,pad=0.02',
                                     facecolor=color, alpha=0.3, edgecolor=color, linewidth=2)
    ax.add_patch(rect)
    ax.text(0.5, offset + 0.2, label, ha='center', va='center', fontsize=10, fontweight='bold')
ax.set_xlim(0, 1)
ax.set_ylim(-0.05, 1.05)
ax.axis('off')

# Step 2: Patch extraction
ax = axes[1]
ax.set_title('Step 2: Patch Extraction\n(patch_size=14)', fontsize=10, fontweight='bold')
grid_size = 8  # simplified
for i in range(grid_size):
    for j in range(grid_size):
        color = plt.cm.Blues(0.3 + 0.5 * ((i + j) % 2))
        rect = mpatches.Rectangle((j/grid_size, i/grid_size), 1/grid_size, 1/grid_size,
                                   facecolor=color, edgecolor='white', linewidth=0.5)
        ax.add_patch(rect)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel(f'{grid_size}×{grid_size} = {grid_size**2} patches', fontsize=9)
ax.set_xticks([])
ax.set_yticks([])

# Step 3: 3D Conv (temporal merge)
ax = axes[2]
ax.set_title('Step 3: 3D Conv\n(temporal depth=2)', fontsize=10, fontweight='bold')
grid_size = 8
for i in range(grid_size):
    for j in range(grid_size):
        color = plt.cm.Purples(0.3 + 0.5 * ((i + j) % 2))
        rect = mpatches.Rectangle((j/grid_size, i/grid_size), 1/grid_size, 1/grid_size,
                                   facecolor=color, edgecolor='white', linewidth=0.5)
        ax.add_patch(rect)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel(f'{grid_size}×{grid_size} = {grid_size**2} tubes\n(2 frames → 1 temporal slot)', fontsize=9)
ax.set_xticks([])
ax.set_yticks([])

# Step 4: 2×2 spatial merge
ax = axes[3]
ax.set_title('Step 4: 2×2 Merge\n(spatial compression)', fontsize=10, fontweight='bold')
merged_size = grid_size // 2
for i in range(merged_size):
    for j in range(merged_size):
        color = plt.cm.Oranges(0.3 + 0.5 * ((i + j) % 2))
        rect = mpatches.Rectangle((j/merged_size, i/merged_size), 1/merged_size, 1/merged_size,
                                   facecolor=color, edgecolor='white', linewidth=1)
        ax.add_patch(rect)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel(f'{merged_size}×{merged_size} = {merged_size**2} tokens\n(+ 2 delimiter tokens = {merged_size**2 + 2})', fontsize=9)
ax.set_xticks([])
ax.set_yticks([])

plt.suptitle('Vision Encoder Pipeline: Frames → Visual Tokens', fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

5. Multimodal Rotary Position Embedding (M-RoPE) ¶

5.1 The Problem with 1D Positional Encodings¶

Standard LLMs use 1D Rotary Position Embedding (RoPE) — each token gets a single position ID that encodes its place in the sequence. This works for text, but for multimodal inputs:

  • Images have 2D spatial structure (height × width)
  • Videos have 3D structure (time × height × width)
  • Flattening these into 1D destroys spatial/temporal relationships
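
A short calculation shows what flattening does to neighbourhood structure. The sketch assumes a video flattened frame by frame and row by row, with an 8×8 token grid per frame; `flat_index` is a hypothetical helper.

```python
def flat_index(frame, row, col, grid_h=8, grid_w=8):
    """1D sequence position of a token when a video is flattened frame by frame, row by row."""
    return frame * grid_h * grid_w + row * grid_w + col


anchor = flat_index(0, 3, 4)
right = flat_index(0, 3, 5)        # horizontal neighbour
below = flat_index(0, 4, 4)        # vertical neighbour
next_frame = flat_index(1, 3, 4)   # same location, one frame later

print(right - anchor, below - anchor, next_frame - anchor)  # 1 8 64
```

All three tokens are one step away from the anchor in their own dimension, yet a 1D position encoding sees distances of 1, 8, and 64, and those distances change whenever the grid size changes.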

5.2 M-RoPE: Decomposing Position into Three Components¶

Qwen2-VL introduces Multimodal Rotary Position Embedding (M-RoPE), which decomposes the rotary embedding into three independent components:

$$\text{M-RoPE}(\mathbf{x}, \text{pos}) = \text{RoPE}_t(\mathbf{x}_{[0:d/3]}, t) \circ \text{RoPE}_h(\mathbf{x}_{[d/3:2d/3]}, h) \circ \text{RoPE}_w(\mathbf{x}_{[2d/3:d]}, w)$$

Where:

  • $t$ = temporal position ID
  • $h$ = height position ID
  • $w$ = width position ID
  • $d$ = embedding dimension

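The rotary dimensions are split into three sections that need not be equal, so the even $d/3$ slices in the formula are schematic. In Qwen2.5-VL the split is [16, 24, 24] and in Qwen3-VL it is [24, 20, 20], counted in rotary frequency pairs (64 pairs in total in both cases). Each section gets its own rotary encoding.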
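
The decomposition can be sketched by computing one rotation angle per frequency pair and letting a different position component drive each section. This is a minimal NumPy illustration, not the model code: it assumes the contiguous [16, 24, 24] layout, a rotary base of 10000 chosen only for illustration, and the half-split pairing convention (dimension i paired with dimension i + 64). The names `mrope_angles` and `rotate` are hypothetical.

```python
import numpy as np


def mrope_angles(t, h, w, sections=(16, 24, 24), base=10000.0):
    """Rotation angle per frequency pair for one token at position (t, h, w)."""
    n_pairs = sum(sections)                          # 64 pairs -> 128 rotary dimensions
    inv_freq = base ** (-np.arange(n_pairs) / n_pairs)
    # Position component driving each pair: t for the first section, then h, then w.
    pos = np.concatenate([np.full(size, p, dtype=np.float64)
                          for size, p in zip(sections, (t, h, w))])
    return pos * inv_freq


def rotate(x, angles):
    """Rotate x of shape (2 * n_pairs,), pairing dimension i with dimension i + n_pairs."""
    half = x.shape[-1] // 2
    x1, x2 = x[:half], x[half:]
    cos, sin = np.cos(angles), np.sin(angles)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])


text_angles = mrope_angles(5, 5, 5)    # text token: all three components share one ID
video_a = mrope_angles(0, 3, 4)        # same spatial location ...
video_b = mrope_angles(7, 3, 4)        # ... seven temporal steps later
query = rotate(np.ones(128), video_b)
```

Two properties are worth checking: a text token with identical components reproduces ordinary 1D RoPE angles, and two video tokens at the same spatial location differ only in the temporal section.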

5.3 How Position IDs Are Assigned¶

The assignment of $(t, h, w)$ position IDs depends on the modality:

For Text tokens:¶

All three components use the same ID, making M-RoPE equivalent to standard 1D-RoPE:

$$\text{Text token } i: \quad (t_i, h_i, w_i) = (i, i, i)$$

For Image tokens:¶

The temporal ID stays constant (the image exists at a single point in time), while height and width IDs reflect spatial position:

$$\text{Image token at row } r, \text{col } c: \quad (t, h, w) = (t_0, r, c)$$

For Video tokens:¶

The temporal ID increments with each frame, while spatial IDs follow the image pattern:

$$\text{Video token at frame } f, \text{row } r, \text{col } c: \quad (t, h, w) = (f, r, c)$$

Cross-modality transitions:¶

When switching between modalities, the position numbering for the new modality starts at max_previous_ID + 1, ensuring monotonic progression.

In [ ]:
# Visualization: M-RoPE position ID assignment

fig, axes = plt.subplots(3, 1, figsize=(16, 8), sharex=True)
component_names = ['Temporal (t)', 'Height (h)', 'Width (w)']
component_colors = ['#e74c3c', '#2ecc71', '#3498db']

# Build a sequence: [text_tokens] [image_tokens] [text_tokens] [video_tokens] [text_tokens]
# Text: "Describe this image" (4 tokens)
# Image: 3×3 = 9 tokens (simplified)
# Text: "Now watch this video" (5 tokens)
# Video: 3 frames × 2×2 = 12 tokens (simplified)
# Text: "What happened?" (3 tokens)

segments = []
labels = []
bg_colors = []

# === TEXT 1 ===
n_text1 = 4
t_ids = list(range(0, n_text1))
h_ids = list(range(0, n_text1))
w_ids = list(range(0, n_text1))
segments.append(('Text', t_ids, h_ids, w_ids))
pos_offset = n_text1

# === IMAGE (3×3 grid, constant temporal) ===
img_h, img_w = 3, 3
t_ids_img = [pos_offset] * (img_h * img_w)  # constant temporal
h_ids_img = []
w_ids_img = []
for r in range(img_h):
    for c in range(img_w):
        h_ids_img.append(pos_offset + r)
        w_ids_img.append(pos_offset + c)
segments.append(('Image', t_ids_img, h_ids_img, w_ids_img))
pos_offset = max(max(t_ids_img), max(h_ids_img), max(w_ids_img)) + 1

# === TEXT 2 ===
n_text2 = 5
t_ids = list(range(pos_offset, pos_offset + n_text2))
h_ids = list(range(pos_offset, pos_offset + n_text2))
w_ids = list(range(pos_offset, pos_offset + n_text2))
segments.append(('Text', t_ids, h_ids, w_ids))
pos_offset += n_text2

# === VIDEO (3 frames, each 2×2) ===
vid_frames, vid_h, vid_w = 3, 2, 2
t_ids_vid = []
h_ids_vid = []
w_ids_vid = []
for f in range(vid_frames):
    for r in range(vid_h):
        for c in range(vid_w):
            t_ids_vid.append(pos_offset + f)
            h_ids_vid.append(pos_offset + r)
            w_ids_vid.append(pos_offset + c)
segments.append(('Video', t_ids_vid, h_ids_vid, w_ids_vid))
pos_offset = max(max(t_ids_vid), max(h_ids_vid), max(w_ids_vid)) + 1

# === TEXT 3 ===
n_text3 = 3
t_ids = list(range(pos_offset, pos_offset + n_text3))
h_ids = list(range(pos_offset, pos_offset + n_text3))
w_ids = list(range(pos_offset, pos_offset + n_text3))
segments.append(('Text', t_ids, h_ids, w_ids))

# Plot each component
seg_colors_map = {'Text': '#f0f0f0', 'Image': '#d5e8d4', 'Video': '#dae8fc'}
component_idx = {
    'Temporal (t)': lambda s: s[1],
    'Height (h)': lambda s: s[2],
    'Width (w)': lambda s: s[3],
}

for ax, comp_name, comp_color in zip(axes, component_names, component_colors):
    pos = 0
    for seg in segments:
        seg_type = seg[0]
        ids = component_idx[comp_name](seg)
        n = len(ids)

        # Background for segment
        ax.axvspan(pos - 0.4, pos + n - 0.6, color=seg_colors_map[seg_type], alpha=0.5)

        # Plot IDs
        ax.bar(range(pos, pos + n), ids, color=comp_color, alpha=0.8, edgecolor='white', width=0.8)

        # Label segment
        ax.text(pos + n/2 - 0.5, max(ids) + 1.5, seg_type, ha='center', fontsize=8,
                fontweight='bold', fontstyle='italic', alpha=0.6)

        pos += n

    ax.set_ylabel(comp_name, fontsize=11, fontweight='bold', color=comp_color)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

axes[-1].set_xlabel('Token Position in Sequence', fontsize=11)
axes[0].set_title('M-RoPE Position ID Assignment Across Modalities', fontsize=13, fontweight='bold')

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#f0f0f0', edgecolor='gray', label='Text'),
    Patch(facecolor='#d5e8d4', edgecolor='gray', label='Image'),
    Patch(facecolor='#dae8fc', edgecolor='gray', label='Video'),
]
axes[0].legend(handles=legend_elements, loc='upper left', fontsize=9)

plt.tight_layout()
plt.show()

5.4 Key Insight: Position ID Compression¶

M-RoPE provides a crucial benefit for length extrapolation. Consider a video with 100 frames, each producing an 8×8 grid of tokens:

| Encoding | Max Position ID | Tokens |
| --- | --- | --- |
| 1D-RoPE | 6,400 | 100 × 64 = 6,400 |
| M-RoPE | max(100, 8, 8) = 100 | 100 × 64 = 6,400 |

With M-RoPE, the maximum position ID is dramatically smaller because spatial positions are reused across frames and temporal positions are reused across spatial locations. This means:

  • The model can handle much longer videos at inference than it saw during training
  • Despite training with max 16K tokens per video, Qwen2-VL-72B maintains strong performance at 80K inference tokens (see ablation in the Qwen2-VL paper, Figure 5)
In [ ]:
# Comparison: 1D-RoPE vs M-RoPE max position IDs

num_frames_range = np.arange(10, 510, 10)
spatial_grid = 8  # 8x8 merged tokens per frame

# 1D-RoPE: position IDs are sequential
max_1d_ids = num_frames_range * spatial_grid * spatial_grid

# M-RoPE: max ID is max(temporal, height, width)
max_mrope_ids = np.maximum(num_frames_range, spatial_grid)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Left: absolute comparison
ax1.plot(num_frames_range, max_1d_ids, 'r-', linewidth=2, label='1D-RoPE (max pos ID)')
ax1.plot(num_frames_range, max_mrope_ids, 'b-', linewidth=2, label='M-RoPE (max pos ID)')
ax1.fill_between(num_frames_range, max_mrope_ids, max_1d_ids, alpha=0.1, color='red')
ax1.set_xlabel('Number of Video Frames', fontsize=11)
ax1.set_ylabel('Maximum Position ID', fontsize=11)
ax1.set_title('Position ID Growth: 1D-RoPE vs M-RoPE', fontsize=12, fontweight='bold')
ax1.legend(fontsize=10)
ax1.set_yscale('log')
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.grid(True, alpha=0.3)

# Right: compression ratio
ratio = max_1d_ids / max_mrope_ids
ax2.plot(num_frames_range, ratio, 'g-', linewidth=2)
ax2.fill_between(num_frames_range, 1, ratio, alpha=0.2, color='green')
ax2.set_xlabel('Number of Video Frames', fontsize=11)
ax2.set_ylabel('Compression Ratio (1D / M-RoPE)', fontsize=11)
ax2.set_title('M-RoPE Position ID Compression Ratio', fontsize=12, fontweight='bold')
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.grid(True, alpha=0.3)
ax2.annotate(f'At 500 frames: {ratio[-1]:.0f}× compression',
             xy=(500, ratio[-1]), xytext=(350, ratio[-1]*0.7),
             arrowprops=dict(arrowstyle='->', color='green'),
             fontsize=10, fontweight='bold', color='green')

plt.tight_layout()
plt.show()

6. Temporal Embeddings: How Time Patches Are Created ¶

6.1 From Absolute Time to Temporal Position IDs¶

A key innovation in Qwen2.5-VL is absolute time encoding — the temporal component of M-RoPE is aligned with real wall-clock time rather than just frame indices.

How it works:¶

  1. Dynamic FPS Sampling: The video is sampled at a variable frame rate
  2. Time Mapping: Each sampled frame's real timestamp is converted to a temporal position ID
  3. ID Spacing: The gap between consecutive temporal IDs reflects the actual time gap between frames
Video sampled at 2 fps:
  Frame 0 at t=0.0s  → temporal_id = 0
  Frame 1 at t=0.5s  → temporal_id = 1
  Frame 2 at t=1.0s  → temporal_id = 2
  Frame 3 at t=1.5s  → temporal_id = 3
  ...

Same video sampled at 1 fps:
  Frame 0 at t=0.0s  → temporal_id = 0
  Frame 1 at t=1.0s  → temporal_id = 2  ← gap of 2 reflects 1s interval
  Frame 2 at t=2.0s  → temporal_id = 4
  Frame 3 at t=3.0s  → temporal_id = 6
  ...

This means the model can learn the pace of time from the intervals between temporal IDs, enabling:

  • Understanding that events happen "fast" or "slow"
  • Accurate temporal grounding ("at the 45 second mark...")
  • Consistent behavior across different sampling rates
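
The mapping in the example above can be sketched in a few lines. This is a minimal illustration, not the Qwen2.5-VL implementation: the helper name `temporal_ids_from_timestamps` and the `ids_per_second` parameter are assumptions chosen to reproduce the 2-IDs-per-second spacing of the example.

```python
import numpy as np

def temporal_ids_from_timestamps(timestamps, ids_per_second=2.0):
    """Map frame timestamps (seconds) to integer temporal position IDs.

    Hypothetical helper: IDs are proportional to wall-clock time, so the
    gap between two IDs reflects the real time gap between the frames.
    """
    ts = np.asarray(timestamps, dtype=float)
    # Small epsilon guards against floating-point values like 1.9999999
    return np.floor(ts * ids_per_second + 1e-9).astype(int)

ids_2fps = temporal_ids_from_timestamps([0.0, 0.5, 1.0, 1.5])
ids_1fps = temporal_ids_from_timestamps([0.0, 1.0, 2.0, 3.0])
print(ids_2fps)  # consecutive IDs
print(ids_1fps)  # IDs two apart: same clock, sparser sampling
```

Because the IDs depend only on the timestamps, a frame at t=1.0s receives the same ID whichever sampling rate produced it.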

6.2 The Complete Temporal Patch Pipeline¶

Here's how a video becomes temporally-aware tokens:

Raw Video (variable FPS)
     │
     ▼
Dynamic FPS Sampling → N frames with timestamps [t₁, t₂, ..., tₙ]
     │
     ▼
Patch Extraction (14×14 or 16×16 per frame)
     │
     ▼
3D Convolution (temporal depth=2): pairs of frames → 3D tubes
     │                                  N/2 temporal positions remain
     ▼
ViT Processing with 2D-RoPE (spatial encoding within each frame)
     │
     ▼
2×2 Spatial Merging (MLP) → compressed visual tokens
     │
     ▼
M-RoPE Assignment:
  • temporal_id[frame_f] = f(timestamp_f)  ← aligned with real time
  • height_id[row_r]     = base + r
  • width_id[col_c]      = base + c
     │
     ▼
Interleaved with text tokens → LLM backbone
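
To make the size changes in the diagram concrete, the sketch below traces the grid dimensions and token counts stage by stage. The function name `trace_video_shapes` and its defaults (14-pixel patches, frame pairs, 2×2 merging) are assumptions for illustration; it also assumes the frame height and width are multiples of 28.

```python
def trace_video_shapes(n_frames, height, width,
                       patch_size=14, temporal_patch_size=2, merge_size=2):
    """Follow the tensor sizes through the pipeline stages sketched above.

    Illustrative only: the defaults are assumed values, not read from a
    model config.
    """
    grid_t = n_frames // temporal_patch_size      # 3D conv merges frame pairs
    grid_h = height // patch_size                 # patch rows per frame
    grid_w = width // patch_size                  # patch columns per frame
    vit_tokens = grid_t * grid_h * grid_w         # tokens processed by the ViT
    llm_tokens = vit_tokens // (merge_size ** 2)  # after 2x2 spatial merging
    return {
        "grid_thw": (grid_t, grid_h, grid_w),
        "vit_tokens": vit_tokens,
        "llm_tokens": llm_tokens,
    }

# 10 seconds sampled at 2 fps, 448x448 frames
shapes = trace_video_shapes(n_frames=20, height=448, width=448)
print(shapes)
```

Under these assumptions, 20 frames shrink to 10 temporal positions, and the 2×2 merge cuts the token count passed to the LLM by a factor of four.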

6.3 Qwen3-VL: Interleaved M-RoPE and Text-Timestamp Alignment¶

Qwen3-VL introduces two further refinements:

  1. Interleaved-MRoPE: Instead of partitioning the embedding dimensions into contiguous blocks for (t, h, w), the three components are interleaved across the full frequency spectrum. This gives all three dimensions access to both high and low frequencies, improving long-horizon video reasoning.

  2. Text-Timestamp Alignment: Moves beyond T-RoPE (temporal RoPE) to precise timestamp-grounded event localization. The model can directly associate generated text tokens with specific video timestamps, enabling more accurate temporal grounding in responses.
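
The difference between the contiguous and interleaved layouts can be illustrated by labelling each rotary frequency index with the axis that owns it. This is a sketch under stated assumptions: the round-robin pattern (h on indices 1, 4, 7, ..., w on 2, 5, 8, ..., t on the rest) and the function names are illustrative, and the released implementation may order things differently.

```python
def contiguous_layout(sections):
    """Assign frequency indices to axes in blocks: TTT...HHH...WWW."""
    layout = []
    for axis, count in zip("thw", sections):
        layout.extend([axis] * count)
    return layout

def interleaved_layout(sections):
    """Assign frequency indices round-robin (T H W T H W ...).

    Assumed pattern for illustration: h and w each take every third
    index until their share is used up; t keeps everything else.
    """
    layout = ["t"] * sum(sections)
    for offset, axis in ((1, "h"), (2, "w")):
        count = sections[offset]
        for idx in range(offset, 3 * count, 3):
            layout[idx] = axis
    return layout

def mean_index(layout, axis):
    """Average frequency index owned by one axis."""
    idxs = [i for i, a in enumerate(layout) if a == axis]
    return sum(idxs) / len(idxs)

sections = [24, 20, 20]  # (t, h, w) split used for Qwen3-VL in this notebook
for name, layout in [("contiguous", contiguous_layout(sections)),
                     ("interleaved", interleaved_layout(sections))]:
    means = {a: round(mean_index(layout, a), 1) for a in "thw"}
    print(name, "".join(layout[:9]), means)
```

In the contiguous layout the temporal axis only owns the lowest indices (the highest rotation frequencies in the usual RoPE ordering), while in the interleaved layout every axis has a mean index near the middle of the range, which is the "access to both high and low frequencies" described above.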

In [ ]:
# Visualization: Absolute time encoding in M-RoPE temporal IDs

fig, axes = plt.subplots(2, 1, figsize=(14, 7))

# Scenario: 10-second video with different sampling rates
video_duration = 10  # seconds

# Top plot: Fixed 2fps sampling
ax = axes[0]
ax.set_title('Fixed 2 FPS Sampling — Uniform Temporal ID Spacing', fontsize=11, fontweight='bold')

fps_fixed = 2
n_frames = int(video_duration * fps_fixed)
timestamps = np.arange(n_frames) / fps_fixed
# Simplification: each sampled frame gets its own sequential ID
# (the pairwise merge done by the 3D conv is not modeled in this plot)
temporal_ids = np.arange(n_frames)

ax.stem(timestamps, temporal_ids, linefmt='b-', markerfmt='bo', basefmt='gray', label='Temporal IDs')
for i, (t, tid) in enumerate(zip(timestamps, temporal_ids)):
    ax.annotate(f't_id={tid}', (t, tid), textcoords="offset points", xytext=(5, 5), fontsize=7)
ax.set_ylabel('Temporal Position ID', fontsize=10)
ax.set_xlabel('Wall-Clock Time (seconds)', fontsize=10)
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Bottom plot: Dynamic FPS with absolute time alignment
ax = axes[1]
ax.set_title('Dynamic FPS with Absolute Time Alignment — Variable ID Spacing', fontsize=11, fontweight='bold')

# Simulate dynamic sampling: higher FPS during action (2-5s), lower otherwise
timestamps_dynamic = np.array([0.0, 1.0, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5, 3.75, 4.0, 4.25, 4.5, 4.75, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
# Temporal IDs proportional to time (at some base rate)
base_rate = 2  # IDs per second
temporal_ids_dynamic = (timestamps_dynamic * base_rate).astype(int)

# Color code by sampling density
colors_dynamic = ['#3498db' if t < 2 or t > 5 else '#e74c3c' for t in timestamps_dynamic]

markerline, stemlines, baseline = ax.stem(timestamps_dynamic, temporal_ids_dynamic,
                                           linefmt='gray', markerfmt='o', basefmt='gray')
markerline.set_color('#333')
# Color individual stem lines by sampling density
stemlines.set_color(colors_dynamic)

# Highlight regions
ax.axvspan(2, 5, alpha=0.1, color='red', label='High-action region (4 fps)')
ax.axvspan(0, 2, alpha=0.1, color='blue', label='Static region (1 fps)')
ax.axvspan(5, 10, alpha=0.1, color='blue')

ax.set_ylabel('Temporal Position ID', fontsize=10)
ax.set_xlabel('Wall-Clock Time (seconds)', fontsize=10)
ax.legend(fontsize=9, loc='upper left')
ax.grid(True, alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Annotate the gap difference
ax.annotate('ID gap = 2\n(1 fps)', xy=(1.5, 3), fontsize=9, color='blue',
            ha='center', fontweight='bold')
ax.annotate('ID gap = 0-1\n(4 fps)', xy=(3.5, 7), fontsize=9, color='red',
            ha='center', fontweight='bold')

plt.tight_layout()
plt.show()
In [ ]:
# Visualization: M-RoPE dimension allocation

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

models = [
    ('Qwen2-VL\nmrope_section', [16, 24, 24], 'Standard M-RoPE'),
    ('Qwen2.5-VL\nmrope_section', [16, 24, 24], 'Standard M-RoPE'),
    ('Qwen3-VL\nmrope_section', [24, 20, 20], 'Interleaved M-RoPE'),
]

colors_rope = ['#e74c3c', '#2ecc71', '#3498db']  # temporal, height, width
labels_rope = ['Temporal', 'Height', 'Width']

for ax, (model_name, sections, style) in zip(axes, models):
    total = sum(sections)
    wedges, texts, autotexts = ax.pie(
        sections, labels=[f'{l}\n({s}/{total})' for l, s in zip(labels_rope, sections)],
        colors=colors_rope, autopct='%1.0f%%',
        startangle=90, pctdistance=0.65,
        textprops={'fontsize': 9}
    )
    for autotext in autotexts:
        autotext.set_fontweight('bold')
        autotext.set_fontsize(10)
    ax.set_title(f'{model_name}\n({style})', fontsize=10, fontweight='bold')

plt.suptitle('M-RoPE Dimension Allocation Across Qwen-VL Generations',
             fontsize=13, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()

6.4 Where Temporal Embeddings Are Created in Code¶

The M-RoPE position IDs are computed in rope2d.py. There are three versions of the function, one per model generation:

| Function | Model | Key Difference |
| --- | --- | --- |
| get_rope_index_2() | Qwen2-VL | Frame-index-based temporal IDs |
| get_rope_index_25() | Qwen2.5-VL | Absolute-time-based temporal IDs (second_per_grid_ts) |
| get_rope_index_3() | Qwen3-VL | Timestamp text tokens instead of absolute-time position IDs; each frame handled as its own grid with t=1 |

All three functions share the same signature and return position IDs of shape (3, batch_size, seq_len): one row each for the temporal, height, and width axes.

Below is an annotated walkthrough of the actual code.

In [ ]:
# ============================================================================
# SOURCE CODE WALKTHROUGH: M-RoPE Position ID Generation
# From: qwen-vl-finetune/qwenvl/data/rope2d.py
# https://github.com/QwenLM/Qwen3-VL/tree/main/qwen-vl-finetune/qwenvl/data/rope2d.py
# ============================================================================

import numpy as np

# --- Qwen2.5-VL: get_rope_index_25() —— Absolute Time Encoding ---
#
# The docstring from the actual source explains the scheme clearly:
#
#   For pure text: temporal, height, width position IDs are all identical (= 1D RoPE).
#     input_ids:  [T  T  T  T  T]
#     temporal:   [0, 1, 2, 3, 4]
#     height:     [0, 1, 2, 3, 4]
#     width:      [0, 1, 2, 3, 4]
#
#   For video (3 temporal patches, 2 height, 2 width) followed by text:
#     input_ids:  [V  V  V  V  V  V  V  V  V  V  V  V  T  T  T  T  T]
#
#     ⭐ The temporal position IDs use ABSOLUTE TIME via `second_per_grid_ts`:
#     temporal:   [0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
#                  ^patch 0       ^patch 1          ^patch 2
#     height:     [0, 0, 1, 1,  0,  0,  1,  1,   0,   0,   1,   1]
#     width:      [0, 1, 0, 1,  0,  1,  0,  1,   0,   1,   0,   1]
#
#   The interval=50 comes from: tokens_per_second × temporal_patch_size / fps
#   e.g., 25 tokens/s × 2 frames/patch / 1 fps = 50
#
#   Text tokens resume from the max visual position ID + 1:
#     temporal:   [..., 100, 100, 100, 100,  101, 102, 103, 104, 105]
#     height:     [...,   0,   0,   1,   1,  101, 102, 103, 104, 105]
#     width:      [...,   0,   1,   0,   1,  101, 102, 103, 104, 105]

# Here is the KEY inner loop from get_rope_index_25 (simplified, using numpy):
# Original source uses torch — we replicate the logic with numpy for illustration.

def compute_video_position_ids_25(t, h, w, second_per_grid_ts, spatial_merge_size=2):
    """
    Compute M-RoPE position IDs for a single video in Qwen2.5-VL style.

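    Args:
        t: temporal grid size (number of temporal patches, i.e. after the 3D conv)
        h, w: patch-grid height and width BEFORE the 2×2 spatial merge
        second_per_grid_ts: temporal ID increment per grid step; the caller
            passes seconds-per-grid-step already multiplied by tokens_per_second
        spatial_merge_size: spatial merge factor (default=2)

    Returns:
        Array of shape (3, t * (h // spatial_merge_size) * (w // spatial_merge_size))
        holding the (temporal, height, width) position IDs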
    """
    llm_grid_t = t
    llm_grid_h = h // spatial_merge_size
    llm_grid_w = w // spatial_merge_size

    # ⭐ KEY: temporal IDs are spaced by real-time intervals
    # In the source:  t_index = torch.arange(llm_grid_t).view(-1,1).expand(-1, llm_grid_h*llm_grid_w).flatten()
    t_index = np.arange(llm_grid_t).reshape(-1, 1) * np.ones((1, llm_grid_h * llm_grid_w))
    t_index = (t_index.flatten() * second_per_grid_ts).astype(int)  # Scale by actual time interval!

    # Spatial IDs just count rows and columns (reset for each frame)
    # In the source:  h_index = torch.arange(llm_grid_h).view(1,-1,1).expand(llm_grid_t,-1,llm_grid_w).flatten()
    h_index = np.tile(np.arange(llm_grid_h).reshape(1, -1, 1) * np.ones((1, 1, llm_grid_w)),
                      (llm_grid_t, 1, 1)).flatten().astype(int)
    w_index = np.tile(np.ones((1, llm_grid_h, 1)) * np.arange(llm_grid_w).reshape(1, 1, -1),
                      (llm_grid_t, 1, 1)).flatten().astype(int)

    return np.stack([t_index, h_index, w_index])  # shape: (3, llm_grid_t * llm_grid_h * llm_grid_w)


# Demo: Qwen2.5-VL with a 10-second video at 2fps
fps = 2.0
temporal_patch_size = 2  # 3D conv merges pairs
video_duration = 10      # seconds
total_temporal_patches = int(video_duration * fps / temporal_patch_size)  # 10 * 2 / 2 = 10

# second_per_grid_ts = temporal_patch_size / sample_fps  (seconds covered by one grid step)
# The temporal ID interval is tokens_per_second * second_per_grid_ts,
# which is why the call below multiplies by 25.
second_per_grid_ts = 1.0 / fps * temporal_patch_size  # = 1.0 second per grid step

h_grid, w_grid = 4, 4  # Small example: 4×4 spatial grid after merge
pos_ids = compute_video_position_ids_25(
    t=total_temporal_patches, h=h_grid * 2, w=w_grid * 2,  # pre-merge dimensions
    second_per_grid_ts=second_per_grid_ts * 25  # scaled by tokens_per_second=25
)

print("Qwen2.5-VL M-RoPE Position IDs (10s video, 2fps, 4×4 spatial grid)")
print("=" * 65)
print(f"  Video: {video_duration}s at {fps}fps → {total_temporal_patches} temporal patches")
print(f"  Spatial: {h_grid}×{w_grid} = {h_grid*w_grid} tokens per temporal patch")
print(f"  Total visual tokens: {total_temporal_patches * h_grid * w_grid}")
print()
print(f"  Temporal IDs (unique): {sorted(set(pos_ids[0].tolist()))}")
print(f"  Height IDs  (unique):  {sorted(set(pos_ids[1].tolist()))}")
print(f"  Width IDs   (unique):  {sorted(set(pos_ids[2].tolist()))}")
print(f"  Max position ID:       {pos_ids.max()}")
print(f"  If using 1D-RoPE:      max ID would be {total_temporal_patches * h_grid * w_grid - 1}")
In [ ]:
# ============================================================================
# Qwen3-VL: get_rope_index_3() — Timestamp-Based Temporal Encoding
# ============================================================================
#
# KEY DIFFERENCE from Qwen2.5-VL (from the docstring):
#   "Different from the original implementation, Qwen3VL uses timestamps
#    rather than absolute time position ids."
#
# The big change: instead of encoding time in each frame's temporal position ID,
# Qwen3-VL treats EACH FRAME as an independent image (t=1) and inserts
# timestamp tokens BETWEEN frames.
#
# Here is how the actual source code transforms the video data:
#
#   ⭐ STEP 1: Explode video_grid_thw so each frame stands alone
#   video_grid_thw = torch.repeat_interleave(video_grid_thw, video_grid_thw[:, 0], dim=0)
#   video_grid_thw[:, 0] = 1   # Every frame is now a "single-frame video"
#
#   ⭐ STEP 2: For each frame, spatial IDs reset (just like an image):
#   t_index = [0, 0, 0, 0]   (all same — single frame)
#   h_index = [0, 0, 1, 1]   (row positions)
#   w_index = [0, 1, 0, 1]   (column positions)
#
#   ⭐ STEP 3: Timestamp tokens between frames advance st_idx
#   The timestamp text and the <|vision_start|> / <|vision_end|> markers around
#   each frame act as a 1D-RoPE segment (all 3 dims equal), naturally
#   separating frames in the temporal dimension.
#
# This means:
# - Frames don't need explicit temporal position IDs (spatial-only like images)
# - Temporal ordering comes from timestamp TEXT tokens between frames
# - The model learns temporal relationships from the timestamp values
# - It's more flexible: frame spacing can be irregular

def visualize_qwen3_position_ids(n_frames=3, h=2, w=2, timestamp_tokens=5):
    """
    Simulate the M-RoPE position IDs that get_rope_index_3() would produce
    for a short video with n_frames, each having h×w spatial tokens, and
    timestamp_tokens between each pair of frames.
    """
    all_temporal = []
    all_height = []
    all_width = []
    labels = []

    st_idx = 0  # Running position counter (from source: starts at 0 or continues from previous)

    for i in range(n_frames):
        # --- Frame i: treated as an independent image ---
        llm_grid_h = h
        llm_grid_w = w
        n_tokens = llm_grid_h * llm_grid_w

        # All temporal IDs are st_idx (single frame, t=1)
        t_ids = [st_idx] * n_tokens
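        # Spatial IDs: standard row/col pattern
        # (simplification in this sketch: st_idx is added only to the temporal
        #  axis, so h and w restart at 0 for every frame)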
        h_ids = []
        w_ids = []
        for r in range(llm_grid_h):
            for c in range(llm_grid_w):
                h_ids.append(r)
                w_ids.append(c)

        all_temporal.extend(t_ids)
        all_height.extend(h_ids)
        all_width.extend(w_ids)
        labels.extend([f"F{i}"] * n_tokens)

        # ⭐ KEY: st_idx jumps to max+1 across all 3 dims
        max_id = max(max(t_ids), max(h_ids), max(w_ids))
        st_idx = max_id + 1

        # --- Timestamp tokens between frames (1D-RoPE: all dims equal) ---
        if i < n_frames - 1:
            for t in range(timestamp_tokens):
                all_temporal.append(st_idx)
                all_height.append(st_idx)
                all_width.append(st_idx)
                labels.append("TS")
                st_idx += 1

    return all_temporal, all_height, all_width, labels


temporal, height, width, labels = visualize_qwen3_position_ids(
    n_frames=3, h=2, w=2, timestamp_tokens=4
)

print("Qwen3-VL M-RoPE Position IDs (3 frames, 2×2 spatial, 4 timestamp tokens)")
print("=" * 72)
print()

# Print a nice table
header = f"{'Token':>6}  {'temporal':>8}  {'height':>8}  {'width':>8}"
print(header)
print("-" * len(header))

for t, h, w, lab in zip(temporal, height, width, labels):
    # lab is "TS" for timestamp tokens or "F<n>" for tokens of frame n
    marker = "  ← timestamp" if lab == "TS" else f"  ← frame {lab[1:]}"
    print(f"{lab:>6}  {t:>8}  {h:>8}  {w:>8}{marker}")

print()
print("KEY OBSERVATIONS:")
print("  • Each frame's spatial IDs RESET (h: 0→1, w: 0→1) — just like an image")
print("  • Temporal IDs are FLAT within each frame (all same value)")
print("  • Timestamp tokens use 1D-RoPE (all 3 dims equal) — separates frames")
print("  • Frame temporal position comes from the TIMESTAMP TEXT, not from an index")
print()
# Compare max position IDs
print(f"  Max position ID used:  {max(max(temporal), max(height), max(width))}")
n_visual = sum(1 for l in labels if l != "TS")
n_ts = sum(1 for l in labels if l == "TS")
print(f"  Visual tokens: {n_visual},  Timestamp tokens: {n_ts},  Total: {len(labels)}")
In [ ]:
# Visual comparison: Qwen2.5-VL vs Qwen3-VL temporal position ID patterns

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# --- Left: Qwen2.5-VL style (block temporal IDs) ---
ax = axes[0]
n_frames_25, h_25, w_25 = 4, 2, 2
tokens_per_frame = h_25 * w_25
interval = 50  # second_per_grid_ts * tokens_per_second

temporal_25 = []
height_25 = []
width_25 = []
for f in range(n_frames_25):
    for r in range(h_25):
        for c in range(w_25):
            temporal_25.append(f * interval)
            height_25.append(r)
            width_25.append(c)

x_pos = np.arange(len(temporal_25))
ax.scatter(x_pos, temporal_25, c='#e74c3c', s=60, label='Temporal', zorder=3)
ax.scatter(x_pos, height_25, c='#2ecc71', s=40, marker='s', label='Height', zorder=3)
ax.scatter(x_pos, width_25, c='#3498db', s=40, marker='^', label='Width', zorder=3)

# Frame separators
for f in range(1, n_frames_25):
    ax.axvline(x=f * tokens_per_frame - 0.5, color='gray', linestyle='--', alpha=0.5)

ax.set_title("Qwen2.5-VL: Absolute Time Encoding", fontweight='bold', fontsize=11)
ax.set_xlabel("Token index")
ax.set_ylabel("Position ID")
ax.legend(fontsize=8, loc='upper left')
ax.set_ylim(-5, max(temporal_25) + 20)
# Label frames
for f in range(n_frames_25):
    ax.text(f * tokens_per_frame + tokens_per_frame/2 - 0.5, max(temporal_25) + 10,
            f"Frame {f}", ha='center', fontsize=8, color='gray')

# --- Right: Qwen3-VL style (per-frame reset + timestamps) ---
ax = axes[1]
temporal_3, height_3, width_3, labels_3 = visualize_qwen3_position_ids(
    n_frames=4, h=2, w=2, timestamp_tokens=3
)

x_pos = np.arange(len(temporal_3))

# Visual tokens
vis_mask = np.array([l != 'TS' for l in labels_3])
ts_mask = ~vis_mask

ax.scatter(x_pos[vis_mask], np.array(temporal_3)[vis_mask], c='#e74c3c', s=60,
           label='Temporal (visual)', zorder=3)
ax.scatter(x_pos[vis_mask], np.array(height_3)[vis_mask], c='#2ecc71', s=40,
           marker='s', label='Height (visual)', zorder=3)
ax.scatter(x_pos[vis_mask], np.array(width_3)[vis_mask], c='#3498db', s=40,
           marker='^', label='Width (visual)', zorder=3)

# Timestamp tokens (all 3 dims same — shown as diamonds)
ax.scatter(x_pos[ts_mask], np.array(temporal_3)[ts_mask], c='#9b59b6', s=50,
           marker='D', label='Timestamp (all dims)', zorder=3)

# Frame/timestamp separators
prev_label = labels_3[0]
for i, l in enumerate(labels_3[1:], 1):
    if (prev_label != 'TS' and l == 'TS') or (prev_label == 'TS' and l != 'TS'):
        ax.axvline(x=i - 0.5, color='gray', linestyle='--', alpha=0.3)
    prev_label = l

ax.set_title("Qwen3-VL: Timestamp-Based Encoding", fontweight='bold', fontsize=11)
ax.set_xlabel("Token index")
ax.set_ylabel("Position ID")
ax.legend(fontsize=7, loc='upper left')

# Label frames
frame_idx = 0
i = 0
while i < len(labels_3):
    if labels_3[i] != 'TS':
        start = i
        while i < len(labels_3) and labels_3[i] != 'TS':
            i += 1
        mid = (start + i - 1) / 2
        ax.text(mid, max(temporal_3) + 1, f"F{frame_idx}", ha='center', fontsize=8, color='gray')
        frame_idx += 1
    else:
        start = i
        while i < len(labels_3) and labels_3[i] == 'TS':
            i += 1
        mid = (start + i - 1) / 2
        ax.text(mid, max(temporal_3) + 1, "TS", ha='center', fontsize=7, color='#9b59b6')

plt.suptitle("M-RoPE Temporal Position ID Patterns: Qwen2.5-VL vs Qwen3-VL",
             fontweight='bold', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

print("\nSummary of Architectural Differences:")
print("  Qwen2.5-VL: Large temporal ID gaps between frames (interval=50)")
print("              Encodes absolute wall-clock time directly in position IDs")
print("              Spatial IDs reset each frame, temporal IDs grow with time")
print()
print("  Qwen3-VL:   Each frame is an independent image (spatial IDs reset)")
print("              Timestamp TEXT tokens between frames carry temporal info")
print("              Position IDs grow smoothly; temporal meaning is in the content")
print("              More flexible — works naturally with variable frame rates")

7. Effective Uses of VLMs for Video Understanding ¶

VLMs like Qwen3-VL have demonstrated strong performance across a variety of video understanding tasks. Here we categorize the tasks where VLMs excel:

7.1 ✅ Video Question Answering (VideoQA)¶

What: Answering natural language questions about video content.
Why VLMs excel: The text generation capability of the LLM backbone combines naturally with visual understanding.

Benchmarks: MVBench, PerceptionTest, EgoSchema, Video-MME

| Model | MVBench | PerceptionTest | Video-MME (w/o subs) |
| --- | --- | --- | --- |
| GPT-4o | - | - | 71.9 |
| Qwen2-VL-72B | 73.6 | 68.0 | 71.2 |
| Qwen2.5-VL-7B | 69.6 | 70.5 | 65.1 |

Best practices:

  • Provide clear, specific questions
  • For long videos, include timestamp hints when possible
  • Use higher sampling FPS for action-dense segments
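
As a rough sketch of the last point, the helper below builds a list of sample times that is denser inside chosen segments. The function name `sample_timestamps` and its parameters are hypothetical; in practice the segments would come from a shot detector or a motion heuristic, which is not shown here.

```python
def sample_timestamps(duration, base_fps=1.0, dense_segments=(), dense_fps=4.0):
    """Return sorted sample times (seconds), denser inside dense_segments.

    Hypothetical helper: dense_segments is a list of (start, end) pairs
    in seconds that should be sampled at dense_fps instead of base_fps.
    """
    times = set()
    # Base-rate samples over the whole clip
    for i in range(int(duration * base_fps) + 1):
        times.add(round(i / base_fps, 6))
    # Extra samples inside each action-dense segment
    for start, end in dense_segments:
        for i in range(int((end - start) * dense_fps) + 1):
            times.add(round(start + i / dense_fps, 6))
    return sorted(t for t in times if t <= duration)

timestamps = sample_timestamps(10.0, base_fps=1.0,
                               dense_segments=[(2.0, 5.0)], dense_fps=4.0)
print(len(timestamps), timestamps[:6])
```

With a 10-second clip, 1 fps as the base rate and 4 fps between 2s and 5s, this reproduces the 20 timestamps used in the dynamic-FPS plot in Section 6.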

7.2 ✅ Temporal Event Localization & Grounding¶

What: Identifying when specific events occur in a video (e.g., "At what timestamp does the speaker mention AI?").
Why VLMs excel: Qwen2.5-VL's absolute time encoding and Qwen3-VL's Text-Timestamp Alignment enable second-level precision.

Benchmarks: CharadesSTA, TempCompass, LVBench

| Model | CharadesSTA (mIoU) | TempCompass |
| --- | --- | --- |
| Qwen2.5-VL-7B | 43.6 | 71.7 |
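
For readers unfamiliar with the metric, the sketch below shows how a mean temporal IoU can be computed, assuming the mIoU column is the average intersection-over-union between predicted and ground-truth time segments (reported in percent). The segments are made-up values, not benchmark data.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)

preds = [(2.0, 8.0), (10.0, 14.0)]   # made-up predicted segments
gts = [(4.0, 10.0), (10.0, 12.0)]    # made-up ground-truth segments
print(f"mIoU = {mean_iou(preds, gts):.3f}")  # mIoU = 0.500
```

A prediction that overlaps the true event but starts two seconds early is penalized on both the intersection and the union, which is why grounding scores are sensitive to timestamp precision.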

7.3 ✅ Video Summarization & Content Extraction¶

What: Generating structured summaries, extracting key information, or creating content descriptions.
Why VLMs excel: The LLM backbone provides strong language generation, and the visual encoder captures key visual details.

Example tasks:

  • Meeting summarization from recordings
  • Extracting paper titles from conference video recordings
  • Generating captions for accessibility

7.4 ✅ Video OCR & Text-in-Video Understanding¶

What: Reading and understanding text that appears in videos (slides, signs, documents on screen).
Why VLMs excel: Qwen-VL models have particularly strong OCR capabilities that transfer to video frames.

7.5 ✅ Long Video Comprehension (1+ Hours)¶

What: Understanding content across very long videos.
Why VLMs are improving: Qwen2.5-VL supports hour-long videos, and Qwen3-VL extends to 256K native context (expandable to 1M). The M-RoPE position compression keeps position IDs manageable.

Benchmarks: MLVU (70.2 for Qwen2.5-VL-7B), LongVideoBench (54.7)

7.6 ✅ Visual Agent Tasks on Screen Recordings¶

What: Understanding and reasoning about UI interactions captured in video (screen recordings, phone recordings).
Why VLMs excel: Qwen-VL's grounding capabilities (bounding boxes, point detection) combine with video understanding for autonomous agent applications.

Benchmarks: ScreenSpot, AndroidWorld, MobileMiniWob++

In [ ]:
# Visualization: Task effectiveness spectrum

tasks = [
    ('Video QA\n(short clips)', 0.92, '✅'),
    ('Video OCR &\nText Extraction', 0.90, '✅'),
    ('UI Agent\nScreen Recording', 0.85, '✅'),
    ('Temporal\nGrounding', 0.80, '✅'),
    ('Video\nSummarization', 0.78, '✅'),
    ('Long Video\n(1+ hr) QA', 0.70, '⚠️'),
    ('Real-time\nVideo Chat', 0.65, '⚠️'),
    ('Fine-grained\nAction Recognition', 0.55, '⚠️'),
    ('Precise Counting\n& Tracking', 0.40, '❌'),
    ('3D Spatial\nReasoning', 0.35, '❌'),
    ('Audio-Visual\nReasoning', 0.25, '❌'),
]

task_names = [t[0] for t in tasks]
scores = [t[1] for t in tasks]
status = [t[2] for t in tasks]

color_map = {'✅': '#2ecc71', '⚠️': '#f39c12', '❌': '#e74c3c'}
bar_colors = [color_map[s] for s in status]

fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.barh(range(len(tasks)), scores, color=bar_colors, edgecolor='white', linewidth=1.5, height=0.7)
ax.set_yticks(range(len(tasks)))
ax.set_yticklabels(task_names, fontsize=9)
ax.set_xlabel('Relative VLM Effectiveness', fontsize=11)
ax.set_title('VLM Effectiveness Across Video Understanding Tasks', fontsize=13, fontweight='bold')
ax.set_xlim(0, 1.15)
ax.invert_yaxis()

# Add status labels
for i, (score, s) in enumerate(zip(scores, status)):
    ax.text(score + 0.02, i, f'{s} {score:.0%}', va='center', fontsize=9, fontweight='bold')

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#2ecc71', label='Strong performance'),
    Patch(facecolor='#f39c12', label='Improving / Mixed'),
    Patch(facecolor='#e74c3c', label='Significant challenges'),
]
ax.legend(handles=legend_elements, loc='lower right', fontsize=9)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(True, axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

8. Challenges & Limitations ¶

Despite impressive progress, VLMs for video understanding face significant challenges:

8.1 🔴 Frame Sampling Creates an Information Bottleneck¶

The fundamental trade-off: Sampling at 2 fps discards about 93% of frames from a 30fps video (only 2 of every 30 are kept). This means:

  • Brief events can be missed entirely — a hand gesture lasting 0.3 seconds may fall between sampled frames
  • Rapid motion causes aliasing — fast-moving objects appear as discontinuous jumps
  • Subtle temporal patterns are lost — lip movements, fine-grained actions, rhythmic patterns

Even with dynamic FPS, the model cannot recover information from frames that were never sampled.
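
A back-of-the-envelope model makes the risk concrete. The sketch below assumes uniform sampling, instantaneous frames, and an event whose onset is random relative to the sampling grid; under those assumptions an event of duration d sampled at rate f is missed with probability max(0, 1 - d·f). The function names are illustrative.

```python
def frames_discarded_fraction(sample_fps, source_fps=30.0):
    """Fraction of source frames dropped by uniform subsampling."""
    return 1.0 - sample_fps / source_fps

def miss_probability(event_duration_s, sample_fps):
    """Probability that no sampled frame falls inside the event.

    Assumes uniform sampling and a random event onset; events at least
    one sampling interval long are always captured.
    """
    return max(0.0, 1.0 - event_duration_s * sample_fps)

print(f"Discarded at 2 fps: {frames_discarded_fraction(2):.1%}")
for fps in (1, 2, 4):
    print(f"0.3 s gesture at {fps} fps: "
          f"missed with probability {miss_probability(0.3, fps):.2f}")
```

Under this model the 0.3-second gesture is missed about 40% of the time at 2 fps and is always captured at 4 fps, since the sampling interval (0.25 s) is then shorter than the event.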

8.2 🔴 Quadratic Attention Scaling¶

Transformer self-attention scales as $O(n^2)$ with sequence length. For video:

$$\text{Attention cost} \propto (N_\text{text} + N_\text{visual})^2$$

A 1-minute video at 2fps with 448×448 frames (assuming 14-pixel patches: 32×32 patches per frame, merged 2×2 into 256 tokens) generates:

$$120 \text{ frames} \times 256 \text{ tokens/frame} / 2 \text{ (3D conv)} = 15{,}360 \text{ visual tokens}$$

This creates severe compute and memory pressure, especially for longer videos. Window attention in the Qwen2.5-VL vision encoder lowers the cost of encoding frames, but the LLM backbone still attends over every visual token, so sequence length remains a core constraint.

8.3 🔴 Temporal Reasoning Is Shallow¶

Current VLMs struggle with:

  • Causal reasoning: "What caused the vase to fall?" requires linking an earlier action to a later event
  • Counting temporal events: "How many times did the person wave?" — models often approximate rather than count precisely
  • Ordering: "Did event A happen before or after event B?" — surprisingly difficult even with temporal embeddings
  • Speed estimation: "Is the car accelerating or decelerating?" — requires fine temporal resolution

8.4 🔴 No Native Audio Understanding¶

Most VLMs (including Qwen-VL) process only the visual stream. This means:

  • Spoken dialogue is not directly understood
  • Sound effects that provide crucial context (a door slamming, a phone ringing) are invisible
  • Music and audio cues that drive narrative are missed

Video-MME evaluates models both with and without subtitles, and scores consistently rise by roughly 5-8 points when subtitles are added, highlighting the audio information gap.

8.5 🔴 Hallucination & Confabulation¶

VLMs can:

  • Hallucinate events that didn't occur in the video
  • Confuse temporal order of events
  • Over-rely on language priors — generating plausible but incorrect descriptions based on common sense rather than actual visual evidence
  • Struggle with negation — correctly identifying what is not happening is harder than what is

8.6 🟡 Resolution vs Throughput Trade-off¶

Higher frame resolution improves detail recognition but:

  • Increases tokens per frame quadratically with side length (doubling both width and height quadruples the token count)
  • Forces either fewer frames (losing temporal coverage) or exceeding token budgets
  • The optimal balance is task-dependent and hard to determine automatically
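
The trade-off can be quantified with a small calculation. The sketch below assumes 14-pixel patches, 2×2 spatial merging, and pairs of frames sharing tokens; the 16,384-token budget and the function names are hypothetical values chosen for illustration.

```python
def tokens_per_frame(height, width, patch_size=14, merge_size=2,
                     temporal_patch_size=2):
    """Visual tokens contributed per sampled frame (frame pairs share tokens)."""
    grid_h = height // patch_size // merge_size
    grid_w = width // patch_size // merge_size
    return grid_h * grid_w / temporal_patch_size

def max_frames(token_budget, height, width):
    """Largest number of frames that fits in the visual-token budget."""
    return int(token_budget // tokens_per_frame(height, width))

budget = 16_384  # hypothetical visual-token budget
for side in (224, 448, 896):
    print(f"{side}x{side}: {tokens_per_frame(side, side):.0f} tokens/frame, "
          f"up to {max_frames(budget, side, side)} frames")
```

Each doubling of the side length quarters the number of frames that fit, so at 2 fps the same budget covers roughly four times less video.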

8.7 🟡 Evaluation Challenges¶

Video understanding benchmarks have their own limitations:

  • Many can be partially solved from single frames (static bias)
  • Multiple-choice formats allow guessing
  • Open-ended evaluation is subjective and expensive
  • Real-world use cases (surveillance, medical video, industrial inspection) are underrepresented
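
One common way to account for guessing on multiple-choice benchmarks is to rescale accuracy so that random guessing maps to 0 and a perfect score maps to 1. The sketch below applies that standard correction; the assumption of four answer options per question and the function name are illustrative.

```python
def chance_corrected_accuracy(accuracy, n_choices):
    """Rescale accuracy so random guessing scores 0 and perfect scores 1."""
    chance = 1.0 / n_choices
    return (accuracy - chance) / (1.0 - chance)

# Example: a 65.1% raw score, assuming four options per question
print(f"{chance_corrected_accuracy(0.651, 4):.3f}")  # 0.535
```

The corrected figure makes clear how much of a headline score sits above the guessing floor.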
In [ ]:
# Visualization: The information loss from frame sampling

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Information retention vs sampling rate
ax = axes[0]
original_fps = 30
sample_rates = np.linspace(0.5, 30, 100)
retention = sample_rates / original_fps * 100

ax.fill_between(sample_rates, retention, 100, alpha=0.15, color='red', label='Information lost')
ax.fill_between(sample_rates, 0, retention, alpha=0.15, color='green', label='Information retained')
ax.plot(sample_rates, retention, 'k-', linewidth=2)

# Mark common operating points
for fps, label, color in [(1, '1 fps', '#e74c3c'), (2, '2 fps (default)', '#f39c12'),
                           (4, '4 fps', '#2ecc71'), (8, '8 fps', '#3498db')]:
    ret = fps / original_fps * 100
    ax.plot(fps, ret, 'o', color=color, markersize=10, zorder=5)
    ax.annotate(f'{label}\n({ret:.1f}%)', xy=(fps, ret), xytext=(fps + 2, ret + 5),
                fontsize=8, fontweight='bold', color=color,
                arrowprops=dict(arrowstyle='->', color=color, lw=1.5))

ax.set_xlabel('Sampling Rate (fps)', fontsize=11)
ax.set_ylabel('Frame Retention (%)', fontsize=11)
ax.set_title('Frame Information Retention vs Sampling Rate\n(original: 30 fps)', fontsize=11, fontweight='bold')
ax.legend(fontsize=9, loc='center right')
ax.set_ylim(0, 105)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(True, alpha=0.3)

# Right: Compute cost scaling
ax = axes[1]
n_tokens = np.linspace(100, 100000, 200)
attention_cost = n_tokens ** 2

# Plot the curve first so the labels below can be anchored to its range
ax.plot(n_tokens, attention_cost, 'r-', linewidth=2, label='$O(n^2)$ attention')

# Mark video lengths
durations = [10, 60, 300, 3600]  # seconds
duration_labels = ['10s', '1min', '5min', '1hr']
fps = 2
tokens_per_frame = 256  # 448×448 with 14-pixel patches: 32×32 patches, 2×2 merged
y_label = attention_cost.max()  # place labels at the top of the plotted range

for dur, label in zip(durations, duration_labels):
    n_tok = dur * fps * tokens_per_frame / 2  # /2 for 3D conv
    if n_tok <= n_tokens[-1]:  # the 1hr case exceeds the plotted range and is skipped
        ax.axvline(x=n_tok, color='gray', linestyle=':', alpha=0.5)
        ax.text(n_tok, y_label, f' {label}\n({n_tok:.0f} tok)',
                fontsize=8, rotation=0, va='top')
ax.set_xlabel('Total Visual Tokens', fontsize=11)
ax.set_ylabel('Relative Attention Cost', fontsize=11)
ax.set_title('Attention Cost Scaling with Video Length\n(2 fps, 448×448)', fontsize=11, fontweight='bold')
ax.set_yscale('log')
ax.set_xscale('log')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(True, alpha=0.3)
ax.legend(fontsize=10)

plt.tight_layout()
plt.show()

Summary: VLM Strengths vs Challenges for Video Understanding¶

| Dimension | Current State | Key Bottleneck |
| --- | --- | --- |
| Short video QA | ✅ Strong | Benchmark saturation |
| Video OCR | ✅ Strong | Frame resolution |
| Temporal localization | ✅ Good (improving) | Absolute time alignment |
| Long video comprehension | ⚠️ Improving | Token budget, context length |
| Fine-grained temporal reasoning | ⚠️ Mixed | Frame sampling, attention limits |
| Precise counting & tracking | ❌ Weak | Not designed for continuous tracking |
| Audio-visual understanding | ❌ Missing | No audio modality |
| Real-time processing | ❌ Limited | Inference latency |
| 3D spatial reasoning from video | ⚠️ Emerging | Qwen3-VL introduces 3D grounding |

Looking Forward¶

The Qwen team has signaled several directions:

  1. Enhanced reasoning: Qwen3-VL Thinking editions incorporate chain-of-thought reasoning for complex video tasks
  2. Omni-model convergence: Moving toward unified models that handle text, images, video, and audio
  3. MoE architectures: The Qwen3-VL-235B-A22B (Mixture of Experts) demonstrates scaling without proportional compute increase
  4. Extended context: From 32K (Qwen2-VL) to 256K (Qwen3-VL) native context, with YaRN extension to 1M+
  5. Agent capabilities: Combining video understanding with tool use and autonomous operation

9. References ¶

Primary Papers¶

  1. Qwen2-VL: Wang, P., Bai, S., et al. "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution." arXiv:2409.12191, 2024.

  2. Qwen2.5-VL: Bai, S., Chen, K., et al. "Qwen2.5-VL Technical Report." arXiv:2502.13923, 2025.

  3. Qwen3-VL: Bai, S., Cai, Y., et al. "Qwen3-VL Technical Report." arXiv:2511.21631, 2025.

  4. Qwen-VL: Bai, J., Bai, S., et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv:2308.12966, 2023.

Foundational Work¶

  1. RoPE: Su, J., et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." Neurocomputing, 2024.

  2. ViT: Dosovitskiy, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR, 2021.

  3. ViViT: Arnab, A., et al. "ViViT: A Video Vision Transformer." ICCV, 2021.

  4. YaRN: Peng, B., et al. "YaRN: Efficient Context Window Extension of Large Language Models." arXiv:2309.00071, 2023.

  5. NaViT: Dehghani, M., et al. "Patch n'Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution." NeurIPS, 2023.

Benchmarks¶

  1. Video-MME: Fu, C., et al. "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis." arXiv:2405.21075, 2024.

  2. MVBench: Li, K., et al. "MVBench: A Comprehensive Multi-modal Video Understanding Benchmark." CVPR, 2024.

  3. EgoSchema: Mangalam, K., et al. "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding." NeurIPS, 2023.

Resources¶

  • 🤗 Qwen2.5-VL on Hugging Face: https://huggingface.co/collections/Qwen/qwen25-vl
  • 🐙 Qwen3-VL GitHub: https://github.com/QwenLM/Qwen3-VL
  • 📚 Cookbooks: https://github.com/QwenLM/Qwen3-VL/tree/main/cookbooks
  • 💬 Qwen Blog: https://qwen.ai/blog