CS 486/686 Lecture 23 - Dissecting a Vision-Language Model

CS 486/686
Dissecting a Vision-Language Model

Yuntian Deng

Lecture 23

How an image becomes language-model context

Learning goals

Trace image + text inference end to end.
Explain visual tokens and vision encoders.
Explain how visual features enter a language decoder.
Recognize VLM tasks and failure modes.

From text-only to multimodal

L21

text → tokens → LM → answer

L23

image + text → visual + text tokens → answer

A VLM is a language model with extra visual context.

Running example

CS486

What is shown in this image?

The model must combine visual evidence with language generation.

What makes this hard?

pixels

huge arrays

space

details have positions

language

answer must be text

The VLM pipeline

Image information becomes context the language model can use.

What gets loaded?

processor

image preprocessing

tokenizer

text tokens

vision encoder

visual features

decoder

generates text

Demo path and fallback

optional live path

Qwen3-VL + Transformers.js/WebGPU

reliable path

precomputed traces in slides

Optional browser code sketch

const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen3VLForConditionalGeneration.from_pretrained(model_id, {
  device: "webgpu",
  dtype: { vision_encoder: "fp16", decoder_model_merged: "q4f16" },
});

Conceptual code only: live VLM inference depends on browser and GPU support.

The same idea in Python

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    dtype="auto",
    device_map="auto",
)

Images must become tensors

Raw image files are not what the network consumes.

Patchify the image

A vision transformer treats patches like a sequence.

Patch vectors are visual tokens.

Patchify a real imagelive

Drag the grid: each patch becomes one visual token.

Interactive on the live slides: a real image is split into a grid of patches. A 16x16 grid gives 256 patches, and each patch becomes one visual token fed to the model — more patches means finer detail but a longer visual-token sequence.

The vision encoder extracts visual features

The image is processed before it reaches the language model.

Qwen-style VLMs use a dedicated vision encoder and a merger/projector.

Why CLIP matters

CLIP-style training aligns image and text embeddings.

It is the bridge from visual features to language semantics.

CLIP is not a full VLM

CLIP

compare image/text embeddings

VLM

generate text conditioned on image tokens

CLIP aligns image and textlive model

Which caption matches the image? CLIP scores them, no training needed.

Interactive on the live slides: real CLIP scores how well each caption matches an image. A portrait scores highest on "a portrait of a person", the chart on "a bar chart", the schedule on "a document with a schedule" — zero-shot, with no task-specific training. Type your own caption and CLIP scores it live in your browser.

The projector translates vision to LM space

visual features → MLP merger/projector → LM-sized visual tokens

Multimodal context sequence

img1img2img3Whatisshown?Alaptop

The decoder can attend to image tokens and text tokens in the same context.

Chat template with an image placeholder

user[image]

userDescribe this image.

assistant

The prompt is structured multimodal context, not just pixels.

Generation is still autoregressive

Trace: captioning

CS486

Describe this image.

A laptop and cup are sitting on a desk.

Trace: visual question answering

CS486

What object is next to the laptop?

A cup is next to the laptop.

Trace: document/OCR understanding

Course Schedule Assignment 2 due Thu Jul 9 Chat 9 out Thu Jul 9

When is Assignment 2 due?

Assignment 2 is due Thu Jul 9.

Trace: chart understanding

Which bar is largest?

Bar B is largest.

Trace: UI/screenshot understanding

Slides

Which button downloads the slides?

Click the PDF button.

The real model, across taskslive model

Actual Qwen3-VL-2B answers on real images: caption, VQA, OCR, chart.

Interactive on the live slides: switch between tasks and see a real vision-language model (Qwen3-VL-2B) answer questions about real images — describing a portrait, reading a schedule (OCR), and identifying the largest bar in a chart. These are the actual precomputed model outputs, not mockups.

Some VLMs can ground answers spatially

Instead of only text, a model may output a box or point.

CS486

Grounding is useful for documents, UI agents, and object localization.

What gets trained?

vision encoder

visual features

projector

align dimensions

decoder

language generation

A VLM can compile vision into weights

The small interpreter never sees pixels. The compiler carries visual information in the generated weights.

Result: beats the tested VLM baselines on three CoSyn diagram tasks.

Limit: long Im2LaTeX outputs suffer when examples crowd the context window.

VLM ability depends on multimodal data

captions

image-text pairs

documents

OCR/forms/tables

screens

UI/action traces

Failure mode: hallucinated visual details

The model may describe objects that are not present.

Fluent visual descriptions are not guaranteed faithful.

Failure mode: OCR errors

The image says Jul 9; the model reads Jul 8.

Reading text in images is powerful but brittle.

Failure mode: spatial reasoning

left/right

easy to flip

counting

small objects

relations

above/below

Failure mode: prompt sensitivity

describe

Describe what you see.

guess

Guess what is happening.

Visual answers can change with linguistic framing.

VLMs enable visual agents

observe screenshot → reason → click/type → observe again

Same tool/agent safety concerns from L22 still apply.

What did we add beyond text LMs?

L21

text tokens only

L23

image tokens + text tokens

Both generate text autoregressively.

Next question

Now models can understand images.

How do models generate images or future frames?

L24: Dissecting Diffusion and World Models.

Recap

✓ Images become tensors, patches, and visual tokens.
✓ A vision encoder extracts visual features.
✓ A projector maps visual features into LM space.
✓ The decoder attends to image and text context.
✓ VLMs can caption, answer, read, and reason visually.
✓ Failure modes include hallucination, OCR, spatial mistakes, and prompt sensitivity.