Claude Vision API: Analyze Images, Documents & Screenshots Like a Pro

You have a folder of invoices, a dashboard screenshot, or a chart you want to interpret programmatically — and you want Claude to do the heavy lifting. Claude's vision capabilities let you pass images directly into the API and get back structured analysis, extracted text, or nuanced descriptions with a few lines of code.

This guide covers everything a developer needs: how the vision API works, how to pass images (URL vs base64), practical code in Python and JavaScript, real-world use cases, and the tips that actually improve output quality.

What Claude's Vision API Can Do

Claude's multimodal API accepts both text and images in a single request. Unlike dedicated OCR tools, Claude understands context: it doesn't just extract text, it reasons about what's in the image.

What it handles well:

Scanned documents and PDFs (invoices, contracts, receipts)
Screenshots of UIs, dashboards, and error messages
Charts, graphs, and data visualizations
Handwritten notes (with reasonable accuracy)
Product photos and e-commerce imagery
Medical images (with appropriate prompting)
Diagrams, flowcharts, and whiteboards

Current limitations to know:

Max image size: 5 MB per image
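Max images per request: 20 on claude.ai; the API allows more (Anthropic's documentation has listed 100 per request), so check the current docs for your model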
No video support — static frames only
Precise pixel-level coordinates are unreliable; use for understanding, not measurement
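
A pre-flight check can catch oversized or unsupported files before they reach the API. The sketch below is a minimal example, assuming the 5 MB cap means 5 × 1024 × 1024 bytes and that the accepted extensions are the four image formats this guide names later. `check_image`, `MAX_IMAGE_BYTES`, and `SUPPORTED_SUFFIXES` are hypothetical names, not part of the SDK.

```python
from pathlib import Path

# Assumption: the 5 MB per-image cap is treated as 5 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 5 * 1024 * 1024
# Assumption: extensions for the JPEG, PNG, GIF, and WebP formats.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def check_image(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks sendable."""
    file = Path(path)
    if not file.is_file():
        return [f"{path}: file not found"]

    problems = []
    if file.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"{path}: unsupported extension {file.suffix!r}")

    size = file.stat().st_size
    if size > MAX_IMAGE_BYTES:
        problems.append(f"{path}: {size} bytes exceeds the {MAX_IMAGE_BYTES}-byte cap")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a batch in a single pass.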

Claude Sonnet 4.6 is the recommended model for vision tasks: it balances accuracy and cost well. Haiku is faster and cheaper for simple extraction; Opus is best for complex medical or legal documents.

Two Ways to Pass Images

Method 1: Image URL

If your image is publicly accessible, pass the URL directly. Claude fetches it at inference time.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png",
                    },
                },
                {
                    "type": "text",
                    "text": "Describe the trend shown in this chart. What is the highest value and when does it occur?"
                }
            ],
        }
    ],
)

print(response.content[0].text)

When to use URLs: Publicly hosted images, CDN assets, web scraping pipelines where you already have the URL. Avoids base64 encoding overhead.

Method 2: Base64 Encoding

For local files, private images, or dynamic content, encode the image as base64.

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def analyze_local_image(image_path: str, question: str) -> str:
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")
    
    # Detect media type from extension
    extension = Path(image_path).suffix.lower()
    media_type_map = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg", 
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp",
    }
    media_type = media_type_map.get(extension, "image/jpeg")
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": base64_image,
                        },
                    },
                    {"type": "text", "text": question}
                ],
            }
        ],
    )
    return response.content[0].text

# Usage
result = analyze_local_image("./invoice.png", "Extract all line items, quantities, and totals as JSON")
print(result)

Supported image formats: JPEG, PNG, GIF, WebP. PDFs cannot be sent as image blocks; one option is to convert pages to PNG first (see the document section below).
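
Since the two methods differ only in the `source` object, one small helper can produce either kind of image block. This is a sketch under assumptions: `build_image_block` is a hypothetical name, and treating any string that starts with `http://` or `https://` as a URL is a simplification.

```python
import base64
from pathlib import Path

# Extension-to-media-type map for the four supported image formats.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def build_image_block(source: str) -> dict:
    """Build an image content block from a URL or a local file path."""
    if source.startswith(("http://", "https://")):
        return {"type": "image", "source": {"type": "url", "url": source}}

    suffix = Path(source).suffix.lower()
    if suffix not in MEDIA_TYPES:
        # Fail loudly rather than guess a media type for an unknown extension.
        raise ValueError(f"Unsupported image extension: {suffix!r}")

    data = base64.standard_b64encode(Path(source).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": MEDIA_TYPES[suffix], "data": data},
    }
```

The returned dict can go straight into the `content` list of a message, ahead of the text block that carries the question.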

JavaScript / Node.js Example

import Anthropic from "@anthropic-ai/sdk";
import fs from "fs";

const client = new Anthropic();

async function analyzeImage(imagePath, prompt) {
  const imageBuffer = fs.readFileSync(imagePath);
  const base64Image = imageBuffer.toString("base64");
  
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: "image/png",
              data: base64Image,
            },
          },
          {
            type: "text",
            text: prompt,
          },
        ],
      },
    ],
  });
  
  return response.content[0].text;
}

// Extract data from a screenshot
const result = await analyzeImage(
  "./dashboard-screenshot.png",
  "List all the KPI metrics visible in this dashboard with their current values."
);
console.log(result);

4 High-Value Use Cases with Code

1. Invoice and Receipt Extraction

Automate accounts payable by extracting structured data from scanned documents.

import json

def extract_invoice_data(invoice_path: str) -> dict:
    """Extract invoice fields as a dict. Raises json.JSONDecodeError if the reply is not pure JSON."""
    result = analyze_local_image(
        invoice_path,
        """Extract invoice data as JSON with this exact structure:
        {
          "vendor": "company name",
          "invoice_number": "...",
          "date": "YYYY-MM-DD",
          "due_date": "YYYY-MM-DD",
          "line_items": [
            {"description": "...", "quantity": 0, "unit_price": 0.00, "total": 0.00}
          ],
          "subtotal": 0.00,
          "tax": 0.00,
          "total": 0.00,
          "currency": "USD"
        }
        Return only valid JSON, no explanation."""
    )
    return json.loads(result)

Accuracy tip: Specifying the exact JSON structure in the prompt dramatically improves consistency. Add "confidence": 0-100 fields if you need to flag uncertain extractions for human review.
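
Even with "return only valid JSON" in the prompt, a reply can arrive wrapped in a Markdown code fence, and confidence scores are only useful if something reads them. The sketch below shows one way to handle both. It assumes the model was asked for a top-level "confidence" object mapping field names to 0-100 scores, which is one possible shape, not a fixed convention. `parse_json_reply` and `fields_needing_review` are hypothetical helper names.

```python
import json

FENCE = "`" * 3  # the Markdown code-fence marker


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (it may carry a "json" label) ...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ... and everything from the closing fence onward.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)


def fields_needing_review(record: dict, threshold: int = 80) -> list[str]:
    """List fields whose confidence score falls below the threshold.

    Assumption: record["confidence"] maps field names to 0-100 scores.
    """
    scores = record.get("confidence", {})
    return sorted(name for name, score in scores.items() if score < threshold)
```

A record that returns a non-empty list from `fields_needing_review` can be routed to a human queue instead of being posted automatically. The threshold of 80 is an arbitrary starting point.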

2. Error Screenshot Triage

Feed error screenshots into Claude for automatic diagnosis — useful in support ticketing and CI pipelines.

import json

def triage_error_screenshot(screenshot_path: str) -> dict:
    """Diagnose an error screenshot. Raises json.JSONDecodeError if the reply is not pure JSON."""
    result = analyze_local_image(
        screenshot_path,
        """Analyze this error screenshot and return JSON:
        {
          "error_type": "...",
          "error_message": "exact text from screenshot",
          "likely_cause": "...",
          "suggested_fix": "...",
          "severity": "low|medium|high|critical",
          "requires_human": true/false
        }
        Return only valid JSON, no explanation."""
    )
    return json.loads(result)

3. Chart and Graph Interpretation

Extract insights from analytics dashboards without building custom chart parsers.

def interpret_chart(chart_path: str, context: str = "") -> str:
    prompt = f"""Analyze this data visualization.
    {f'Context: {context}' if context else ''}
    
    Provide:
    1. Chart type and what it measures
    2. Key trend or pattern (1-2 sentences)
    3. The highest and lowest data points with approximate values
    4. Any anomalies or notable inflection points
    5. One actionable insight based on the data
    """
    return analyze_local_image(chart_path, prompt)

# Example
insight = interpret_chart(
    "./q2-revenue-chart.png",
    context="This is our Q2 2026 revenue by product line"
)

4. Multi-Image Comparison

Claude can handle multiple images in one request — useful for before/after comparisons, A/B test screenshots, or product catalog analysis.

def compare_images(image_paths: list[str], comparison_prompt: str) -> str:
    content = []
    
    for i, path in enumerate(image_paths, 1):
        image_data = Path(path).read_bytes()
        base64_image = base64.standard_b64encode(image_data).decode("utf-8")
        
        content.append({
            "type": "text",
            "text": f"Image {i}:"
        })
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64_image,
            }
        })
    
    content.append({"type": "text", "text": comparison_prompt})
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

# A/B test screenshot comparison
result = compare_images(
    ["./variant-a.png", "./variant-b.png"],
    "Compare these two landing page designs. Which has stronger visual hierarchy and CTA placement? Be specific."
)
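
When a comparison set is larger than a single request allows, the paths can be split into groups first. The sketch below is a minimal example. `chunk_paths` is a hypothetical helper, and `max_per_request` is left for the caller to supply because the right value depends on the limit that applies to your model.

```python
from typing import Iterator


def chunk_paths(paths: list[str], max_per_request: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most max_per_request paths."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    for start in range(0, len(paths), max_per_request):
        yield paths[start:start + max_per_request]
```

Each group can then be sent as its own multi-image request. Note that images in different groups cannot be compared against each other directly, so keep related images in the same group.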

Working with PDFs and Multi-Page Documents

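Image content blocks accept images, not raw PDFs. The Messages API also has a separate document content type for PDFs (see Anthropic's PDF support documentation), but converting pages to images first keeps everything on the vision path used in this guide and lets you control the resolution. The pdf2image library (which needs Poppler installed on the system) makes this straightforward: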

import base64
import io
from typing import Optional

from pdf2image import convert_from_path  # needs Poppler installed on the system

def analyze_pdf(pdf_path: str, question: str, pages: Optional[list[int]] = None) -> str:
    """
    Analyze specific pages from a PDF.
    pages: list of 1-indexed page numbers (None = first 5 pages)
    """
    images = convert_from_path(pdf_path, dpi=200)
    
    if pages:
        # Keep (page number, image) pairs so the labels match the real page numbers;
        # skip numbers outside the document instead of wrapping around.
        selected = [(p, images[p - 1]) for p in pages if 1 <= p <= len(images)]
    else:
        selected = list(enumerate(images[:5], 1))  # Cap at 5 pages for cost control
    
    content = []
    for page_number, image in selected:
        # Encode the page as PNG in memory; no temporary file is needed
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        base64_image = base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
        
        content.append({"type": "text", "text": f"Page {page_number}:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": base64_image}
        })
    
    content.append({"type": "text", "text": question})
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

# Analyze a contract
summary = analyze_pdf(
    "./contract.pdf",
    "Summarize the key obligations, payment terms, and termination clauses.",
    pages=[1, 2, 3]
)

DPI guidance: 150 DPI is minimum for text extraction. Use 200-300 DPI for documents with small fonts or handwriting. Higher DPI increases image size and API cost.
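
To see how DPI translates into image size, multiply the page dimensions in inches by the DPI. The sketch below works this out for a US Letter page (8.5 × 11 inches, used here only as an example). `rendered_size` is a hypothetical helper name.

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is 8.5 x 11 inches (an example page size, not a requirement).
LETTER = (8.5, 11.0)

for dpi in (150, 200, 300):
    width, height = rendered_size(*LETTER, dpi)
    print(f"{dpi} DPI -> {width}x{height} px")
```

At 200 DPI a Letter page renders to 1700×2200 px, so pixel count grows with the square of the DPI: going from 150 to 300 DPI quadruples it.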

Prompting Tips That Actually Improve Vision Results

Technique	What to Do	Why It Works
Be specific about structure	"Return JSON with fields: vendor, date, total"	Reduces hallucination, improves parsability
Reference image regions	"In the upper-right corner of the chart..."	Grounds Claude's attention
Provide context	"This is a financial dashboard for a SaaS company"	Claude uses domain knowledge more accurately
Ask for confidence	"If any field is unclear, set value to null and add a 'flags' array"	Surfaces uncertain extractions
Chain analysis	First ask "what type of document is this?", then ask specific questions	Works better for unknown document types
Use system prompts	Set domain context in the system prompt once	Reduces per-request overhead

The most common mistake: Asking Claude to extract data without specifying the output format. Unstructured responses are harder to parse and less consistent across runs. Always specify JSON schema or a table format for extraction tasks.
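
Several of the techniques in the table can be combined in one reusable prompt builder. The sketch below is a minimal example. `build_extraction_prompt` is a hypothetical helper, and its wording is only one way to phrase these instructions.

```python
def build_extraction_prompt(fields: list[str], context: str = "") -> str:
    """Compose an extraction prompt with context, a field list, and a null-on-unclear rule."""
    if not fields:
        raise ValueError("fields must not be empty")

    lines = []
    if context:
        # Provide context so domain knowledge is applied more accurately.
        lines.append(f"Context: {context}")
    # Be specific about structure.
    lines.append("Return JSON with exactly these fields: " + ", ".join(fields) + ".")
    # Ask for uncertainty to be surfaced instead of guessed.
    lines.append(
        "If any field is unclear, set its value to null and add the field name to a 'flags' array."
    )
    lines.append("Return only valid JSON, no explanation.")
    return "\n".join(lines)


prompt = build_extraction_prompt(
    ["vendor", "date", "total"],
    context="This is a scanned supplier invoice",
)
```

Keeping the prompt in one function means every extraction call uses the same format instructions, which helps consistency across runs.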

Cost and Performance Optimization

Vision requests cost more than text-only requests because images consume input tokens. Here's how to manage costs in production:

Image token estimation: The token count depends on image dimensions. A 1092×1092 image uses approximately 1,590 input tokens. Use the formula: tokens ≈ (width × height) / 750 for a rough estimate. Resize before sending:

from PIL import Image
import io

def resize_for_claude(image_path: str, max_dimension: int = 1568) -> bytes:
    """
    Claude's optimal image dimension is ≤1568px on the longest side.
    Larger images don't improve accuracy but cost more tokens.
    """
    with Image.open(image_path) as img:
        if max(img.size) > max_dimension:
            img.thumbnail((max_dimension, max_dimension), Image.LANCZOS)
        
        buffer = io.BytesIO()
        img.save(buffer, format="PNG", optimize=True)
        return buffer.getvalue()
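
The rough formula tokens ≈ (width × height) / 750 is easy to turn into a budgeting helper. The sketch below assumes an image whose longest side exceeds 1568 px is scaled down proportionally first, mirroring the resize helper above. `estimate_image_tokens` is a hypothetical name, and the result is an estimate, not a billing figure.

```python
def estimate_image_tokens(width: int, height: int, max_dimension: int = 1568) -> int:
    """Rough input-token estimate for an image, using tokens ~ (width * height) / 750."""
    longest = max(width, height)
    if longest > max_dimension:
        # Assumption: oversized images are scaled down to max_dimension on the
        # longest side, keeping the aspect ratio.
        scale = max_dimension / longest
        width, height = int(width * scale), int(height * scale)
    return round(width * height / 750)


print(estimate_image_tokens(1092, 1092))  # 1590, matching the figure quoted above
```

Multiplying the estimate by your expected image volume and the per-token input price gives a first-order cost projection before any requests are sent.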

Model selection by task:

Task	Recommended Model	Reason
Invoice/receipt extraction	claude-haiku-4-5	Simple structured extraction
Chart interpretation	claude-sonnet-4-6	Needs reasoning about trends
Legal/medical document review	claude-opus-4-8	Maximum accuracy required
Bulk image classification	claude-haiku-4-5	Speed + cost at scale
Multi-image comparison	claude-sonnet-4-6	Balance of capability and cost
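
In code, the table above can become a small lookup so the model choice lives in one place. This is a sketch: the task keys and `pick_model` are hypothetical names, and the model IDs are copied from the table as written, so verify them against the current model list before relying on them.

```python
# Model IDs copied from the table above; the task keys are hypothetical labels.
MODEL_BY_TASK = {
    "invoice_extraction": "claude-haiku-4-5",
    "chart_interpretation": "claude-sonnet-4-6",
    "legal_medical_review": "claude-opus-4-8",
    "bulk_classification": "claude-haiku-4-5",
    "multi_image_comparison": "claude-sonnet-4-6",
}

# Assumption: fall back to the mid-tier model for tasks not listed above.
DEFAULT_MODEL = "claude-sonnet-4-6"


def pick_model(task: str) -> str:
    """Return the model ID for a task label, or the default for unknown labels."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```

Centralizing the mapping makes it cheap to re-run an evaluation set against a different tier and change one line if the cheaper model turns out to be good enough.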

Batch processing with concurrency control:

import asyncio
from anthropic import AsyncAnthropic

async def batch_analyze_images(image_paths: list[str], prompt: str, concurrency: int = 5) -> list[str]:
    client = AsyncAnthropic()
    semaphore = asyncio.Semaphore(concurrency)
    
    async def analyze_one(path):
        async with semaphore:
            image_data = Path(path).read_bytes()
            base64_image = base64.standard_b64encode(image_data).decode("utf-8")
            
            response = await client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=512,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": base64_image}},
                        {"type": "text", "text": prompt}
                    ]
                }]
            )
            return response.content[0].text
    
    tasks = [analyze_one(path) for path in image_paths]
    return await asyncio.gather(*tasks)

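# Process a folder of invoices concurrently (max 5 requests in flight at a time)
# Assumption: the invoices are PNG files under ./invoices (hypothetical path)
invoice_paths = [str(p) for p in Path("./invoices").glob("*.png")]
results = asyncio.run(
    batch_analyze_images(invoice_paths, "Extract vendor name and total amount as JSON", concurrency=5)
)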
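
A semaphore caps concurrency, but individual requests can still fail transiently under load. One common complement is retrying with exponential backoff. The sketch below is a generic, standard-library-only helper. `with_retries` is a hypothetical name, and which exception types count as retryable is left to the caller.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    make_call: Callable[[], Awaitable[T]],
    retry_on: tuple[type[BaseException], ...],
    attempts: int = 4,
    base_delay: float = 1.0,
) -> T:
    """Await make_call(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(attempts):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay * 2^attempt plus random jitter so that
            # concurrent tasks do not all retry at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

    raise AssertionError("unreachable")
```

With the Anthropic SDK, one option is to wrap the `client.messages.create(...)` call in a lambda and pass `retry_on=(anthropic.RateLimitError,)`. Passing `return_exceptions=True` to `asyncio.gather` additionally keeps one failed image from discarding the results of the rest.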

Key Takeaways

URL vs base64: Use URL for public images, base64 for local/private files. Both work equally well for accuracy.
Resize to ≤1568px: Larger images don't improve accuracy but increase cost. For square images, 1092×1092 (about 1,590 tokens) is a good target.
Always specify output format: Ask for JSON with a defined schema for extraction tasks. Unstructured output is harder to use in production.
Match model to task: Haiku for bulk extraction, Sonnet for analysis, Opus for high-stakes documents.
Convert PDFs to PNG at 200 DPI: That's the sweet spot between accuracy and file size for most document types.
Use async for batch jobs: The AsyncAnthropic client lets you process dozens of images concurrently; cap in-flight requests with a semaphore to stay within your rate limits.

Next Steps

If you're integrating Claude vision into a production pipeline, the next skill to add is structured output with tool use. You define a tool whose input schema matches the fields you want, and Claude returns its extraction as that tool's arguments, which greatly reduces the manual JSON parsing and error handling on your end.
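
As a starting point, a tool-based extraction might look like the sketch below. The tool name `record_invoice`, its schema, and the helper `extract_invoice_with_tool` are hypothetical. The `client` argument is assumed to be an `anthropic.Anthropic` instance, and `image_block` an image content block like those built earlier in this guide.

```python
# Hypothetical tool definition: the schema lists the fields we want back.
INVOICE_TOOL = {
    "name": "record_invoice",
    "description": "Record the fields extracted from an invoice image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "total"],
    },
}


def extract_invoice_with_tool(client, image_block: dict, model: str = "claude-sonnet-4-6") -> dict:
    """Ask for the extraction as tool arguments and return them as a dict."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        tools=[INVOICE_TOOL],
        # Force the reply to be a call to our tool rather than free text.
        tool_choice={"type": "tool", "name": INVOICE_TOOL["name"]},
        messages=[{
            "role": "user",
            "content": [image_block, {"type": "text", "text": "Record this invoice."}],
        }],
    )
    for block in response.content:
        if block.type == "tool_use":
            return block.input  # already a dict, no json.loads needed
    raise ValueError("No tool_use block in the response")
```

The arguments arrive already parsed, but it is still worth validating them against your own rules (for example, that the total is positive) before writing them to a system of record.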

Want to put your Claude API knowledge to the test? AI for Anything's Claude Certified Architect practice tests include vision API questions drawn from the official CCA-F exam blueprint. The $19.99 test bank covers all multimodal, tool use, and agent patterns — the exact topics that show up on the certification.
