Claude Vision API: How to Analyze Images, Documents & Screenshots (2026 Guide)
Complete developer guide to Claude's vision API — pass images via URL or base64, analyze PDFs and screenshots, extract data from charts, and build production image pipelines. Python & JS examples.
Claude Vision API: Analyze Images, Documents & Screenshots Like a Pro
You have a folder of invoices, a dashboard screenshot, or a chart you want to interpret programmatically — and you want Claude to do the heavy lifting. Claude's vision capabilities let you pass images directly into the API and get back structured analysis, extracted text, or nuanced descriptions with a few lines of code.
This guide covers everything a developer needs: how the vision API works, how to pass images (URL vs base64), practical code in Python and JavaScript, real-world use cases, and the tips that actually improve output quality.
What Claude's Vision API Can Do
Claude's multimodal API accepts both text and images in a single request. Unlike dedicated OCR tools, Claude understands context — it doesn't just extract pixels, it reasons about what's in the image.
What it handles well:
- Scanned documents and PDFs (invoices, contracts, receipts)
- Screenshots of UIs, dashboards, and error messages
- Charts, graphs, and data visualizations
- Handwritten notes (with reasonable accuracy)
- Product photos and e-commerce imagery
- Medical images (with appropriate prompting)
- Diagrams, flowcharts, and whiteboards
Limitations to know:
- Max image size: 5 MB per image
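- Max images per request: the API documentation has listed up to 100 images per request (20 on claude.ai); confirm the current limit for your model before relying on it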
- No video support — static frames only
- Precise pixel-level coordinates are unreliable; use for understanding, not measurement
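The size and count limits above are cheap to check before a request is sent. The following is a minimal sketch, not part of the SDK: `check_image_batch` and `MAX_IMAGE_BYTES` are hypothetical names, and the per-request image limit is passed in by the caller because it depends on the model and may change.

```python
from pathlib import Path

MAX_IMAGE_BYTES = 5 * 1024 * 1024  # 5 MB per image, as listed above


def check_image_batch(paths: list[str], max_images: int) -> list[str]:
    """Return human-readable problems; an empty list means the batch looks OK.

    max_images is an assumption supplied by the caller: look up the
    current per-request limit for the model you are using.
    """
    problems = []
    if len(paths) > max_images:
        problems.append(f"{len(paths)} images exceeds the limit of {max_images}")
    for path in paths:
        size = Path(path).stat().st_size  # size on disk, before base64 encoding
        if size > MAX_IMAGE_BYTES:
            problems.append(f"{path}: {size} bytes exceeds {MAX_IMAGE_BYTES}")
    return problems
```

Running this before building the request lets a pipeline resize or split a batch instead of failing at the API.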
Claude Sonnet 4.6 is the recommended model for vision tasks: it balances accuracy and cost well. Haiku is faster and cheaper for simple extraction; Opus is best for complex medical or legal documents.
Two Ways to Pass Images
Method 1: Image URL
If your image is publicly accessible, pass the URL directly. Claude fetches it at inference time.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/chart.png",
                    },
                },
                {
                    "type": "text",
                    "text": "Describe the trend shown in this chart. What is the highest value and when does it occur?"
                }
            ],
        }
    ],
)
print(response.content[0].text)
```

Method 2: Base64 Encoding
For local files, private images, or dynamic content, encode the image as base64.
```python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def analyze_local_image(image_path: str, question: str) -> str:
    image_data = Path(image_path).read_bytes()
    base64_image = base64.standard_b64encode(image_data).decode("utf-8")

    # Detect media type from extension (unknown extensions fall back to JPEG)
    extension = Path(image_path).suffix.lower()
    media_type_map = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp",
    }
    media_type = media_type_map.get(extension, "image/jpeg")

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": base64_image,
                        },
                    },
                    {"type": "text", "text": question}
                ],
            }
        ],
    )
    return response.content[0].text

# Usage
result = analyze_local_image("./invoice.png", "Extract all line items, quantities, and totals as JSON")
print(result)
```

JavaScript / Node.js Example
```javascript
import Anthropic from "@anthropic-ai/sdk";
import fs from "fs";

const client = new Anthropic();

async function analyzeImage(imagePath, prompt) {
  const imageBuffer = fs.readFileSync(imagePath);
  const base64Image = imageBuffer.toString("base64");

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: "image/png",
              data: base64Image,
            },
          },
          {
            type: "text",
            text: prompt,
          },
        ],
      },
    ],
  });
  return response.content[0].text;
}

// Extract data from a screenshot (top-level await requires an ES module)
const result = await analyzeImage(
  "./dashboard-screenshot.png",
  "List all the KPI metrics visible in this dashboard with their current values."
);
console.log(result);
```

4 High-Value Use Cases with Code
1. Invoice and Receipt Extraction
Automate accounts payable by extracting structured data from scanned documents.
```python
import json

def extract_invoice_data(invoice_path: str) -> dict:
    # Reuses analyze_local_image() from the base64 example above
    result = analyze_local_image(
        invoice_path,
        """Extract invoice data as JSON with this exact structure:
{
  "vendor": "company name",
  "invoice_number": "...",
  "date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD",
  "line_items": [
    {"description": "...", "quantity": 0, "unit_price": 0.00, "total": 0.00}
  ],
  "subtotal": 0.00,
  "tax": 0.00,
  "total": 0.00,
  "currency": "USD"
}
Return only valid JSON, no explanation."""
    )
    # Raises json.JSONDecodeError if the reply is not bare JSON
    return json.loads(result)
```

Add "confidence": 0-100 fields to the schema if you need to flag uncertain extractions for human review.
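Extracted JSON is worth checking before it reaches accounts payable. The sketch below is an illustration rather than part of the original pipeline: it assumes the reply may arrive wrapped in a Markdown code fence despite the "JSON only" instruction, and it checks the invoice arithmetic using the schema fields from the prompt above. The helper names `parse_json_reply` and `invoice_totals_consistent` are hypothetical.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_json_reply(reply: str) -> dict:
    """Parse a JSON object from a model reply, tolerating a surrounding code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a "json" label) ...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ... and everything from the closing fence onward
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


def invoice_totals_consistent(invoice: dict, tolerance: float = 0.01) -> bool:
    """Check that line items sum to the subtotal and subtotal + tax equals the total."""
    line_sum = sum(item["total"] for item in invoice["line_items"])
    return (
        abs(line_sum - invoice["subtotal"]) <= tolerance
        and abs(invoice["subtotal"] + invoice["tax"] - invoice["total"]) <= tolerance
    )
```

An invoice that fails the arithmetic check is a good candidate for the human review queue, since a mismatch usually means a misread digit.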
2. Error Screenshot Triage
Feed error screenshots into Claude for automatic diagnosis — useful in support ticketing and CI pipelines.
```python
import json

def triage_error_screenshot(screenshot_path: str) -> dict:
    # Reuses analyze_local_image() from the base64 example above
    result = analyze_local_image(
        screenshot_path,
        """Analyze this error screenshot and return JSON:
{
  "error_type": "...",
  "error_message": "exact text from screenshot",
  "likely_cause": "...",
  "suggested_fix": "...",
  "severity": "low|medium|high|critical",
  "requires_human": true/false
}"""
    )
    return json.loads(result)
```

3. Chart and Graph Interpretation
Extract insights from analytics dashboards without building custom chart parsers.
```python
def interpret_chart(chart_path: str, context: str = "") -> str:
    prompt = f"""Analyze this data visualization.
{f'Context: {context}' if context else ''}
Provide:
1. Chart type and what it measures
2. Key trend or pattern (1-2 sentences)
3. The highest and lowest data points with approximate values
4. Any anomalies or notable inflection points
5. One actionable insight based on the data
"""
    return analyze_local_image(chart_path, prompt)

# Example
insight = interpret_chart(
    "./q2-revenue-chart.png",
    context="This is our Q2 2026 revenue by product line"
)
```

4. Multi-Image Comparison
Claude can handle multiple images in one request — useful for before/after comparisons, A/B test screenshots, or product catalog analysis.
```python
# Assumes base64, Path, and client from the Method 2 example

def compare_images(image_paths: list[str], comparison_prompt: str) -> str:
    content = []
    for i, path in enumerate(image_paths, 1):
        image_data = Path(path).read_bytes()
        base64_image = base64.standard_b64encode(image_data).decode("utf-8")
        # Label each image so the prompt can refer to it by number
        content.append({
            "type": "text",
            "text": f"Image {i}:"
        })
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",  # this example assumes PNG inputs
                "data": base64_image,
            }
        })
    content.append({"type": "text", "text": comparison_prompt})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

# A/B test screenshot comparison
result = compare_images(
    ["./variant-a.png", "./variant-b.png"],
    "Compare these two landing page designs. Which has stronger visual hierarchy and CTA placement? Be specific."
)
```

Working with PDFs and Multi-Page Documents
The Claude API supports PDFs directly through document content blocks, but rendering pages to images yourself gives you control over which pages are sent and at what resolution. The pdf2image library makes this straightforward:
```python
import base64
import io
from typing import Optional

# pdf2image requires the poppler utilities to be installed
from pdf2image import convert_from_path

# Assumes `client` from the earlier examples

def analyze_pdf(pdf_path: str, question: str, pages: Optional[list[int]] = None) -> str:
    """
    Analyze specific pages from a PDF.
    pages: list of 1-indexed page numbers (None = first 5 pages)
    """
    images = convert_from_path(pdf_path, dpi=200)
    if pages:
        # Keep (page_number, image) pairs so labels match the real page numbers;
        # out-of-range page numbers are skipped
        selected = [(p, images[p - 1]) for p in pages if 1 <= p <= len(images)]
    else:
        selected = list(enumerate(images[:5], 1))  # Cap at 5 pages for cost control

    content = []
    for page_number, image in selected:
        # Encode the page as PNG in memory; no temporary file needed
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        base64_image = base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
        content.append({"type": "text", "text": f"Page {page_number}:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": base64_image}
        })
    content.append({"type": "text", "text": question})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

# Analyze a contract
summary = analyze_pdf(
    "./contract.pdf",
    "Summarize the key obligations, payment terms, and termination clauses.",
    pages=[1, 2, 3]
)
```

Prompting Tips That Actually Improve Vision Results
| Technique | What to Do | Why It Works |
|---|---|---|
| Be specific about structure | "Return JSON with fields: vendor, date, total" | Reduces hallucination, improves parsability |
| Reference image regions | "In the upper-right corner of the chart..." | Grounds Claude's attention |
| Provide context | "This is a financial dashboard for a SaaS company" | Claude uses domain knowledge more accurately |
| Ask for confidence | "If any field is unclear, set value to null and add a 'flags' array" | Surfaces uncertain extractions |
| Chain analysis | First ask "what type of document is this?", then ask specific questions | Works better for unknown document types |
| Use system prompts | Set domain context in the system prompt once | Reduces per-request overhead |
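The "ask for confidence" technique only pays off if something downstream reads the nulls and flags. A minimal sketch of that step follows, assuming the reply is a JSON object with an optional "flags" array as described in the table; `needs_human_review` is a hypothetical helper name.

```python
import json


def needs_human_review(extraction_json: str, required_fields: list[str]) -> list[str]:
    """Return the reasons an extraction should be reviewed; empty means none found."""
    data = json.loads(extraction_json)
    # A required field that is absent or null was unclear to the model
    reasons = [f"missing or null: {name}" for name in required_fields if data.get(name) is None]
    # Anything the model put in the "flags" array is passed through as-is
    reasons += [f"flagged: {flag}" for flag in data.get("flags", [])]
    return reasons
```

For example, a reply with "date": null and a flag of "date is smudged" would yield two reasons, and the document can be routed to a person instead of being posted automatically.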
Cost and Performance Optimization
Vision requests cost more than text-only requests because images consume input tokens. Here's how to manage costs in production:
Image token estimation: The token count depends on image dimensions. A 1092×1092 image uses approximately 1,590 input tokens. For a rough estimate, use the formula: tokens ≈ (width × height) / 750.
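The estimate is easy to wrap in a helper for budgeting. This is a sketch of the rough formula above, not an exact tokenizer; it additionally assumes that images with a longest side above 1568px are scaled down to that size before tokens are counted, and `estimate_image_tokens` is a hypothetical name.

```python
def estimate_image_tokens(width: int, height: int, max_dimension: int = 1568) -> int:
    """Rough input-token estimate: (width * height) / 750, after any downscaling."""
    longest = max(width, height)
    if longest > max_dimension:
        # Assumption: oversized images are scaled to fit max_dimension first
        scale = max_dimension / longest
        width, height = round(width * scale), round(height * scale)
    return round(width * height / 750)


print(estimate_image_tokens(1092, 1092))  # 1590
```

Multiplying the result by the number of images in a batch and by your model's input-token price gives a first-order cost estimate.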
Resize before sending:
```python
import io

from PIL import Image

def resize_for_claude(image_path: str, max_dimension: int = 1568) -> bytes:
    """
    Claude's optimal image dimension is ≤1568px on the longest side.
    Larger images don't improve accuracy but cost more tokens.
    """
    with Image.open(image_path) as img:
        if max(img.size) > max_dimension:
            # thumbnail() resizes in place and preserves the aspect ratio
            img.thumbnail((max_dimension, max_dimension), Image.LANCZOS)
        buffer = io.BytesIO()
        # Output is always PNG, so send it with media_type "image/png"
        img.save(buffer, format="PNG", optimize=True)
        return buffer.getvalue()
```

| Task | Recommended Model | Reason |
|---|---|---|
| Invoice/receipt extraction | claude-haiku-4-5 | Simple structured extraction |
| Chart interpretation | claude-sonnet-4-6 | Needs reasoning about trends |
| Legal/medical document review | claude-opus-4-8 | Maximum accuracy required |
| Bulk image classification | claude-haiku-4-5 | Speed + cost at scale |
| Multi-image comparison | claude-sonnet-4-6 | Balance of capability and cost |
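In code, the table above can live in one lookup so model choices are not scattered across call sites. A small sketch, assuming the model names from the table; the task keys and the `pick_model` helper are hypothetical, and the fallback to Sonnet follows the general recommendation earlier in this guide.

```python
# Task keys are illustrative; model names are taken from the table above
MODEL_BY_TASK = {
    "invoice_extraction": "claude-haiku-4-5",
    "chart_interpretation": "claude-sonnet-4-6",
    "legal_medical_review": "claude-opus-4-8",
    "bulk_classification": "claude-haiku-4-5",
    "multi_image_comparison": "claude-sonnet-4-6",
}
DEFAULT_MODEL = "claude-sonnet-4-6"


def pick_model(task: str) -> str:
    """Return the model for a task, falling back to the default for unknown tasks."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```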
Batch processing with async:

```python
import asyncio
import base64
from pathlib import Path

from anthropic import AsyncAnthropic

async def batch_analyze_images(image_paths: list[str], prompt: str, concurrency: int = 5) -> list[str]:
    client = AsyncAnthropic()
    # The semaphore caps how many requests are in flight at once
    semaphore = asyncio.Semaphore(concurrency)

    async def analyze_one(path):
        async with semaphore:
            image_data = Path(path).read_bytes()
            base64_image = base64.standard_b64encode(image_data).decode("utf-8")
            response = await client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=512,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": base64_image}},
                        {"type": "text", "text": prompt}
                    ]
                }]
            )
            return response.content[0].text

    tasks = [analyze_one(path) for path in image_paths]
    # gather() returns results in the same order as image_paths
    return await asyncio.gather(*tasks)

# Placeholder list; replace with your own PNG files
invoice_paths = ["./invoices/001.png", "./invoices/002.png"]

# Process invoices concurrently (max 5 in flight at a time)
results = asyncio.run(
    batch_analyze_images(invoice_paths, "Extract vendor name and total amount as JSON", concurrency=5)
)
```

Key Takeaways
- URL vs base64: Use URL for public images, base64 for local/private files. Both work equally well for accuracy.
- Resize to ≤1568px: Larger images don't improve accuracy but increase cost. For reference, a 1092×1092 image costs roughly 1,590 input tokens.
- Always specify output format: Ask for JSON with a defined schema for extraction tasks. Unstructured output is harder to use in production.
- Match model to task: Haiku for bulk extraction, Sonnet for analysis, Opus for high-stakes documents.
- Convert PDFs to PNG at 200 DPI: That's the sweet spot between accuracy and file size for most document types.
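- Use async for batch jobs: The AsyncAnthropic client, paired with a semaphore that caps concurrency, lets you process dozens of images in parallel while keeping the number of in-flight requests bounded. You still need to stay within your account's rate limits.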
Next Steps
If you're integrating Claude vision into a production pipeline, the next skill to add is structured output with tool use: you define a tool with a JSON schema, and Claude returns the extracted fields as that tool's input, which greatly reduces the manual JSON parsing and error handling on your end.
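A rough sketch of the tool-use pattern is shown below. It uses the Messages API `tools` and `tool_choice` parameters; the tool name `record_invoice`, its schema, and both helper functions are hypothetical and trimmed to a few fields for brevity. The sketch does not validate the returned fields itself, so treating the tool input as untrusted and validating it is still advisable.

```python
INVOICE_TOOL = {
    "name": "record_invoice",  # hypothetical tool name
    "description": "Record structured data extracted from an invoice image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "total"],
    },
}


def build_invoice_request(base64_png: str, model: str = "claude-sonnet-4-6") -> dict:
    """Build keyword arguments for client.messages.create(**kwargs)."""
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": [INVOICE_TOOL],
        # Force the model to answer by calling the tool
        "tool_choice": {"type": "tool", "name": INVOICE_TOOL["name"]},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": base64_png}},
                {"type": "text", "text": "Extract the invoice fields."},
            ],
        }],
    }


def extract_tool_input(response, tool_name: str) -> dict:
    """Return the input dict of the first matching tool_use block in a response."""
    for block in response.content:
        if block.type == "tool_use" and block.name == tool_name:
            return block.input
    raise ValueError(f"no tool_use block named {tool_name!r} in response")
```

With a client in hand, usage would look like `response = client.messages.create(**build_invoice_request(b64))` followed by `extract_tool_input(response, "record_invoice")`, which yields a dict rather than a string to parse.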
Want to put your Claude API knowledge to the test? AI for Anything's Claude Certified Architect practice tests include vision API questions drawn from the official CCA-F exam blueprint. The $19.99 test bank covers all multimodal, tool use, and agent patterns — the exact topics that show up on the certification.