Tutorials · 10 min read

How to Build a Chatbot with Claude API: Complete Tutorial (2026)

Step-by-step tutorial to build a production-ready chatbot using the Anthropic Claude API. Covers multi-turn conversations, streaming, system prompts, and tool use.

Most "build a chatbot" tutorials give you a single-question toy that breaks the moment a real user types anything. This guide skips the shortcuts. You'll build a production-ready chatbot using the Anthropic Claude API — one that handles multi-turn conversations, streams responses token-by-token, respects a custom system prompt, and calls external tools when it needs live data.

By the end you'll have a working Python chatbot you can embed in a web app, Slack bot, or CLI tool — and you'll understand why each piece exists, which is what the Claude Certified Architect exam tests.

What You'll Build

  • A CLI chatbot with persistent conversation memory
  • Streaming output (tokens appear as Claude generates them)
  • A configurable system prompt for persona control
  • One tool integration (live weather via a mock function)
  • Clean error handling for rate limits and API errors

Prerequisites: Python 3.10+, an Anthropic API key, and basic familiarity with pip.

Step 1: Install the Anthropic SDK and Set Up Your Project

pip install anthropic python-dotenv

Create a .env file in your project root:

ANTHROPIC_API_KEY=sk-ant-...

Then create chatbot.py:

import os
from dotenv import load_dotenv
import anthropic

load_dotenv()
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

The Anthropic client is thread-safe and designed to be instantiated once. Don't create a new client per request in production — it re-reads credentials and opens new connections unnecessarily.


Step 2: Send Your First Message

The Claude API is a Messages API, not a completion API. Every call takes a list of messages and returns a response you append to that list. This is the mental model that makes multi-turn conversations trivial.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is the Anthropic Constitution?"}
    ]
)

print(response.content[0].text)

Run it. You should see Claude's answer. Notice what response contains:

print(response.stop_reason)   # "end_turn" or "max_tokens"
print(response.usage)         # input_tokens, output_tokens

Tracking usage per call is how you monitor costs. At claude-sonnet-4-6 pricing (roughly $3/M input, $15/M output), a 1,024-token response costs about $0.015 — trivial in testing, but it adds up at scale.
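
If you want a running total, a small helper turns usage into dollars. The rates below are this article's ballpark figures, hard-coded as an assumption; check current pricing before relying on them:

def estimate_cost(usage, input_rate: float = 3.0, output_rate: float = 15.0) -> float:
    # Rates are assumed $/million tokens; verify against current pricing.
    return (usage.input_tokens * input_rate + usage.output_tokens * output_rate) / 1_000_000

print(f"This call cost about ${estimate_cost(response.usage):.5f}")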


Step 3: Add Multi-Turn Conversation Memory

A chatbot that forgets what you just said isn't a chatbot — it's a fancy search box. Multi-turn memory in the Messages API is explicit: you maintain the conversation list yourself and pass the full history on every call.

def chat(messages: list, user_input: str) -> str:
    """Add user message, call API, append assistant reply, return text."""
    messages.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    )

    assistant_message = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_message})
    return assistant_message


def run_chatbot():
    messages = []
    print("Claude Chatbot — type 'quit' to exit\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("quit", "exit"):
            break
        if not user_input:
            continue

        reply = chat(messages, user_input)
        print(f"\nClaude: {reply}\n")


if __name__ == "__main__":
    run_chatbot()

Run this and have a multi-turn conversation. Claude will remember context because the full messages list grows with each turn.

Token budget warning: The conversation list grows indefinitely. Claude Sonnet 4.6 has a 1M-token context window — generous, but a 10-hour customer support session will eventually hit it. Production chatbots use one of three strategies:

| Strategy | How it works | Best for |
| --- | --- | --- |
| Sliding window | Drop oldest messages when over threshold | Casual chat, support bots |
| Summary compression | Summarize old turns into one system message | Long-running assistants |
| Retrieval | Store turns in vector DB, inject relevant ones | Knowledge-heavy domains |

For this tutorial we'll use a simple sliding window.

MAX_TURNS = 20  # keep last 20 messages (10 user + 10 assistant)

def trim_history(messages: list) -> list:
    if len(messages) > MAX_TURNS:
        return messages[-MAX_TURNS:]
    return messages

Call messages = trim_history(messages) before each API call.
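
If you outgrow the sliding window, summary compression is the next step up. The table above describes folding the summary into a system message; the minimal sketch below keeps it in the message list instead, which avoids rebuilding the system prompt. The threshold, the summarization prompt, and the synthetic acknowledgment turn are illustrative assumptions, not fixed API behavior:

SUMMARIZE_AFTER = 40  # assumed threshold: compress once history exceeds this many messages

def compress_history(messages: list) -> list:
    """Summarize older turns into one message. Assumes complete user/assistant pairs."""
    if len(messages) <= SUMMARIZE_AFTER:
        return messages
    old, recent = messages[:-10], messages[-10:]
    summary = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=old + [{
            "role": "user",
            "content": "Summarize our conversation so far in under 200 words, keeping facts, names, and decisions.",
        }],
    )
    summary_text = summary.content[0].text
    # Re-seed the history; the synthetic assistant turn keeps roles alternating
    # before the recent turns resume.
    return [
        {"role": "user", "content": f"Context from earlier in our conversation: {summary_text}"},
        {"role": "assistant", "content": "Got it, I'll keep that context in mind."},
    ] + recent

Swap it in for trim_history once conversations regularly outlive the window.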


Step 4: Add a System Prompt for Persona Control

The system parameter is the single most powerful knob in the Claude API. It's not a first message — it's a persistent instruction layer that Claude weighs throughout the conversation.

SYSTEM_PROMPT = """You are Aria, a friendly customer support assistant for AI for Anything (aiforanything.io).

Your role:
- Help users understand AI certifications (CCA, AWS AI Practitioner, Google AI)
- Answer questions about practice tests and study guides
- Keep answers concise (under 150 words) unless the user asks for detail
- Never make up pricing — direct pricing questions to the website

Tone: Warm, encouraging, technically accurate. Learners need confidence."""


response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=SYSTEM_PROMPT,   # <-- system param, not in messages list
    messages=messages,
)

Key system prompt rules that matter for production:

  • Role before rules — open with who the assistant is, then constrain behavior
  • Negative instructions work — "never make up pricing" is effective
  • Explicit format instructions — "under 150 words" shapes output length better than vague guidance
  • The system prompt is not secret — a determined user can often extract it. Don't put passwords or confidential logic here
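
One practical habit, a suggestion layered on this tutorial rather than an API requirement: keep the system prompt in a version-controlled text file so you can iterate on persona without code changes. A minimal sketch, assuming a hypothetical prompts/aria_support.txt:

from pathlib import Path

# Hypothetical file layout; adjust the path to your project.
SYSTEM_PROMPT = Path("prompts/aria_support.txt").read_text(encoding="utf-8").strip()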

Step 5: Stream Responses Token-by-Token

Nobody wants to stare at a blank screen for 3 seconds waiting for a 400-word response. Streaming makes your chatbot feel instant.

def chat_stream(messages: list, user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    full_response = ""

    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_response += text

    print()  # newline after streaming ends
    messages.append({"role": "assistant", "content": full_response})
    return full_response

The .text_stream iterator yields string chunks as they arrive. flush=True forces Python to print each chunk immediately rather than buffering. In a web app you'd send these chunks via Server-Sent Events (SSE) — the pattern is identical, just replace print with response.write.
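
Here's a rough sketch of that wiring using FastAPI; the framework choice and endpoint shape are assumptions, not part of this tutorial's stack:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
def chat_endpoint(q: str):
    def event_stream():
        # Single-turn for brevity; wire in your history handling as above.
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user", "content": q}],
        ) as stream:
            for text in stream.text_stream:
                yield f"data: {text}\n\n"  # SSE frame format
    return StreamingResponse(event_stream(), media_type="text/event-stream")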


Step 6: Add Tool Use (Function Calling)

Tool use lets Claude call functions you define — database lookups, API calls, calculations — and weave the results into its response. This is the feature that separates a chatbot from a real AI assistant.

Here's how it works:

  • You define tools (JSON schema describing function + parameters)
  • Claude decides when to call them
  • Your code executes the function
  • You return results to Claude, which generates the final response

import json

# Define the tool schema
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city. Use when the user asks about weather.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'San Francisco'"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit"
                }
            },
            "required": ["city"]
        }
    }
]


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Mock weather function — replace with real API call."""
    return {"city": city, "temperature": 22, "unit": unit, "condition": "Partly cloudy"}


def chat_with_tools(messages: list, user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        tools=tools,
        messages=messages,
    )

    # Check if Claude wants to use a tool
    while response.stop_reason == "tool_use":
        tool_uses = [b for b in response.content if b.type == "tool_use"]

        # Add Claude's tool-calling message to history
        messages.append({"role": "assistant", "content": response.content})

        # Execute each tool call
        tool_results = []
        for tool_use in tool_uses:
            if tool_use.name == "get_weather":
                result = get_weather(**tool_use.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": json.dumps(result),
                })

        # Return results to Claude
        messages.append({"role": "user", "content": tool_results})

        # Get Claude's final response
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )

    # Extract final text response
    final_text = next(b.text for b in response.content if hasattr(b, "text"))
    messages.append({"role": "assistant", "content": final_text})
    return final_text

The while response.stop_reason == "tool_use" loop handles parallel tool calls — Claude can request multiple tools simultaneously, and you handle all of them before calling the API again. Because it's a loop rather than a single if, it also covers chained calls, where Claude's follow-up response requests yet another tool.
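
As a quick sanity check, a weather question against the mock function above should trigger one tool round-trip before the final answer:

messages = []
print(chat_with_tools(messages, "What's the weather in San Francisco right now?"))
# Expected shape: Claude calls get_weather(city="San Francisco"),
# receives the mocked result, then answers in plain text.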


Step 7: Handle Errors Gracefully

Production chatbots fail. Rate limits, network timeouts, invalid API keys — all of them will happen. The Anthropic SDK raises typed exceptions you can catch:

from anthropic import (
    APIConnectionError,
    RateLimitError,
    APIStatusError,
    AuthenticationError,
)
import time

def safe_chat(messages: list, user_input: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            return chat_stream(messages, user_input)

        except RateLimitError:
            # chat_stream already appended the user message; drop it so the
            # retry doesn't add a duplicate turn to the history
            if messages and messages[-1]["role"] == "user":
                messages.pop()
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s
            print(f"Rate limited. Retrying in {wait}s...")
            time.sleep(wait)

        except AuthenticationError:
            raise ValueError("Invalid API key. Check your ANTHROPIC_API_KEY.")

        except APIConnectionError:
            if messages and messages[-1]["role"] == "user":
                messages.pop()
            print("Network error. Check your connection.")
            if attempt == retries - 1:
                raise

        except APIStatusError as e:
            print(f"API error {e.status_code}: {e.message}")
            raise

    raise RuntimeError("Max retries exceeded")

Exponential backoff on RateLimitError is the standard pattern — it's what the Anthropic cookbook recommends and what the CCA exam tests you on.
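
Two refinements worth knowing. First, adding random jitter to the backoff avoids synchronized retries when many clients hit the limit at once; here's a minimal sketch of the standard "full jitter" variant:

import random

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # Sleep a random amount between 0 and the exponential ceiling.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

Second, the Python SDK can retry some failures on its own via the max_retries option on the client constructor; check the SDK docs for what it covers before stacking it with your own loop.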


Complete Chatbot: Putting It All Together

Here's the final chatbot.py with all features integrated:

import os, json, time
from dotenv import load_dotenv
import anthropic
from anthropic import RateLimitError, AuthenticationError, APIConnectionError

load_dotenv()
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

SYSTEM_PROMPT = """You are Aria, a helpful AI assistant. Be concise, accurate, and friendly."""

MAX_TURNS = 20

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
]

def get_weather(city: str) -> dict:
    return {"city": city, "temperature": 22, "condition": "Sunny"}  # replace with real API

def trim_history(messages: list) -> None:
    # Trim in place: returning a slice here would silently orphan the
    # caller's list, and new turns would be appended to a throwaway copy.
    if len(messages) > MAX_TURNS:
        del messages[:-MAX_TURNS]

def chat(messages: list, user_input: str) -> str:
    trim_history(messages)
    messages.append({"role": "user", "content": user_input})

    for attempt in range(3):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system=SYSTEM_PROMPT,
                tools=tools,
                messages=messages,
            )
            break
        except RateLimitError:
            time.sleep(2 ** attempt)
    else:
        messages.pop()  # drop the unanswered user message so history stays consistent
        return "Sorry, I'm temporarily unavailable. Please try again."

    # Handle tool use
    while response.stop_reason == "tool_use":
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for t in tool_uses:
            if t.name == "get_weather":
                results.append({
                    "type": "tool_result",
                    "tool_use_id": t.id,
                    "content": json.dumps(get_weather(**t.input))
                })
        messages.append({"role": "user", "content": results})
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )

    # Print the final response (swap in chat_stream from Step 5 to stream it)
    full_response = next(b.text for b in response.content if hasattr(b, "text"))
    print(f"\nAria: {full_response}\n")
    messages.append({"role": "assistant", "content": full_response})
    return full_response


def main():
    messages = []
    print("Aria Chatbot — type 'quit' to exit\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("quit", "exit"):
            break
        if user_input:
            chat(messages, user_input)

if __name__ == "__main__":
    main()


Choosing the Right Claude Model

| Model | Best for | Approx. cost |
| --- | --- | --- |
| claude-haiku-4-5 | High-volume, simple Q&A, classification | Lowest |
| claude-sonnet-4-6 | Most chatbots, balanced quality/cost | Mid |
| claude-opus-4-6 | Complex reasoning, document analysis | Highest |

For most customer-facing chatbots, start with Sonnet. Downgrade to Haiku for FAQ bots that handle thousands of requests per day. Use Opus only when the task genuinely requires deep reasoning — the cost difference is roughly 5x.
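
If your bot handles mixed traffic, you can encode that guidance as a simple router. In this minimal sketch, the tier table and the classification heuristic are placeholders for your own logic:

MODEL_BY_TIER = {
    "faq": "claude-haiku-4-5",
    "chat": "claude-sonnet-4-6",
    "analysis": "claude-opus-4-6",
}

def pick_model(user_input: str) -> str:
    # Toy heuristic: long or document-heavy requests get the bigger model.
    if len(user_input) > 2000 or "analyze" in user_input.lower():
        return MODEL_BY_TIER["analysis"]
    if len(user_input) < 80:
        return MODEL_BY_TIER["faq"]
    return MODEL_BY_TIER["chat"]

Pass the result as the model argument in client.messages.create.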


Key Takeaways

  • The Messages API is stateless — you own the conversation history and pass it on every call
  • The system parameter controls persona and constraints — it's separate from the messages list
  • Streaming requires minimal code changes: client.messages.stream() instead of client.messages.create()
  • Tool use follows a request→execute→return loop; the while stop_reason == "tool_use" pattern handles parallel calls
  • Always implement exponential backoff for RateLimitError in production

Go Deeper: Claude Certified Architect

Building chatbots with the Claude API is one of the core competencies tested in the Claude Certified Architect (CCA-F) exam. The exam covers:

  • Messages API design patterns and multi-turn architecture
  • Prompt engineering and system prompt design
  • Tool use schemas and agentic patterns
  • Context window management and token optimization
  • Safety best practices and constitutional AI

AI for Anything offers the most comprehensive CCA practice test bank available — 200+ questions organized by exam domain, with detailed explanations for every answer. Whether you're studying for the cert or building production AI apps, understanding these patterns deeply is what separates a developer who uses Claude from one who can architect with it.

Start your CCA prep →

Ready to Start Practicing?

300+ scenario-based practice questions covering all 5 CCA domains. Detailed explanations for every answer.

Free CCA Study Kit

Get domain cheat sheets, anti-pattern flashcards, and weekly exam tips. No spam, unsubscribe anytime.