API Primer — Integrating Claude into Your Application
With Anthropic's Claude API, you can embed Claude's AI capabilities into your own applications and services. It opens up possibilities beyond what a chat UI can offer, including automating backend processing, building custom interfaces, and batch-processing large volumes of text.
This article covers everything from obtaining an API key to sending requests, Function Calling, streaming, and cost optimization, with practical examples in Python and TypeScript.
API Overview
Endpoint Structure
The base URL for the Claude API is https://api.anthropic.com/v1. The main endpoints are as follows.
| Endpoint | Purpose |
|---|---|
| POST /v1/messages | Text generation (the most fundamental API) |
| POST /v1/messages/batches | Batch processing (asynchronous handling of large volumes of requests) |
| POST /v1/messages/count_tokens | Count tokens in advance |
Getting an API Key
Visit the Anthropic Console and create an account. You can generate a new key in the "API Keys" section.
Keep your API key strictly confidential. If your API key is leaked, third parties can send requests on your account and incur charges. Follow these rules:
- Do not hardcode the API key in your source code
- Do not commit it to a Git repository (especially a public one)
- Add `.env` files to `.gitignore`
- In production, use environment variables or a secret manager (AWS Secrets Manager, Google Secret Manager, etc.)
Rate Limits
The Claude API has rate limits on the number of requests (RPM) and tokens (TPM). Limits vary depending on your account usage and tier. When a limit is reached, the API returns 429 Too Many Requests.
Common pitfall: If you see frequent `429` errors in a production app, add retry logic with exponential backoff (gradually increasing the retry interval). This behavior is built into Anthropic's official SDKs.
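As a sketch of what exponential backoff looks like (the official SDKs already retry rate-limited requests automatically, configurable via their `max_retries` option, so you rarely need to hand-roll this; `with_backoff` below is a hypothetical helper, not part of the SDK):

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retryable=(Exception,)):
    """Call fn(), retrying with exponential backoff plus jitter on retryable errors."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Sleep 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

You would then wrap a call as `with_backoff(lambda: client.messages.create(...), retryable=(anthropic.RateLimitError,))`.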
Quickstart
Installing the SDK
First, install Anthropic's official SDK.
Python:
pip install anthropic
TypeScript / Node.js:
npm install @anthropic-ai/sdk
Setting the Environment Variable
Set your API key as an environment variable.
export ANTHROPIC_API_KEY="sk-ant-..."
During local development, it is convenient to keep the key in a `.env` file (added to `.gitignore`); in production, prefer environment variables or a secret manager.
# .env
ANTHROPIC_API_KEY=sk-ant-...
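If you would rather load the `.env` file from code than `export` the variable, the third-party `python-dotenv` package is the usual choice. As a dependency-free sketch, a minimal loader might look like this (`load_env_file` is illustrative and ignores the quoting rules real `.env` parsers handle):

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments, no quoting rules."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Do not override variables already set in the real environment
            os.environ.setdefault(key.strip(), value.strip())
```

After calling `load_env_file()`, `anthropic.Anthropic()` picks up `ANTHROPIC_API_KEY` from the environment as usual.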
Your First Request
Python:
import anthropic
client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello, Claude!"}
    ]
)
print(message.content[0].text)
TypeScript:
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const message = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello, Claude!" }],
});
console.log(message.content[0].text);
cURL (handy for verifying API behavior):
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Hello, Claude!"}
    ]
  }'
The response comes back as JSON in the following format.
{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Hello! How can I assist you today?"
    }
  ],
  "model": "claude-opus-4-6",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 10,
    "output_tokens": 12
  }
}
Messages API Basics
Request Parameters
The main parameters for the /v1/messages endpoint are as follows.
| Parameter | Required | Description |
|---|---|---|
| model | Yes | The model ID to use (e.g., claude-opus-4-6) |
| max_tokens | Yes | Maximum number of tokens to generate |
| messages | Yes | Array of conversation history (role and content pairs) |
| system | No | System prompt (sets the model's role and constraints) |
| temperature | No | Randomness of output (0 to 1, default 1) |
| top_p | No | Nucleus sampling threshold |
| stop_sequences | No | Array of sequences that stop generation |
| stream | No | Whether to enable streaming (true/false) |
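To see how the optional parameters fit together, here is one illustrative combination (the specific values are arbitrary, chosen only to demonstrate the shape of a request):

```python
# Illustrative parameter combination; pass to client.messages.create(**request_params)
request_params = {
    "model": "claude-opus-4-6",
    "max_tokens": 512,
    "system": "You are a terse assistant. Answer in one sentence.",
    "temperature": 0.3,           # mostly deterministic, with a little variety
    "stop_sequences": ["\n\n"],   # stop at the first blank line
    "messages": [{"role": "user", "content": "What is an API key?"}],
}
```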
System Prompt
Using a system prompt allows you to set a role and constraints for the model.
Python:
message = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system="You are a customer support agent. Reply politely and concisely.",
    messages=[
        {"role": "user", "content": "What is your return policy?"}
    ]
)
TypeScript:
const message = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 1024,
  system: "You are a customer support agent. Reply politely and concisely.",
  messages: [{ role: "user", content: "What is your return policy?" }],
});
Multi-Turn Conversations
Multi-turn conversations are achieved by accumulating conversation history in the messages array.
Python:
conversation_history = []
def chat(user_message: str) -> str:
    conversation_history.append({
        "role": "user",
        "content": user_message
    })
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=conversation_history
    )
    assistant_message = response.content[0].text
    conversation_history.append({
        "role": "assistant",
        "content": assistant_message
    })
    return assistant_message
# Usage example
print(chat("Please introduce yourself."))
print(chat("What are you good at?"))
TypeScript:
const conversationHistory: { role: "user" | "assistant"; content: string }[] = [];

async function chat(userMessage: string): Promise<string> {
  conversationHistory.push({ role: "user", content: userMessage });
  const response = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    messages: conversationHistory,
  });
  const assistantMessage = (response.content[0] as { text: string }).text;
  conversationHistory.push({ role: "assistant", content: assistantMessage });
  return assistantMessage;
}

// Usage example
console.log(await chat("Please introduce yourself."));
console.log(await chat("What are you good at?"));
Common pitfall: Continuously accumulating conversation history will eventually exceed the context window limit of a given request (which varies by model — Claude Opus 4.6 / Sonnet 4.6 support up to 1 million tokens, while Haiku 4.5 supports up to 200K tokens). For long-running conversations, you need a design that removes old messages or summarizes them before passing them to the next request.
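As a minimal sketch of the "remove old messages" approach (summarization gives better recall but costs an extra model call; `trim_history` is a hypothetical helper, not part of the SDK):

```python
def trim_history(history, max_messages=20):
    """Keep at most max_messages of the most recent messages.

    Drops older messages from the front, then ensures the remaining list
    still opens with a "user" turn, as the Messages API expects.
    """
    if len(history) <= max_messages:
        return history
    trimmed = history[-max_messages:]
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed
```

Call it right before each request, e.g. `messages=trim_history(conversation_history)`.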
When to Use temperature
temperature controls the diversity of the output.
| temperature | Suitable use cases |
|---|---|
| 0.0 | Analysis, classification, fact-checking (when deterministic output is needed) |
| 0.3 to 0.7 | Chatbots, Q&A (balanced output) |
| 0.8 to 1.0 | Creative writing, brainstorming (when varied output is desired) |
Function Calling (Tool Use)
Function Calling (Tool Use) is a feature that gives Claude the ability to call external functions or tools. For example, you can integrate processing that Claude cannot do on its own, such as retrieving current weather data, searching a database, or running calculations.
Execution Flow
- The app sends a request containing "tool definitions"
- When Claude determines it needs to call a tool, it returns a `tool_use` block
- The app executes the actual function and returns the result to Claude
- Claude uses the result to generate the final answer
Writing Tool Definitions
Python:
tools = [
    {
        "name": "get_weather",
        "description": "Retrieves the current weather information for a specified city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to check the weather for (e.g., Tokyo, London)"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The unit for temperature"
                }
            },
            "required": ["city"]
        }
    }
]

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "What is the weather like in Tokyo today?"}
    ]
)
# Handling a tool call response
if response.stop_reason == "tool_use":
    tool_use_block = next(
        block for block in response.content if block.type == "tool_use"
    )
    tool_name = tool_use_block.name
    tool_input = tool_use_block.input
    tool_use_id = tool_use_block.id

    # Actual tool processing (using a mock response here)
    if tool_name == "get_weather":
        tool_result = {
            "temperature": 22,
            "condition": "Sunny",
            "humidity": 60
        }

    # Return the tool execution result to Claude
    final_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        tools=tools,
        messages=[
            {"role": "user", "content": "What is the weather like in Tokyo today?"},
            {"role": "assistant", "content": response.content},
            {
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": tool_use_id,
                        "content": str(tool_result)
                    }
                ]
            }
        ]
    )
    print(final_response.content[0].text)
TypeScript:
const tools: Anthropic.Messages.Tool[] = [
  {
    name: "get_weather",
    description: "Retrieves the current weather information for a specified city.",
    input_schema: {
      type: "object",
      properties: {
        city: {
          type: "string",
          description: "The name of the city to check the weather for (e.g., Tokyo, London)",
        },
        unit: {
          type: "string",
          enum: ["celsius", "fahrenheit"],
          description: "The unit for temperature",
        },
      },
      required: ["city"],
    },
  },
];

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 1024,
  tools,
  messages: [{ role: "user", content: "What is the weather like in Tokyo today?" }],
});

if (response.stop_reason === "tool_use") {
  const toolUseBlock = response.content.find(
    (block): block is Anthropic.Messages.ToolUseBlock =>
      block.type === "tool_use"
  )!;
  // Actual tool processing (using a mock response here)
  const toolResult = { temperature: 22, condition: "Sunny", humidity: 60 };
  const finalResponse = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    tools,
    messages: [
      { role: "user", content: "What is the weather like in Tokyo today?" },
      { role: "assistant", content: response.content },
      {
        role: "user",
        content: [
          {
            type: "tool_result",
            tool_use_id: toolUseBlock.id,
            content: JSON.stringify(toolResult),
          },
        ],
      },
    ],
  });
  console.log(finalResponse.content[0]);
}
Common pitfall: If `stop_reason` is neither `"tool_use"` nor `"end_turn"` (e.g., `"max_tokens"`), generation was cut off without a `tool_use` block being returned. Increase `max_tokens` or add logic to determine in advance whether a tool call is needed.
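A defensive pattern is to branch on every `stop_reason` value you might receive instead of assuming `tool_use` or `end_turn`. The mapping below is this article's suggestion for structuring that branch, not SDK behavior:

```python
def next_action(stop_reason):
    """Map a Messages API stop_reason to the app's next step."""
    if stop_reason == "tool_use":
        return "run_tool"      # execute the tool, then send back a tool_result
    if stop_reason in ("end_turn", "stop_sequence"):
        return "done"          # the model considers its answer complete
    if stop_reason == "max_tokens":
        return "retry_larger"  # generation was cut off: raise max_tokens and resend
    return "inspect"           # unknown value: log it and investigate
```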
Streaming
With streaming, you can receive tokens one by one as the response is generated, before it is complete. In chat apps or long-form generation, this allows you to display results to users in real time without having them wait.
The Claude API supports streaming in SSE (Server-Sent Events) format.
Python:
with client.messages.stream(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Please explain quantum computing in detail."}
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()  # Final newline

    # Get the final message after the stream completes
    final_message = stream.get_final_message()
    print(f"\nToken usage: {final_message.usage}")
TypeScript:
const stream = await client.messages.stream({
  model: "claude-opus-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: "Please explain quantum computing in detail.",
    },
  ],
});

for await (const chunk of stream) {
  if (
    chunk.type === "content_block_delta" &&
    chunk.delta.type === "text_delta"
  ) {
    process.stdout.write(chunk.delta.text);
  }
}

// Get the final message after the stream completes
const finalMessage = await stream.getFinalMessage();
console.log("\nToken usage:", finalMessage.usage);
Server-Sent Events Implementation in Next.js
TypeScript (Next.js App Router):
// app/api/chat/route.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function POST(req: Request) {
  const { messages } = await req.json();
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      const anthropicStream = await client.messages.stream({
        model: "claude-opus-4-6",
        max_tokens: 1024,
        messages,
      });
      for await (const chunk of anthropicStream) {
        if (
          chunk.type === "content_block_delta" &&
          chunk.delta.type === "text_delta"
        ) {
          controller.enqueue(encoder.encode(chunk.delta.text));
        }
      }
      controller.close();
    },
  });
  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
Extended Thinking
Extended Thinking is a feature that allows Claude to internally develop a thinking process before producing the final answer. It is particularly effective for tasks requiring high accuracy, such as complex math problems, multi-step reasoning, and detailed analysis.
Enable it by specifying a thinking parameter with budget_tokens (the maximum number of tokens Claude can use for thinking).
Python:
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Number of tokens available for the thinking process
    },
    messages=[
        {
            "role": "user",
            "content": "Prove that there are infinitely many prime numbers."
        }
    ]
)
# The response contains both a thinking block and a text block
for block in response.content:
    if block.type == "thinking":
        print("=== Thinking Process ===")
        print(block.thinking)
    elif block.type == "text":
        print("=== Final Answer ===")
        print(block.text)
TypeScript:
const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 16000,
  thinking: {
    type: "enabled",
    budget_tokens: 10000,
  },
  messages: [
    {
      role: "user",
      content: "Prove that there are infinitely many prime numbers.",
    },
  ],
});

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log("=== Thinking Process ===");
    console.log(block.thinking);
  } else if (block.type === "text") {
    console.log("=== Final Answer ===");
    console.log(block.text);
  }
}
Extended Thinking is only available on supported models such as claude-opus-4-6. Increasing budget_tokens improves reasoning accuracy but also increases cost. Set an appropriate value based on your use case.
Batch Processing
When processing large volumes of requests, the Message Batches API is the efficient choice. It offers up to 50% cost savings compared to the standard API and processes requests asynchronously in the background.
Submitting a Batch Request
Python:
# Creating a batch job
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "request-1",
            "params": {
                "model": "claude-opus-4-6",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Tell me about Tokyo."}
                ]
            }
        },
        {
            "custom_id": "request-2",
            "params": {
                "model": "claude-opus-4-6",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Tell me about London."}
                ]
            }
        }
    ]
)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")
TypeScript:
const batch = await client.messages.batches.create({
  requests: [
    {
      custom_id: "request-1",
      params: {
        model: "claude-opus-4-6",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Tell me about Tokyo." }],
      },
    },
    {
      custom_id: "request-2",
      params: {
        model: "claude-opus-4-6",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Tell me about London." }],
      },
    },
  ],
});
console.log(`Batch ID: ${batch.id}`);
console.log(`Status: ${batch.processing_status}`);
Retrieving Batch Results
Batches can take anywhere from a few minutes to several hours to complete. Poll for the status to check progress.
Python:
import time

# Wait for batch completion (polling)
while True:
    batch = client.messages.batches.retrieve(batch.id)
    if batch.processing_status == "ended":
        break
    print(f"Processing... ({batch.request_counts.processing} remaining)")
    time.sleep(60)  # Wait 1 minute and check again

# Retrieve results
for result in client.messages.batches.results(batch.id):
    print(f"ID: {result.custom_id}")
    if result.result.type == "succeeded":
        print(result.result.message.content[0].text)
    else:
        print(f"Error: {result.result.error}")
Common pitfall: Batch processing completes within 24 hours; beyond that it times out. If you have a large number of items to process, consider splitting them into multiple smaller batches. The maximum number of requests per batch is 10,000.
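Splitting can be as simple as slicing the request list before submission (`chunk_requests` is a hypothetical helper, not part of the SDK):

```python
def chunk_requests(requests, chunk_size=10_000):
    """Yield successive slices of at most chunk_size requests."""
    for start in range(0, len(requests), chunk_size):
        yield requests[start:start + chunk_size]
```

Then submit each slice separately: `for chunk in chunk_requests(all_requests): client.messages.batches.create(requests=chunk)`.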
Cost Optimization
Pricing Overview
Claude's API pricing is usage-based, charged per input and output token. The following is a rough guide as of March 2026. Always check the Anthropic official pricing page for the latest rates.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | Fastest and lowest cost. Best for simple tasks |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Excellent balance of performance and cost |
| Claude Opus 4.6 | $5.00 | $25.00 | Highest performance. Best for complex reasoning and analysis |
The prices above are estimates. Anthropic may change its pricing. Check actual rates at https://www.anthropic.com/pricing.
Reducing Costs with Prompt Caching
Prompt Caching is a feature that caches long system prompts or context that you use repeatedly. When the cache is hit, the cost for input tokens is significantly reduced (approximately 90% off).
It is effective in cases where the same content is sent every time, such as long system prompts, referencing large volumes of documents, or repeating few-shot examples.
Python:
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant with deep expertise in law.\n\n" + very_long_legal_document,  # Long reference document
            "cache_control": {"type": "ephemeral"}  # Enable caching
        }
    ],
    messages=[
        {"role": "user", "content": "Please explain the interpretation of Article 3."}
    ]
)

# Check cache usage
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
Model Selection Guidelines
| Use Case | Recommended Model | Reason |
|---|---|---|
| Classification, routing, simple Q&A | Claude Haiku 4.5 | Prioritizes speed and cost |
| Chatbots, text generation, code completion | Claude Sonnet 4.6 | Best overall balance |
| Complex reasoning, analysis, research | Claude Opus 4.6 | Accuracy is the top priority |
| Batch processing (large volumes of simple tasks) | Claude Haiku 4.5 | Low-cost, high-volume processing |
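The table above can be encoded as a small routing function. The Haiku and Sonnet model ID strings below are assumptions following the naming pattern of `claude-opus-4-6` used throughout this article; check the official model list for the exact IDs:

```python
# Task-to-model mapping derived from the guidelines table (model IDs are assumptions)
MODEL_BY_TASK = {
    "classification": "claude-haiku-4-5",
    "routing": "claude-haiku-4-5",
    "chat": "claude-sonnet-4-6",
    "codegen": "claude-sonnet-4-6",
    "analysis": "claude-opus-4-6",
    "research": "claude-opus-4-6",
}

def pick_model(task: str) -> str:
    # Fall back to the balanced mid-tier model for unrecognized task types
    return MODEL_BY_TASK.get(task, "claude-sonnet-4-6")
```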
Checking Token Counts in Advance
Before sending a request, you can use /v1/messages/count_tokens to check the token count.
Python:
token_count = client.messages.count_tokens(
    model="claude-opus-4-6",
    messages=[
        {"role": "user", "content": "How many tokens does this request use?"}
    ]
)
print(f"Input token count: {token_count.input_tokens}")
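Combined with the per-token prices from the earlier table, count_tokens gives a quick pre-flight cost ceiling. The figures below are this article's illustrative March 2026 prices (and the non-Opus model IDs follow this article's naming), so verify both against the official documentation; `estimate_cost_usd` is a hypothetical helper:

```python
# USD per 1M tokens (illustrative figures; check the official pricing page)
PRICES = {
    "claude-opus-4-6": {"input": 5.00, "output": 25.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
}

def estimate_cost_usd(model, input_tokens, max_output_tokens):
    """Upper bound on request cost; actual output is usually shorter than max_tokens."""
    p = PRICES[model]
    return (input_tokens * p["input"] + max_output_tokens * p["output"]) / 1_000_000
```

For example, `estimate_cost_usd("claude-opus-4-6", token_count.input_tokens, 1024)` bounds the cost of the request above.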
Next Steps
Now that you understand the basics of the Claude API, expand your usage further.
- MCP Guide — How to connect external tools to Claude using the Model Context Protocol
- Anthropic Official API Reference — Detailed specifications for all parameters
- Anthropic Cookbook (GitHub) — Practical recipes and sample code
- Plan Comparison — Differences between the API and a claude.ai subscription