Calling AI APIs — Core Concepts
- Describe the five components of the AI API call model and what each one does
- Explain why the system prompt is load-bearing code that must be versioned and tested
- Use provider-native structured output instead of relying on text-based JSON instructions
- Set temperature correctly for classification and extraction tasks versus creative tasks
The Same Model Underneath
Despite different documentation styles and client libraries, the three major AI APIs — Anthropic's Claude, OpenAI, and Google's Gemini — share the same conceptual structure. Once you understand the model, switching between providers is mostly a matter of translating function names. Learning one well teaches you all three.
The core model has five components: the messages array, the system instruction, tool definitions, sampling parameters, and the response. We will go through each one, because each carries a decision or two that significantly affects your output quality.
The Messages Array
Every API call includes a messages array — an ordered list of turns in a conversation, alternating between user and assistant. In all three APIs:
- Each message has a role (`user` or `assistant`) and content (text, or a mix of text and other content types)
- The conversation history is stateless — you send the full history with every request. The API does not remember previous calls.
- You are responsible for managing the message history in your application code
For a single-turn request (the most common case), the messages array contains exactly one message: the user's input. For a multi-turn conversation, you append each user message and each assistant response to the array and send the updated history each time.
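As a concrete illustration, here is a minimal history-management sketch using the Anthropic Python SDK; the model name is an assumption, and the other two providers follow the same append-and-resend pattern.

```python
# A minimal sketch of client-side history management with the Anthropic
# Python SDK; the model name is an assumption.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
messages = []                   # the application owns the history

def ask(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name
        max_tokens=1024,
        messages=messages,          # the full history goes out on every call
    )
    assistant_text = response.content[0].text
    # Append the assistant turn so the next call sees the whole conversation
    messages.append({"role": "assistant", "content": assistant_text})
    return assistant_text
```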
The System Prompt: Load-Bearing Code
The system instruction (called `system` in the Claude API and `systemInstruction` in Gemini, where it is a top-level parameter rather than a message role; in the OpenAI API it is a message with the `system` role) defines the AI's behaviour for the entire conversation. In a chat interface, a system prompt is a suggestion. In a production application, the system prompt is load-bearing code.
It defines the complete behavioural contract: what format to return, what tone to use, what the model must refuse, what domain rules to apply, and what schema to follow. Treat it like a function signature — it needs to be tested, versioned, and reviewed when you change it. A team that changes the system prompt without running it through a test suite has changed their application's behaviour without testing that change.
Concretely: if you are building a support ticket classifier, your system prompt specifies the valid categories, the output format, what to do with ambiguous tickets, and what to do if the input is not a support ticket at all. This is business logic. Write it carefully and version it.
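A sketch of what that can look like in code: the prompt is a versioned constant in source control, and the categories and rules below are illustrative.

```python
# A sketch of the system prompt as versioned business logic. The constant
# lives in source control; changing it goes through review and the test
# suite like any other code. Categories and rules are illustrative.
TICKET_CLASSIFIER_PROMPT_V3 = """\
You classify customer support tickets.

Valid categories: billing, technical, account, other.
Return only the category name in lowercase, nothing else.
If a ticket plausibly fits two categories, pick the more specific one.
If the input is not a support ticket at all, return "other".
"""
```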
Structured Output: Use the Native Methods
One of the most common mistakes in early AI feature development is asking the model to "respond in JSON" in the system prompt and then parsing the response as JSON. This works most of the time — and fails 8–15% of the time when the model adds a commentary sentence, wraps the JSON in a code block, or produces malformed JSON for long outputs.
All three major APIs now provide native structured output that uses constrained decoding — the model is literally constrained to produce only tokens that form valid JSON matching your schema. The failure rate drops to near zero. The methods:
- Claude: Define a tool with an `input_schema` and use `tool_use` — the model's response will populate the schema fields
- OpenAI: Use `response_format` with `type: "json_schema"` and `strict: true`
- Gemini: Use `responseMimeType: "application/json"` with an optional `responseSchema`
Use these native methods for any feature where you need to parse the AI's response. Never rely on "please respond in JSON" in the system prompt for production code.
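As a sketch of the Claude path, a forced tool call yields schema-valid JSON; the tool name, schema fields, and model name below are illustrative assumptions.

```python
# A sketch of Claude's tool-based structured output: a forced tool call
# whose input_schema constrains decoding. Tool name, schema fields, and
# model name are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()
ticket_text = "I was charged twice for my subscription this month."

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model name
    max_tokens=256,
    system="You classify customer support tickets.",
    tools=[{
        "name": "record_classification",
        "description": "Record the category for a support ticket.",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "technical", "account", "other"],
                },
                "confidence": {"type": "string", "enum": ["high", "low"]},
            },
            "required": ["category", "confidence"],
        },
    }],
    # Forcing the tool means the response arrives as schema-valid JSON
    tool_choice={"type": "tool", "name": "record_classification"},
    messages=[{"role": "user", "content": ticket_text}],
)

result = response.content[0].input  # a dict matching the schema
print(result["category"], result["confidence"])
```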
Temperature: The Most Misunderstood Parameter
Temperature controls how random the model's token selection is. At temperature 0, the model is nearly deterministic — it picks the most likely next token every time. At higher temperatures, it selects from a broader distribution, producing more varied and sometimes more creative outputs.
For production features, the rule of thumb is:
- Temperature 0–0.3: Classification, entity extraction, data transformation, anything where you want consistent and predictable output. Run the same input twice and get the same output.
- Temperature 0.7–1.0: Creative writing, brainstorming, generating variations, anything where you want diverse output. Run the same input twice and get different output.
Developers often leave temperature at the API default when building their first features. For classification and extraction tasks, explicitly setting `temperature: 0` is one of the quickest ways to make output more reliable.
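A minimal sketch of pinning temperature for an extraction call, again with the Anthropic SDK and an assumed model name:

```python
# A sketch of pinning temperature for an extraction task; model name assumed.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model name
    max_tokens=128,
    temperature=0,  # pick the most likely token every time
    system="Extract the invoice number from the message. Return only the number.",
    messages=[{"role": "user", "content": "Hi, invoice INV-20391 looks wrong."}],
)
print(response.content[0].text)  # e.g. "INV-20391", should be stable across runs
```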
Few-Shot Examples: In-Prompt Training
The fastest way to improve AI output consistency without fine-tuning is to embed 2–5 representative input/output examples directly in the system prompt. This is called few-shot prompting. The model uses your examples as a template for the format, tone, and handling of edge cases you demonstrate.
For a support ticket classifier, your examples might show: a billing ticket correctly classified as "billing", a technical issue classified as "technical", and an ambiguous message classified as "other" with a confidence field set to "low". The model learns from these examples what your edge case handling looks like — no fine-tuning, no separate training step, just examples in the prompt. Few-shot examples are the most underused tool for improving AI output quality in production.
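A sketch of what those embedded examples can look like; the categories, tickets, and labels are illustrative.

```python
# A sketch of a classifier system prompt with few-shot examples embedded;
# categories, tickets, and labels are illustrative.
FEW_SHOT_CLASSIFIER_PROMPT = """\
You classify customer support tickets into: billing, technical, account, other.
Return JSON with "category" and "confidence" ("high" or "low") fields.

Examples:

Ticket: "I was charged twice for my subscription this month."
Output: {"category": "billing", "confidence": "high"}

Ticket: "The export button throws a 500 error on large files."
Output: {"category": "technical", "confidence": "high"}

Ticket: "hey quick question about the thing from before"
Output: {"category": "other", "confidence": "low"}
"""
```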
- The messages array, system instruction, tools, and sampling parameters are the same conceptual model across Claude, OpenAI, and Gemini
- The system prompt is load-bearing code — change it and test it the way you would change any business logic
- Native structured output (Claude tool use, OpenAI strict JSON mode, Gemini responseSchema) all but eliminates the 8–15% JSON parse failure rate
- Temperature 0–0.3 for production classification and extraction; higher for creative generation where variation is acceptable
- 2–5 few-shot examples in the system prompt are the fastest way to improve output consistency without fine-tuning