Behind the Prompt Curtain

Ever wondered what actually happens when you type a prompt into an AI chat and hit send? Not the philosophy of it, not the neural network magic, but the actual nuts and bolts of what travels over the wire?

It is surprisingly simple. And once you see it, a few things click into place that the demos and the marketing materials never quite explain.

The AI has no memory. None.

This is the thing that surprises people the most. Every single AI model you interact with, whether that is ChatGPT, Claude, Gemini, or any of the others, is completely stateless. When you send a message, it has no recollection of anything you have ever said to it before. Not last session, not five seconds ago. Nothing.

What looks like a conversation is actually an illusion your chat client is maintaining.

Every time you send a message, your application bundles up the entire conversation history (every message you sent, every response the model gave back, and your original instructions) and ships the whole thing to the AI in one request. The model reads it all from scratch, generates the next response, and forgets everything the moment it is done.

There is no persistent state on the server side. It all lives on your side.

That one fact explains a lot. It explains why AI assistants get confused if you start a new chat window. It explains why long conversations start to drift or degrade. It explains the concept of a context window, that hard ceiling on how much you can send in one go.

What is actually in that request?

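The API request to most modern AI models follows what is now a de facto standard, the chat completions format originally defined by OpenAI. Some vendors' native APIs differ in the details, but the shape is the same. Here is the raw shape of it, sent as a plain HTTP POST: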

			
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.OPENROUTER_API_KEY}`
  },
  body: JSON.stringify({
    model: "anthropic/claude-sonnet-4.6",
    messages: [
      {
        role: "system",
        content: "You are an assistant who specialises in cloud architecture"
      },
      {
        role: "user",
        content: "What is a load balancer?"
      }
    ]
  })
});
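// Fail loudly on HTTP errors instead of reading an error body as a reply
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}
const data = await response.json();
console.log(data.choices[0].message.content);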

		

That is it: an array of messages, each with a role and some content. The model string tells the service which AI to route to.

The roles matter:

system: your instructions to the AI. This is where you define its persona, its constraints, what it should and should not do. It sits at the top of every single request.
user: the human turn. What you said.
assistant: the AI’s previous responses. These are the ones you add yourself on the next turn to simulate memory.

Here is how a multi-turn chat session actually works in practice. You maintain the history yourself and grow it with each exchange:

			
import OpenAI from "openai";
const client = new OpenAI({
  apiKey: process.env.OPENROUTER_API_KEY,
  baseURL: "https://openrouter.ai/api/v1"
});
const systemPrompt = {
  role: "system",
  content: "You are an assistant who specialises in cloud architecture."
};
// History starts empty; the system prompt is prepended on every request
const conversationHistory = [];
async function chat(userMessage) {
  // Add the new user message
  conversationHistory.push({
    role: "user",
    content: userMessage
  });
  const response = await client.chat.completions.create({
    model: "anthropic/claude-sonnet-4.6",
    messages: [systemPrompt, ...conversationHistory]
  });
  const assistantMessage = response.choices[0].message;
  // Add the AI's response to history so next turn includes it
  conversationHistory.push(assistantMessage);
  return assistantMessage.content;
}
// Turn 1
console.log(await chat("What is a load balancer?"));
// Turn 2 - the full history goes up with this one
console.log(await chat("How does that differ from an API gateway?"));
// Turn 3 - and again, growing each time
console.log(await chat("Which should I use for a microservices architecture?"));

		

Notice what is happening. On turn three, the model is receiving the system prompt, both previous user messages, both previous AI responses, and the new question. All of it. Every time.

That array is your memory. You own and manage it.
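To make that concrete, here is a minimal sketch of what goes up on turn three. The assistant replies are placeholder strings rather than real model output, and `buildRequestMessages` is a hypothetical helper name, not part of any SDK.

```javascript
// Hypothetical helper: assemble exactly what is sent on a given turn.
function buildRequestMessages(systemPrompt, history, newUserMessage) {
  return [
    systemPrompt,
    ...history,
    { role: "user", content: newUserMessage }
  ];
}

const systemPrompt = {
  role: "system",
  content: "You are an assistant who specialises in cloud architecture."
};

// Two completed exchanges. The assistant content is placeholder text.
const history = [
  { role: "user", content: "What is a load balancer?" },
  { role: "assistant", content: "(placeholder reply 1)" },
  { role: "user", content: "How does that differ from an API gateway?" },
  { role: "assistant", content: "(placeholder reply 2)" }
];

const turnThree = buildRequestMessages(
  systemPrompt,
  history,
  "Which should I use for a microservices architecture?"
);

// prints "system > user > assistant > user > assistant > user"
console.log(turnThree.map((message) => message.role).join(" > "));
```

Six messages for one new question. Turn four would carry eight.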

What about tools?

This is the one that trips people up the most, and it is worth being very clear about it.

The AI does not run tools. It cannot. It has no access to your systems, your database, your APIs, or anything else outside of the text you send it. What it can do is ask you to run something on its behalf.

You include a tools definition in your request, which is just a JSON array describing the functions your application exposes. Here is what that looks like:

			
const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4.6",
  messages: [systemPrompt, ...conversationHistory],
  tools: [
    {
      type: "function",
      function: {
        name: "get_customer_order",
        description: "Retrieves the latest order for a given customer ID",
        parameters: {
          type: "object",
          properties: {
            customer_id: {
              type: "string",
              description: "The unique customer identifier"
            }
          },
          required: ["customer_id"]
        }
      }
    }
  ]
});

		

The name, a plain English description of what it does, and the parameters it expects. That is all you need. The model reads those descriptions and, if it decides one is relevant to answering your question, it stops generating a text response and instead returns a structured tool call, effectively saying “please run this function with these arguments and tell me what you get back.”

Your code receives that, executes the actual function, and then sends a new request with the result added to the message history. The model reads the output and continues from there.

The AI is the decision maker. You are the hands. It observes and interprets, but it never touches anything directly. Every tool call goes through your code, which means you control what actually runs, what gets filtered, and what comes back. That is not a limitation. It is the architecture working as intended, and it is why AI tool integrations can be made genuinely safe when they are built thoughtfully.

Same stateless pattern throughout. The history just gains an extra turn containing the tool result.
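Here is a rough sketch of the part your code plays in that round trip, assuming the OpenAI-style response shape where tool calls arrive under `tool_calls` with arguments as a JSON string. The registry, the order data, and the `runToolCalls` helper are all made up for illustration, and the assistant message is hand-written rather than real model output.

```javascript
// Hypothetical local implementations, keyed by the tool name the model sees.
const toolRegistry = {
  get_customer_order: ({ customer_id }) => ({
    customer_id,
    order_id: "ORD-1001", // made-up data for illustration
    status: "shipped"
  })
};

// Turn the model's tool calls into "tool" messages for the next request.
function runToolCalls(assistantMessage, registry) {
  const calls = assistantMessage.tool_calls ?? [];
  return calls.map((call) => {
    const name = call.function.name;
    let result;
    if (!Object.prototype.hasOwnProperty.call(registry, name)) {
      // Anything not in the registry is refused rather than executed.
      result = { error: `Unknown tool: ${name}` };
    } else {
      try {
        // Arguments arrive as a JSON string, not an object.
        result = registry[name](JSON.parse(call.function.arguments));
      } catch (error) {
        result = { error: error.message };
      }
    }
    return {
      role: "tool",
      tool_call_id: call.id,
      content: JSON.stringify(result)
    };
  });
}

// A hand-written stand-in for what the model might return.
const assistantMessage = {
  role: "assistant",
  content: null,
  tool_calls: [
    {
      id: "call_1",
      type: "function",
      function: {
        name: "get_customer_order",
        arguments: JSON.stringify({ customer_id: "C-42" })
      }
    }
  ]
};

const toolMessages = runToolCalls(assistantMessage, toolRegistry);
// The next request carries: earlier history, assistantMessage, toolMessages.
console.log(toolMessages);
```

The registry doubles as an allowlist. If the model asks for a function you never registered, nothing runs and it simply gets an error string back.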

One thing worth knowing: those tool definitions consume tokens just like everything else. Every request you send includes the full list of tools, and you are paying for every token of that description every single time. If you have a large catalogue of tools, be selective about which ones you include for a given context. Sending fifty tool definitions to an AI that only needs two of them is waste you can easily avoid.

This is worth paying attention to if you are using one of the popular agent frameworks like Hermes or OpenClaw. Both ship with extensive tool libraries, and that is genuinely useful. But out of the box, many of these frameworks stuff the entire catalogue into every single request. You end up paying to describe tools for sending emails, querying databases, calling webhooks, and a dozen other things, on every turn, even when your agent is doing something that needs none of them. The framework looks powerful because it has everything. What it is actually doing is making every call heavier and more expensive than it needs to be. Most of them give you the ability to configure which tools load for which context. Use it.
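If your framework does not offer that switch, the filtering is not hard to do yourself. A minimal sketch, assuming you tag each tool with the contexts it belongs to. The catalogue, the tags, and the `selectTools` helper are hypothetical, and the parameter schemas are left empty to keep it short.

```javascript
// Hypothetical helper to keep the catalogue readable.
function defineTool(name, description) {
  return {
    type: "function",
    function: {
      name,
      description,
      parameters: { type: "object", properties: {} }
    }
  };
}

// Each entry pairs a tool definition with the contexts it is useful in.
const toolCatalogue = [
  { tags: ["orders"], definition: defineTool("get_customer_order", "Retrieves the latest order for a given customer ID") },
  { tags: ["email"], definition: defineTool("send_email", "Sends an email to a customer") },
  { tags: ["orders", "billing"], definition: defineTool("issue_refund", "Refunds an order") },
  { tags: ["ops"], definition: defineTool("call_webhook", "Calls a configured webhook") }
];

// Keep only the definitions that match at least one active tag.
function selectTools(catalogue, activeTags) {
  return catalogue
    .filter((entry) => entry.tags.some((tag) => activeTags.includes(tag)))
    .map((entry) => entry.definition);
}

const tools = selectTools(toolCatalogue, ["orders"]);
// Logs the two order-related tool names; the other two are never sent.
console.log(tools.map((tool) => tool.function.name));
```

The result drops straight into the `tools` field of the request.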

And what of files and web search?

When you upload a file to ChatGPT or ask Claude to search the web, it feels like a native capability baked into the model. Sometimes it is, but mostly it is not.

Files are handled by converting the content, whether that is a PDF, a spreadsheet, or a Word document, into text and embedding it directly into the message you send. You are not uploading a file to the model. You are pasting its contents into the context window. For documents like these, the model never sees a file. It sees text.

This is why the world has suddenly fallen back in love with Markdown. Plain text files with just enough structure to be readable. No binary format, no proprietary encoding, no conversion step. You write it, you send it, the model reads it. Straightforward old text files turn out to be the perfect format for an age where everything eventually becomes a prompt.
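In code, "uploading" a text file can be as plain as this sketch. It assumes the text has already been extracted (for a PDF or spreadsheet that conversion step sits outside the example), and `buildFileMessage` and the file name are hypothetical.

```javascript
// Hypothetical helper: wrap already-extracted file text in a user message.
function buildFileMessage(filename, fileText, question) {
  return {
    role: "user",
    content: [
      `Here is the content of ${filename}:`,
      "",
      fileText,
      "",
      question
    ].join("\n")
  };
}

// Stand-in for the contents of a Markdown file read from disk.
const notes = "# Q3 plan\n\n- Migrate auth service\n- Retire legacy queue";

const message = buildFileMessage(
  "q3-plan.md",
  notes,
  "Summarise the risks in this plan."
);

console.log(message.content);
```

The message is an ordinary user turn. Nothing in it tells the model a file was ever involved.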

Web search works the same way, via the tool pattern described above. The model recognizes that your question needs current information, returns a tool call asking for a search to be run, your code or the platform’s infrastructure runs it, and the results come back as text added to the history. The model reads those results and formulates its response.

Most models do not have web search built in at all. The ones that appear to have it are running it as a tool behind the scenes, exactly as described above. When you build your own integration, you wire up your own search if you need it. There is nothing special about it. It is just another function in your tools list.
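As a sketch of how little is involved, here is a search tool definition and the step that turns results into text for the model. The `web_search` name, the result fields, and the example hits are all assumptions; the search backend itself is whatever you choose to call.

```javascript
// Hypothetical tool definition, same shape as any other function tool.
const webSearchTool = {
  type: "function",
  function: {
    name: "web_search",
    description: "Searches the web and returns the top results",
    parameters: {
      type: "object",
      properties: {
        query: { type: "string", description: "The search query" }
      },
      required: ["query"]
    }
  }
};

// Flatten search hits into the plain text the model will read.
function formatSearchResults(toolCallId, results) {
  const text = results
    .map((hit, index) => `${index + 1}. ${hit.title}\n${hit.url}\n${hit.snippet}`)
    .join("\n\n");
  return { role: "tool", tool_call_id: toolCallId, content: text };
}

// Made-up results standing in for a real search backend.
const exampleResults = [
  { title: "Example result", url: "https://example.com/a", snippet: "First snippet." },
  { title: "Another result", url: "https://example.com/b", snippet: "Second snippet." }
];

console.log(formatSearchResults("call_1", exampleResults).content);
```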

The broader point is that everything the AI knows in a given conversation arrived as text in that request. Its training, what it learned before you ever talked to it, and the context window, what you sent it right now, are the only two things it has to work with.

Skills and rules are just more context

The same principle applies to anything else your agent framework injects into a request. Skills, rules, personas, instruction files. Whatever your framework calls them, they all end up in the same place: the context window, consuming tokens before your user has typed a single word.

In Claude Code, for example, your CLAUDE.md file and any skill files under .claude/ get read and inserted into the system prompt at the start of every session. In other frameworks the pattern is identical, just with different file names. The AI reads those instructions the same way it reads everything else. There is nothing magical about them. They are text, and text costs tokens.

A large, sprawling rules file that covers every possible scenario your agent might encounter sounds thorough. What it is in practice is dead weight on every single request, including the ones where most of those rules are completely irrelevant. The discipline of keeping your system prompt lean, and your skill files focused on what is actually needed for the task at hand, is one of the more impactful optimizations you can make. Not just for cost, but for quality. A model given clear, relevant instructions tends to perform better than one buried under pages of caveats and edge cases it does not need right now.
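One way to practise that discipline is to assemble the system prompt per task instead of shipping one monolith. A minimal sketch, with hypothetical skill snippets and task names:

```javascript
const basePrompt = "You are an assistant who specialises in cloud architecture.";

// Hypothetical skill snippets, each tagged with the tasks it applies to.
const skills = [
  { tasks: ["review"], text: "When reviewing designs, list single points of failure first." },
  { tasks: ["costing"], text: "When estimating cost, state the assumptions behind every figure." },
  { tasks: ["review", "migration"], text: "Flag any component that has no rollback path." }
];

// Include only the snippets relevant to the task at hand.
function buildSystemPrompt(base, skillList, task) {
  const relevant = skillList
    .filter((skill) => skill.tasks.includes(task))
    .map((skill) => skill.text);
  return { role: "system", content: [base, ...relevant].join("\n\n") };
}

const lean = buildSystemPrompt(basePrompt, skills, "costing");

// For comparison: the "send everything" version of the same prompt.
const everything = [basePrompt, ...skills.map((skill) => skill.text)].join("\n\n");

console.log(lean.content.length, "characters instead of", everything.length);
```

With three short snippets the difference is small. With a rules file that runs to pages, it is the difference between a focused prompt and one the model has to dig through on every turn.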

Context compacting, or: the bill is growing

Every token you send costs money, and context windows are not infinite. In a long session, you will hit the ceiling, or you will start paying for it in ways that compound fast.

The practical solution is context compacting. Periodically, you summarize the conversation so far, replace the growing message history with that summary, and carry on. The cleanest way to do it is to make a fresh AI call, pass it the history, and ask it to produce a concise summary you can use as a replacement. That summary becomes the new baseline. You lose some granularity, but you keep the thread.

There are also libraries emerging that manage this locally, tracking token counts and deciding when and what to compress. For most implementations though, a thoughtful summarization prompt every N turns does the job.
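A hedged sketch of that approach follows. The token estimate is a crude four-characters-per-token heuristic, not a real tokenizer, and the summarizer is injected so the example runs without a network call; in practice it would be a separate model request. `compactHistory`, its thresholds, and the stub are all hypothetical.

```javascript
// Rough heuristic only: about four characters per token for English text.
function estimateTokens(messages) {
  const characters = messages.reduce(
    (total, message) => total + String(message.content ?? "").length,
    0
  );
  return Math.ceil(characters / 4);
}

// Replace older messages with a summary once the history grows too large.
async function compactHistory(history, summarize, { maxTokens = 2000, keepRecent = 4 } = {}) {
  if (estimateTokens(history) <= maxTokens || history.length <= keepRecent) {
    return history;
  }
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  const summary = await summarize(older);
  return [
    { role: "user", content: `Summary of the conversation so far:\n${summary}` },
    ...recent
  ];
}

// Stub summarizer standing in for a real model call.
const stubSummarize = async (messages) =>
  `${messages.length} earlier messages (stub summary)`;

// Ten alternating messages, padded so they exceed the small budget below.
const longHistory = Array.from({ length: 10 }, (_, index) => ({
  role: index % 2 === 0 ? "user" : "assistant",
  content: `Message ${index + 1}: ` + "x".repeat(100)
}));

compactHistory(longHistory, stubSummarize, { maxTokens: 100, keepRecent: 4 })
  .then((compacted) => console.log(compacted.length, "messages after compacting"));
```

Two design choices are worth checking against your provider. The summary goes in as a user message, which some APIs may merge with the user message that follows it. And if your history contains tool calls, pick the cut point so a tool call and its result are never separated.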

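One other thing worth knowing. Some providers, Anthropic with Claude being the most prominent example, support prompt caching. If the start of your request, typically the system prompt, matches a prefix the provider has processed recently, you are charged a reduced rate for processing it again.

For production systems with a large, stable system prompt, this is a meaningful saving. How much work it takes depends on the provider. Some apply caching automatically to a repeated prefix, while others, Anthropic's API among them, have required you to mark the cacheable portion of the request explicitly, so check the documentation for the model you are using. Either way, keep the system prompt identical between requests, because any change to the prefix means a cache miss.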
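Where a provider wants the cacheable part marked explicitly, the request might look like the sketch below. It assumes the OpenRouter-style body for an Anthropic model, in which a text content part carries a `cache_control` marker. Treat the field names as an assumption to verify against current documentation; `buildCachedRequest` is a hypothetical helper.

```javascript
// Assumption: a content part can carry a cache_control marker telling the
// provider that everything up to this point is a reusable prefix.
function buildCachedRequest(model, systemText, history) {
  return {
    model,
    messages: [
      {
        role: "system",
        content: [
          {
            type: "text",
            text: systemText,
            cache_control: { type: "ephemeral" }
          }
        ]
      },
      ...history
    ]
  };
}

const body = buildCachedRequest(
  "anthropic/claude-sonnet-4.6",
  "You are an assistant who specialises in cloud architecture.",
  [{ role: "user", content: "What is a load balancer?" }]
);

console.log(JSON.stringify(body, null, 2));
```

Caching generally only pays off for long prefixes, so a one-line system prompt like this one is purely illustrative.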

Democratizing AI calls

Building directly against one AI vendor locks you into their API, their pricing, their availability, and their model roster. When a better model ships, or your costs spike, or a provider has an outage, you are stuck.

OpenRouter normalizes all of that away. One API key, one endpoint, access to models from Anthropic, OpenAI, Google, Meta, Mistral, and dozens of others. Swap the model string and you are done. They handle the authentication, the billing normalization, and the quirks that different providers handle differently.

The code example above runs against OpenRouter. Changing it to run against a different model is this:

			
model: "openai/gpt-4o"
// or
model: "google/gemini-2.5-pro-preview"
// or
model: "meta-llama/llama-3.3-70b-instruct"

		

One line. That is the entire migration.
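Because the model is just a string, handling an outage can be a loop over a list of them. A minimal sketch, where `send` is injected (in practice it would wrap the chat completions call shown earlier) and the stub simulates the first provider being down. `completeWithFallback` is a hypothetical helper, not an OpenRouter feature.

```javascript
// Try each model in order and return the first successful reply.
async function completeWithFallback(models, send) {
  const errors = [];
  for (const model of models) {
    try {
      return { model, reply: await send(model) };
    } catch (error) {
      errors.push(`${model}: ${error.message}`);
    }
  }
  throw new Error(`All models failed:\n${errors.join("\n")}`);
}

// Stub sender that pretends the first provider is down.
const stubSend = async (model) => {
  if (model === "anthropic/claude-sonnet-4.6") {
    throw new Error("simulated outage");
  }
  return `reply from ${model}`;
};

completeWithFallback(["anthropic/claude-sonnet-4.6", "openai/gpt-4o"], stubSend)
  .then((result) => console.log(result.model)); // prints "openai/gpt-4o"
```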

I think OpenRouter is one of the more quietly valuable AI companies out there right now. Not because they are doing anything academically impressive, but because they are solving a real logistics problem that every team embedding AI into a product faces. They are the exchange layer. The market makers. The bit that stops you having to care about the plumbing. I am a huge fan of OpenRouter and have been building a lot with it of late for clients.

The point of all this

Adding AI to a product, at the technical level, is an HTTP POST request with a JSON body. The system prompt is a string. The conversation is an array. The model is a parameter you can swap at will.

There is no AI infrastructure to set up. No specialist hardware. No complex SDK with a six-week learning curve. A junior developer and an OpenRouter account can have a working AI integration in an afternoon.

Which means the technical barrier is effectively gone. That question is answered.

If you have been feeling like you are falling behind, like there is some AI moat forming between the companies who get this and the ones who do not, stop worrying. That moat has gone dry. The playing field here is about as level as it gets.

The question you actually need to spend time on is whether it makes sense. Not can you add AI to this feature, but should you. Does it genuinely improve the experience, or does it add latency and cost to something that a simple dropdown would handle better? Does it give the user something they could not otherwise do, or does it just make the product feel current?

The complexity is not in the plumbing. It never was. It is in knowing what to ask, and what to do with the answer. That applies to the AI, and it applies to the decision to use it in the first place.

The rest, as they say, is just JSON.

AI Disclaimer: Gemini Nano Banana Pro was used to generate the images, which are based on the 1939 film The Wizard of Oz and the 1975 film Monty Python and the Holy Grail.
